SPSS_text
SPSS was developed by Norman H. Nie, Dale H. Bent, and C. Hadlai Hull in 1968. Originally SPSS stood
for “Statistical Package for the Social Sciences.” It was acquired by IBM in 2009 and is now known
officially as IBM SPSS Statistics. We’ll refer to it as SPSS in this tutorial. New versions of SPSS are
introduced annually or biannually. The current version is version 26 which was released in 2019.
There are several options for obtaining SPSS. It can be purchased for $99 monthly. There are two less
expensive options for students and faculty. A faculty pack is available for $260 annually and a GradPack
(for students) is available for anywhere between $39.95 and $89 for a period of six or twelve months.
For more information on purchasing SPSS, click here.
Many colleges and universities have site licenses for SPSS which means that you can use SPSS for free in
their campus labs. Faculty can get SPSS installed on their office computer and, on most campuses, on
their home computer. Students who don’t have an individual license will have to use SPSS in campus
computer labs.
Using SPSS
SPSS can be used in two ways. It can be accessed through a series of menus in an interactive manner
and it can also be used in a syntax mode where you write commands. While the syntax mode is more
powerful, the menu mode is much simpler and easier to learn. We’re going to use the menu mode in
this tutorial.
Menu Bar
The best way to get an idea of what SPSS is like is by looking at the menu bar (see Figure 1-1).
Figure 1-1
Clicking on any of these options will open a drop-down menu and offer you a further series of options.
We’ll be discussing some of them (but not all) in this tutorial.
FILE
o Opening a new data or output file.
o Opening an existing data or output file.
o Importing data from other programs such as Excel, SAS, and Stata.
o Saving a data or output file.
o Exporting a data file into a file that other statistical packages can read.
EDIT
o There are a number of options under EDIT including inserting new variables and new
cases into a data file and searching a data file.
o Of particular interest is OPTIONS where you can change the way SPSS does things. One
very useful option allows you to change the way the list of variables is presented.
They can be listed using the variable labels or the variable names.
They can be listed alphabetically or in the order they occur on the data file.
We’ll have more to say about this later in Chapter 3.
VIEW
o Here you can turn on and turn off the various bars that show on your screen.
o One option worth mentioning is VALUE LABELS. Checking the box for this option tells
SPSS to display the value labels instead of the values. Unchecking this box will show
the values themselves.
DATA
o There are many options under DATA including sorting the variables or the cases in
various ways.
o Of particular interest are the SELECT CASES and WEIGHT options. SELECT CASES allows
you to select particular cases such as only males or only females for further analysis.
We’ll return to this option in Chapter 3.
o WEIGHT is particularly useful when using sample data where you want to make the
data more representative of the population from which the sample was selected. We’ll
use WEIGHT a lot and return to this option in Chapter 3.
TRANSFORM
o TRANSFORM will change your data in various ways. We’ll discuss them more fully in
Chapter 3. Briefly:
o RECODE can be used to combine particular values in a variable. For example, if age is
coded into years such as 18, 19, and so on, you might want to combine these values
into categories such as 18 to 29, 30 to 49, 50 to 69, and 70 and over.
o COMPUTE VARIABLE is used to create new variables out of existing variables. For
example, if you have two variables – political party identification and political ideology
(liberal, moderate, conservative) – you could create a new variable that has categories
such as conservative Republican, moderate-liberal Republican, moderate, moderate-
conservative Democrat, and liberal Democrat.
o COUNT VALUES WITHIN CASES allows you to count the number of times that particular
values occur within a set of variables. For example, if you had seven questions that
asked if a person thought that abortion should be legal under various scenarios,
COUNT can be used to create a new variable that tells you how many times each
respondent thought abortion should be legal or not legal.
o SELECT CASES will select out groups that you want to analyze separately from the rest
of the cases. For example, you might want to focus on young males.
ANALYZE
o ANALYZE is where you go to carry out your statistical analysis.
o We’ll cover some of these statistical procedures in Chapters 4 through 8.
GRAPHS
o SPSS has several ways of creating graphs and charts. We’re going to use CHART
BUILDER.
o We’ll talk about pie charts, bar graphs, histograms, and boxplots in Chapter 4 and again
in Chapter 9, and about scatterplots in Chapter 7.
UTILITIES
o The only option under UTILITIES that we’re going to discuss is VARIABLES which is like a
mini codebook for the variables in our data file.
EXTENSIONS – We’re going to skip this option.
WINDOW
o WINDOW shows the different windows that SPSS opens.
o The two that are relevant for us are the window containing the data file and the output
window where SPSS shows you the results of your statistical analysis.
o All you have to do is click on the window that you want to view and SPSS will show it.
HELP
o As the name implies, HELP is where you go to get help in using SPSS.
o There’s even a Statistics Coach.
Besides the Help menu, there are other ways to get more information.
The Social Science Research and Instruction Council’s website has a section that contains links to
other instructional sites for SPSS as well as other topics.
YouTube has a number of instructional videos on using SPSS. Be warned that some are better
than others.
You can also use Google to find other sources of information on SPSS.
Chapter 2 – Creating your own SPSS data file and opening data files in various formats (e.g.,
Excel, SAS, Stata, text).
Chapter 3 – Transforming data (recode, compute, count, if, select cases)
Chapter 4 – Describing data (frequency distributions, measures of central tendency, measures of
dispersion, skewness, kurtosis)
Chapter 5 – Two-variable crosstabulation including Chi Square and measures of association
Chapter 6 – Comparing means (independent-sample t test, paired or dependent-samples t test,
one-way analysis of variance)
Chapter 7 – Correlation and regression with two variables
Chapter 8 – Multivariate analysis (crosstabulation, correlation, and regression) using more than
two variables
Chapter 9 – Presenting your results including charts and tables
Chapter 10 – Writing research reports
Next Chapter
In Chapter 2 we’ll discuss how to create your own SPSS data file and how to open data files in other
formats such as Excel, SAS, and Stata, as well as text files.
You’ll need to have SPSS installed on your computer to complete the rest of this tutorial. If you’re a
student or faculty member at a college or university that has a site license for SPSS, you’ll be able to use SPSS for
free at one of their computer labs. If you want to purchase SPSS, look back at Chapter 1 for information.
Once you have SPSS installed and opened on your computer, the first thing you need to do is to either
create your own data file or open an existing data file. Let’s start by assuming that you have data that
you have collected and want to analyze in SPSS. So, you need to create your own data file. As an
example, let’s assume your data includes the following variables.
Case identification number (i.e., a unique identification number for each case in your data) (id)
Support or oppose same-sex marriage (same_sex_marriage)
o 1 = Support same-sex marriage
o 2 = Oppose same-sex marriage
o 3 = Undecided
o 9 = Don’t know or refuse to answer
Political party preference (partyid)
o 1 = Democrat
o 2 = Republican
o 3 = Independent
o 4 = Other party
o 9 = Don’t know or refuse to answer
Political views (polviews)
o 1 = Conservative
o 2 = Middle-of-the-road
o 3 = Liberal
o 9 = Don’t know or refuse to answer
Age in years (age)
o 98 = 98 or older
o 99 = Don’t know or refuse to answer
Subjective social class (class)
o 1 = Upper
o 2 = Middle
o 3 = Lower
o 9 = Don’t know or refuse to answer
Gender (gender)
o 1 = Male
o 2 = Female
o 9 = Refuse to answer
Education (educ)
o 1 = Less than high school
o 2 = High school degree
o 3 = Some college
o 4 = Bachelor’s degree
o 5 = Some postgraduate work
o 6 = Postgraduate degree
o 9 = Don’t know or refuse to answer
Notice that all these variables are numeric. While SPSS can handle string variables, most of the time we
use numeric variables. String variables may contain letters, numbers, and special characters while
numeric variables contain only numbers. We’ll only consider numeric variables in this tutorial.
Notice also that we always allow for missing data. For various reasons, some information may not be
available. In the case of a survey, this may be because respondents don’t know the answer or it may be
that they don’t want to answer. We’ll have more to say later about how SPSS handles missing data, but
for now keep in mind that we always have one or more codes to account for missing data.
Before we enter the data, we’re going to give each variable a name and, in most instances, a label. Each
variable also has codes to account for the different ways respondents answer the questions. These are
called values. For example, political views has four values – 1, 2, 3, and 9. Each value can be assigned a
value label. Value 1 could be assigned the value label “conservative” and value 2 the label “middle-of-
the-road.” Most variables will have one or more codes for missing data. For political views, 9 is our
missing value code and could be assigned the label “don’t know or refused”.
Variable Names
Each variable must have a unique name. Names can be as long as 64 characters but it’s advisable to use
relatively short names. Here are some simple rules to follow in naming your variables.
Look back at the list of variables in the example on the first page of this chapter. Possible variable
names are in parentheses. Note that the parentheses are not part of the variable name.
Variable Labels
Variable names are typically short and sometimes don’t supply much information about the variable.
Sometimes users use variable names like q1 or var1. To make the nature of the variable clearer, you can
create a variable label that can be up to 256 characters. In our example, the variable named partyid
could be given the label “political party preference”. Variable labels can contain letters (lower or upper
case), numbers, special characters, and blank spaces. Variable labels are optional.
Value Labels
Values are the numbers that you use to represent different characteristics of the case. In our example,
for the variable partyid, the values are 1, 2, 3, 4, and 9. For the variable gender, the values are 1, 2, and
9.
To tell the user what these values stand for we could give each value an extended value label. For the
variable partyid, 1 could be given the label “Democrat” and 2 the label “Republican.” Value labels can
contain letters (lower and upper case), numbers, special characters, and blank spaces. Value labels are
optional.
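To make the relationship between names, variable labels, and value labels concrete, here is a small sketch written in Python rather than SPSS. It is only an illustration of the idea: the names CODEBOOK and describe are hypothetical, and the codes come from the partyid example above.

```python
# Illustration only (Python, not SPSS): a tiny codebook for one variable.
# CODEBOOK and describe are hypothetical names chosen for this sketch.
CODEBOOK = {
    "partyid": {
        "label": "political party preference",
        "values": {
            1: "Democrat",
            2: "Republican",
            3: "Independent",
            4: "Other party",
            9: "Don't know or refuse to answer",
        },
    },
}


def describe(variable, value):
    """Return 'variable label: value label' for one stored value."""
    entry = CODEBOOK[variable]
    # Value labels are optional, so fall back to the bare value if none exists.
    value_label = entry["values"].get(value, value)
    return f"{entry['label']}: {value_label}"


print(describe("partyid", 1))  # political party preference: Democrat
```

The stored data stay numeric; the labels are extra information that makes the output easier to read.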
Missing Values
Sometimes the information for a particular variable is unavailable. This can be for a number of reasons.
If the cases are respondents to a survey, the respondent may not know how to respond to a particular
question. If the question asks for the respondents’ yearly family income, they may not know their
income. Another possibility is that they don’t want to tell us their income. If the cases are geographical
areas such as counties or states, a particular piece of information might be unavailable. If the variable
describes the violent crime rate of the area, the information might be unavailable for various reasons.
There are two types of missing values -- user-defined missing values and system missing values. In our
example, 9 is the missing value for partyid and 99 is the missing value for age. There can also be more
than one missing value. For example, we might want to use 8 for don’t know and 9 for refused. SPSS
limits you to three missing value specifications.
Sometimes the user may use a blank space for missing information. SPSS automatically treats blank
spaces as missing values. This is referred to as a system missing value. There are other examples of
system missing values that we will discuss later.
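The difference between the two kinds of missing values can be sketched in a few lines of Python. This is an illustration of the logic, not of how SPSS is implemented; USER_MISSING and classify are hypothetical names, and the missing codes are the ones from our example.

```python
# Illustration only (Python, not SPSS): user-defined vs. system-missing values.
# USER_MISSING and classify are hypothetical names chosen for this sketch.
USER_MISSING = {"partyid": {9}, "age": {99}}


def classify(variable, raw):
    """Return 'system-missing', 'user-missing', or 'valid' for one raw entry."""
    # A blank entry carries no value at all, so it counts as system-missing.
    if raw is None or str(raw).strip() == "":
        return "system-missing"
    # A code that the user has declared as missing for this variable.
    if int(raw) in USER_MISSING.get(variable, set()):
        return "user-missing"
    return "valid"


print(classify("partyid", 2))  # valid
print(classify("partyid", 9))  # user-missing
print(classify("age", " "))    # system-missing
```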
Think of the variable names, variable labels, value labels, and missing values as information that defines
your data. How do we actually enter that information into SPSS?
SPSS opens in one of two views – data view or variable view. Data view displays the values or the value
labels for the cases and the variables in your data set. To tell SPSS to display the values or the value
labels, click on VIEW in the menu bar and check or uncheck the VALUE LABELS box. We’ll display the
values in this tutorial. Variable view displays the information that defines your data. Click on the
VARIABLE VIEW tab at the bottom of the screen. Figure 2-1 shows you what VARIABLE VIEW looks like.
Figure 2-1
Enter the variable names for all eight variables in the example at the beginning of this chapter
into the NAME box. Once you enter the variable name SPSS will enter the default values into
some of the remaining cells. These defaults can be edited. After you have entered the variable
name, press the ENTER key to move down to the next variable.
Enter the variable labels into the LABEL box and press ENTER to go to the next variable label.
Enter the value labels into the VALUE LABELS box. Click in the far right-hand part of the box and
a dialog box will pop up. Enter the value in the VALUE box and then enter the value label in the
VALUE LABEL box. If you want to make a change, click on the label you want to change, make
your change, and then click on APPLY. If you want to delete the label, click on REMOVE. When
you are done, click on OK.
Enter the missing values into the MISSING VALUES box. Click in the far right-hand part of the
box and a dialog box will pop up. You can enter up to three different values (e.g., 9, 99) or you
can enter one range of values and one value. Notice that the default is no missing values. When
you are done, click on OK.
There’s one other box that deserves our attention – the MEASURE box. Click anywhere in that
cell. Now click on the drop-down arrow and you will have three choices – scale, ordinal, and
nominal. Notice that scale is the default.
o Scale refers to a continuous variable such as age. In a continuous variable, the values
have the properties of real numbers. They can be added, subtracted, divided, and
multiplied like real numbers.
o Ordinal refers to categories that have an inherent order to them. Some categories are
higher or lower than other categories. But you can’t treat them like real numbers. All
you can say is that some categories are higher and others are lower. For example, think
of social class. Upper class is higher than middle class and middle class is higher than
lower class. We can use numbers to represent these different categories. But we can’t
carry out mathematical operations such as addition and subtraction with them.
o Nominal refers to categories that have no inherent order to them. For example, political
party preference has four categories – Democrat, Republican, Independent, other. We
can’t say that one category is higher or lower than another category. All we can do is
say they are different.
o Treat dichotomies as ordinal.
o Enter the type of measure for each variable.
You can probably use the default values for the other columns in Figure 2-1. You might want to
change the decimal value. The default value is two for all variables. If your values are integers,
you might want to change the decimal value to zero.
Once you have filled in all the cells in Figure 2-1, your matrix should look like Figure 2-2.
Figure 2-2
Now that you have defined all your variables, it’s time to start entering the data values for each case.
One way to do this is to enter them in SPSS. Click on the DATA VIEW tab at the bottom of your screen.
Notice that SPSS has filled in the variable names at the top. The variables are in the columns of the
matrix and the cases are in the rows. Since this is a hypothetical data set, make up values for four cases
and enter them into the matrix. Make sure that you are entering values that are within the ranges
specified in the codebook. Include some missing values as well.
All that is left now is to save the data file. Click on FILE and then on SAVE AS. Browse to where you want
to save the file on your computer and enter the file name toward the bottom of the screen. Press enter
and SPSS has saved your data file. Open your file manager and make sure you saved it where you want
to store it. Now close SPSS. SPSS will save your file as a .sav file.
Here’s how you can open the data file in SPSS that you just created.
Some users prefer to enter their data in Microsoft Excel instead of SPSS. To do this, open Excel on your
computer. Use the first row on your spreadsheet for the variable names. Then starting with row 2,
enter the values for each case. Once you have entered the values for all the cases, save your Excel file
wherever you want to store it on your computer. Later in this chapter we’ll show you how to open an
Excel file in SPSS. The variable names will be entered in the NAME column on the VARIABLE VIEW tab.
You’ll have to enter the other data definitions yourself.
Another possibility is that you already have a data file that has been saved as an SPSS data file.[1] SPSS has
two different types of data files -- .sav and .por.[2] Here’s how to open a .sav file.
Open SPSS.
Click on FILE in the menu bar at the top of your screen.
Click on OPEN.
Click on DATA in the pop-up menu to indicate that you want to open a data file.
Browse to where your .sav file is located.
Double click on the file name.
Here’s how to open a .por file.
Open SPSS.
Click on FILE in the menu bar at the top of your screen.
Click on OPEN.
Click on DATA in the pop-up menu to indicate that you want to open a data file.
Browse to where your .por file is located.
Click on the dropdown arrow in the FILES OF TYPE: box and select PORTABLE (.POR) file. Note
that .sav is the default option. That’s why you have to change it to .por.
Double click on the file name.
Still another possibility is that you have a data file that was saved as an Excel file. Here’s how you open
an Excel file.
[1] There are many data archives where you will find data files in SPSS format. There are membership consortiums
such as the Inter-university Consortium for Political and Social Research (ICPSR) and the Roper Center for Public
Opinion Research. There are other data archives that you don’t have to join such as the Pew Research Center and
the Public Policy Institute of California. For an extensive list of data archives, click here.
[2] There are two types of data files in SPSS -- .sav and .por files. Portable (.por) files are often used when you send a
data file to someone else. Save (.sav) files are typically used when you are working with your data file. Files that
you download from the ICPSR will typically be .sav files and files that you download from Roper will be .por files so
you need to know how to open both types of files.
When you click on the downward pointing arrow at the far right of the FILES OF TYPE: box you will see a
list of other types of data files that you can open in SPSS. Included in that list are SAS and Stata files
which are two commonly used statistical packages. SPSS will also open text files with the following
extensions: .txt, .dat, .csv, .tab. When you tell SPSS to open any of these text files, SPSS will open a TEXT
IMPORT WIZARD which will guide you through importing your file into SPSS.
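If you have never looked inside a text data file, the following Python sketch shows the layout of a small .csv file and how each row becomes one case. It is an illustration only and has nothing to do with how the wizard itself works. We are assuming the file has the variable names in the first row, as in the Excel layout described earlier, and the data values are made up.

```python
# Illustration only (Python, not SPSS): what a small .csv data file contains.
import csv
import io

# Made-up contents of a .csv file: variable names first, then one case per row.
raw_text = "id,partyid,age\n1,2,34\n2,9,99\n"

# DictReader uses the first row as variable names and returns one dict per case.
cases = list(csv.DictReader(io.StringIO(raw_text)))

for case in cases:
    print(case["id"], case["partyid"], case["age"])
```

Everything read from a text file starts out as text, which is why an import step has to decide whether each variable is numeric or string.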
Next Chapter
Chapter 3 will discuss various ways that you can transform or change your data. These include WEIGHT
to weight your cases, RECODE to combine categories of your variables, COMPUTE and IF to create new
variables out of existing variables, and COUNT to count the number of times that respondents select
particular responses from a list of variables.
Covered in this chapter are recode, compute, if, count, and weight.
Recoding Variables
Recoding is a way of combining the values of a variable into fewer categories. Let’s say you have
conducted a survey and one of the questions in your survey was the age of the respondent. Entering the
actual age in years would be the simplest way of recording the data. But what if you wanted to compare
people of different age categories? Using SPSS, you could reorganize the data into categories such as
younger, middle age, and older. There are two things you need to know before you recode the values.
First, you need to decide the number of categories you want to end up with. Generally, this will be
determined by the way you plan to use the information. If you are going to analyze the data using a
table where you crosstabulate two variables (see Chapter 5), you probably want to limit the number of
new categories to three or four. The second thing you need to know is which of the old values are going
to be combined into new categories. For example, you might do something like this.
[Table with two columns: the actual age of the respondent as originally recorded in the data file | the new, collapsed category]
Another example might be if respondents were asked how often they prayed, and the original responses
were several times a day, once a day, several times a week, once a week, less than once a week, or
never. With recode we can combine the people who said, “several times a day” with the people who
said “once a day” and put these respondents into a new category which we could call “often.” Similarly,
we could combine the people who said “several times a week” with those who said “once a week” and
call this category “sometimes” and combine those who said “less than once a week” and “never” and
call this category “infrequently.” Recoding is the process in SPSS that will carry out the above examples.
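The prayer example amounts to a lookup from old categories to new ones. Here is a minimal sketch of that idea in Python, not SPSS. PRAY_RECODE is a hypothetical name, and the three sample responses are made up.

```python
# Illustration only (Python, not SPSS): collapsing six categories into three.
# PRAY_RECODE is a hypothetical name chosen for this sketch.
PRAY_RECODE = {
    "several times a day": "often",
    "once a day": "often",
    "several times a week": "sometimes",
    "once a week": "sometimes",
    "less than once a week": "infrequently",
    "never": "infrequently",
}

# Made-up answers from three respondents.
responses = ["once a day", "never", "several times a week"]

# Look up the new category for each original answer.
recoded = [PRAY_RECODE[answer] for answer in responses]
print(recoded)  # ['often', 'infrequently', 'sometimes']
```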
Start SPSS and open the data file named GSS18A. We’re going to recode the variable called age, which
is, of course, the respondent's age.
Figure 3-1
Click on TRANSFORM in the menu bar. Now we have two options: RECODE INTO DIFFERENT VARIABLES and RECODE INTO SAME
VARIABLES. It is strongly suggested that the beginning student only use the RECODE INTO DIFFERENT VARIABLES
option. If you make an error, your original variable is still in the file and you can try again. If you make
an error using RECODE INTO SAME VARIABLES, you have changed the original variable. If you also
saved the file after doing this, and you did not have another copy of the file, you have just eliminated
any chance of correcting your error.
Recoding into a different variable starts with giving the new variable a name. For example, if we recode
into different variables, we could combine ages into one set of categories and call this new variable
age1 and then recode ages into a different set of categories called age2. To do that, click on RECODE
INTO DIFFERENT VARIABLES. Your screen will look like Figure 3-2. If SPSS displays the variable labels
instead of the variable names, click on EDIT in the menu bar and then on OPTIONS. Click on DISPLAY
NAMES and on SORT BY NAME. Now it will display the variable names in alphabetical order. This
will make it easier to find variables in the list.
Figure 3-2
Find age in the list of variables on the left and click on it to highlight it, and then click on the arrow just
to the left of the big box in the middle of the window. This will move age into the list of variables to
recode. Notice that when the arrow points to the right, it moves the variable from the list on the left to
the list on the right. When it points to the left, it moves the variable from the list on the right to the list
on the left.
You want to give a name to this new variable so click in the NAME box under OUTPUT VARIABLE and
type the name age1 in this box. You can even type a variable label for this new variable in the LABEL
box just below the NAME box. Try typing “Age in Four Categories” as your label. Click on the CHANGE
button to tell SPSS to make these changes. Your screen will look like Figure 3-3.
Figure 3-3
Now we have to tell SPSS how to create these categories. Click on the OLD AND NEW VALUES button at
the bottom of the window. Your screen will look like Figure 3-4.
Figure 3-4
There are several options. You can change a particular value into a new value by entering the value to
be changed into the OLD VALUE box and the new value into the NEW VALUE box and then clicking on
ADD. You can also change a range of values into a new value. For example, you could change 18 thru 35
into value 1. (The next paragraph tells you how to do this.) There are also other options.[3]
Click on the fourth bubble from the top labeled RANGE. Notice how it marks your choice by filling in the
bubble. Then type “18” (the youngest age in the data set) in the box above THROUGH, click on the box
below THROUGH, and type “29” in that box. Then click on VALUE just below NEW VALUE and type “1” in
that box. This will tell SPSS to combine all ages from 18 through 29 into a single category and give it the
value of 1. Then click on ADD.
Repeat this process for the other categories. Click on the box under RANGE and type “30” in the box
above THROUGH, click on the box below THROUGH, and type “49” in that box. Click on VALUE just
below NEW VALUE and type “2” in that box and click on ADD. Do the same thing for the category 50 to
69 (give this a new value of “3”) and the category 70 to 89 (the largest age in the data set). Give this last
category a new value of “4”. Your screen should look like Figure 3-5.
[3] For example, you can work with what SPSS calls “system-missing” values. All blanks will automatically be
changed to system-missing values. You can change these system-missing values into another value, or you can
change both the system-missing values and the missing values that you define into still another value.
Figure 3-5
To change one of your categories, highlight the category in the OLD->NEW box that you want to change,
make whatever changes you want to make, and then click on CHANGE. The new category should appear
in the OLD->NEW box. To remove a category, highlight it and click on REMOVE.
Now we want SPSS to carry out the recoding. Click on CONTINUE at the bottom of the window. This will
take you back to the RECODE INTO DIFFERENT VARIABLES box. Click on OK and SPSS will carry out your
commands. SPSS will show you the command it just executed in the Syntax window.
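The recoding rules we just entered can be written out as a short Python sketch. This shows the logic only, not the command SPSS runs. AGE1_RANGES and recode_age are hypothetical names, and None stands in for a system-missing value.

```python
# Illustration only (Python, not SPSS): recoding age into the four age1 categories.
# Each tuple is (lowest old value, highest old value, new value).
AGE1_RANGES = [(18, 29, 1), (30, 49, 2), (50, 69, 3), (70, 89, 4)]


def recode_age(age):
    """Return the age1 category, or None if no range covers the old value."""
    if age is None:
        return None
    for low, high, new_value in AGE1_RANGES:
        # Both ends of the range are included, like "18 through 29".
        if low <= age <= high:
            return new_value
    # Values outside every range (such as a missing code of 99) are not recoded.
    return None


ages = [18, 29, 30, 55, 89, 99]  # made-up values
print([recode_age(a) for a in ages])  # [1, 1, 2, 3, 4, None]
```

Because the result goes into a new variable, the original ages are untouched and the rules can be rerun with different cut points.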
Click on ANALYZE, then point your mouse at DESCRIPTIVE STATISTICS, and then click on FREQUENCIES.
Notice that age1 has appeared in the list of variables on the left. Click on it to highlight it and click on
the arrow to move it to the VARIABLES box. Then click on OK. An output window will open. Your screen
will look like Figure 3-6.
Figure 3-6
Let's take a look at the data matrix. Click on WINDOW in the menu bar and you will see a list of all the
windows you have opened. One of these windows will be called GSS18A – IBM SPSS STATISTICS DATA
EDITOR. Click on that line and you should see the data matrix window on your screen. Use the scroll bar
in the lower-right part of the window to scroll to the right until you see the column titled age1. (It will
be the last column in the matrix.) This is the new variable you just created. Your screen should look like
Figure 3-7.
Figure 3-7
If you want the output to give you more information about what each category means, you need to
insert value labels. To do this, point your mouse at the variable name at the top of the column (age1)
and double click. This will open the VARIABLE VIEW tab in the DATA EDITOR. Now you’re going to enter
labels for the values in the recoded variable using what you learned in Chapter 2.
Click in the VALUES box and you will see a small blue button in the right-hand side of the box. Point your
mouse at this button and click. This will open the VALUE LABELS box. You will see two more boxes,
VALUE and LABEL. Click in the VALUE box and type the value “1”. Then click in the LABEL box and type
the label for the first category, “under 30”. Then click on ADD and the new label will appear in another
box just to the right of the ADD button. Then click in the VALUE box and type the value “2” and type the
label for the second category, “30 to 49”, and click on ADD. Do this for values “3” and “4.” If you make
a mistake you can use the CHANGE and REMOVE buttons, which work the same way we just described.
Your screen should look like Figure 3-8.
Figure 3-8
Click on OK. Now click on ANALYZE, point your mouse at DESCRIPTIVE STATISTICS, and then click on
FREQUENCIES and rerun the frequency distribution for age1. This time it should have the value labels
you just entered on the output. Your screen should look like Figure 3-9.
Figure 3-9
We said that recoding into different variables allowed you to recode a variable in more than one way.
Let's recode age again, but this time let's recode age into three categories – 18 through 34, 35 to 59, and
60 and over. Call this new variable age2. Retracing the steps you used to create age1, recode age into
age2.
Be sure to click on RESET in the RECODE INTO DIFFERENT VARIABLES box to get rid of the recoding
instructions for age1. When you are done, do a frequency distribution for age2.
There are two more important points to discuss. Look back at Figure 3-4. It shows the RECODE INTO
DIFFERENT VARIABLES: OLD AND NEW VALUES box. There are three options in the OLD VALUE box that
we haven't discussed. Two are different ways of entering ranges. You can enter the lowest value of the
variable through some particular value and you can enter some particular value through the highest
value of the variable. Make sure that you do not include your missing values in these ranges, or your
missing values will become part of that category. For example, if 99 is the missing value for age, then
recoding 70 through highest would include the missing values with the oldest age category. This is
probably not what you want to do. So be careful.
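The danger of an open-ended range is easy to see in a small Python sketch. This follows the behavior described in the paragraph above and is not SPSS itself. Both function names are hypothetical, and 99 is the missing code from the example.

```python
# Illustration only (Python, not SPSS): why "70 through highest" can go wrong.
def recode_open_ended(age):
    """Careless rule: every value of 70 or more becomes category 4."""
    if age is not None and age >= 70:
        return 4
    return None


def recode_excluding_missing(age, missing_codes=(99,)):
    """Safer rule: set the missing codes aside before applying the range."""
    if age is None or age in missing_codes:
        return None
    if age >= 70:
        return 4
    return None


print(recode_open_ended(99))         # 4, the missing code joins the oldest group
print(recode_excluding_missing(99))  # None, the missing code stays missing
print(recode_excluding_missing(75))  # 4
```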
Here is another important point. What happens if you don't recode a particular value? Any value that
is not recoded is changed into a system-missing value. If you want to leave the other values in their
original form, then click on ALL OTHER VALUES in the OLD VALUE box and click on COPY OLD VALUE in
the NEW VALUE box and click on ADD.
Now we are going to recode and have the recoded variable replace the old variable. This means that we
will not create a new variable. We will replace the old variable with the recoded variable, but remember
the warning given you earlier in this chapter. Click on TRANSFORM and then click on RECODE INTO
SAME VARIABLES. Let's recode the variable called pray. Find pray on the list of variables on the left,
click on it to highlight it, and then click on the arrow to the left of the VARIABLE box. This will move the
variable pray into the big box in the middle of the window. Click on the OLD AND NEW VALUES button.
This will open the RECODE INTO SAME VARIABLES: OLD AND NEW VALUES box. Your screen should look
like Figure 3-10.
Figure 3-10
This looks very much like the box you just used (see Figure 3-4). Combine the values 1 and 2 by clicking
on the fourth circle from the top under OLD VALUE and entering “1” in the box above THROUGH and “2”
in the box below THROUGH and then entering “1” in the NEW VALUE box and then clicking on ADD.
Now combine values 3 and 4 into a category called “2”. Then combine values 5 and 6 into a third
category called “3”. Click on CONTINUE and then on OK. Since this is not a new variable, it will still be
called pray.
You will want to change the value labels. Find the variable pray in DATA VIEW by scrolling to that
variable. Point your mouse at the variable name (pray) and double click. This will open the VARIABLE
VIEW tab in the DATA EDITOR. Click in the VALUES box and then click on the small blue box and make
the changes in the labels. You will have to use the CHANGE and REMOVE buttons to do this. Follow the
instructions we just went through for recoding into different variables. When you finish, click on
ANALYZE, then point your mouse at DESCRIPTIVE STATISTICS, then click on FREQUENCIES and move pray
over to the VARIABLES box and click on OK. Your screen should look like Figure 3-11.
Figure 3-11
When you recode into the same variable, a value that is not recoded stays the same as it was in the
original variable. If we had decided to keep “never” (value 6) as a separate category, we could have left
it alone and it would have stayed a 6. Or we could have changed it to another value such as 4. This is an
important difference between recoding into the same and different variables.
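A minimal Python sketch of recoding into the same variable, for comparison. The mapping for pray is the one used above (1 and 2 become 1, 3 and 4 become 2, 5 and 6 become 3). The function name recode_same, the sample responses, and the extra code 9 are hypothetical and serve only to show a value that has no recoding rule.

```python
# Mapping taken from the text: 1-2 -> 1, 3-4 -> 2, 5-6 -> 3.
PRAY_RULES = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}


def recode_same(value, rules):
    """Recode in place: a value with no rule keeps its original code."""
    return rules.get(value, value)


pray = [1, 2, 3, 4, 5, 6, 9]  # hypothetical responses; 9 has no rule
pray = [recode_same(v, PRAY_RULES) for v in pray]
print(pray)

# If value 6 ("never") had been left out of the rules, it would stay a 6.
partial_rules = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3}
print(recode_same(6, partial_rules))
```

The default argument in rules.get(value, value) is what makes an unrecoded value stay as it was, which is the behavior described for recoding into the same variable.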
Recoding is a very useful procedure and one that you will probably use a lot. It's worth spending time
practicing how to recode so you will be able to do it with ease when the time comes.
You can also create new variables out of old variables using COMPUTE. There are seven variables in the
data set we have been using that ask respondents if they think a woman ought to be able to obtain a
legal abortion under various scenarios. These are the variables abany (woman wants abortion for any
reason), abdefect (possibility of serious birth defect in baby), abhlth (woman's health is seriously
threatened), abnomore (woman is married and doesn't want any more children), abpoor (woman is
poor and can't afford more children), abrape (pregnant as result of rape), and absingle (woman is not
married). Each variable is coded 1 if the respondent says yes (ought to be able to obtain a legal
abortion) and 2 if the person says no. The missing values are 0 (not applicable, question wasn't asked), 8
(don't know), and 9 (no answer).
COMPUTE will allow us to combine these seven variables, creating a new variable that we will call
abortion. If the respondent said yes to all seven questions, the new variable would equal 7 and if the
respondent said no to all seven questions, the new variable would equal 14. But what about missing
values? If any of the seven variables have a missing value, then the new variable will be assigned a
system-missing value.
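The arithmetic described above can be sketched in Python. This illustrates the logic only and is not SPSS syntax; the function name sum_or_missing and the use of None for a missing value are assumptions made for the example.

```python
from typing import Optional, Sequence


def sum_or_missing(values: Sequence[Optional[int]]) -> Optional[int]:
    """Add the items; return None (system-missing) if any item is missing."""
    if any(v is None for v in values):
        return None
    return sum(values)


all_yes = [1] * 7                        # yes to all seven questions
all_no = [2] * 7                         # no to all seven questions
one_missing = [1, 1, None, 2, 2, 1, 1]   # one question has no valid answer

print(sum_or_missing(all_yes))      # 7
print(sum_or_missing(all_no))       # 14
print(sum_or_missing(one_missing))  # None
```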
To use COMPUTE, click on TRANSFORM and then click on COMPUTE. Your screen should look like Figure
3-12.
Figure 3-12
Type the name of the new variable, abortion, in the TARGET VARIABLE box. Then enter the formula for
this new variable in the NUMERIC EXPRESSION box. There are two ways to do this. One method is to
click on the first of the seven variables, abany, in the list of variables on the left, then click on the arrow
to the right of this list. This will move abany into the NUMERIC EXPRESSION box. Now click on the plus
sign and the plus sign moves into the box.
Continue doing this until the box contains the following formula: abany + abdefect + abhlth + abnomore
+ abpoor + abrape + absingle. (Don't type the period after absingle.) If you make a mistake, just click in
the NUMERIC EXPRESSION box and use the arrow keys and the delete and backspace keys to make
corrections. A second way to enter the formula in the NUMERIC EXPRESSION box is to click in the box
and type the formula directly into the box using the keyboard. Your screen should look like Figure 3-13.
Figure 3-13
Click on OK to indicate that you want SPSS to create this new variable. You can use the scroll bar to
scroll to the far right of the data matrix and view the variable you just created.
You can add variable and value labels to this variable by pointing your mouse at the variable name
(abortion) at the top of the column in the data matrix and double clicking. This will open the VARIABLE
VIEW tab in the DATA EDITOR. You can enter the variable and value labels the way you were taught
earlier in this chapter.
Enter the variable label “Sum of Seven Abortion Variables”. Enter the value label “High Approval” for
the value 7 and “Low Approval” for the value 14. (Remember that seven means they approved of
abortion in all seven scenarios and fourteen means they disapproved all seven times.) Click on OK.
You should check your new variable to see that it was calculated correctly. Go to ANALYZE, then
DESCRIPTIVE STATISTICS, and then FREQUENCIES. Click on RESET to get rid of what is already in the box.
Find the variable abortion, highlight it and click on the arrow to the left of the VARIABLES box. Then click
on OK. Your screen should look like Figure 3-14. The lowest number should be 7 and the highest
number should be 14.
Figure 3-14
One of the problems with this approach is that the new variable (abortion) will be assigned a
system-missing value if one or more of the original variables have a missing value. We can avoid this
problem by summing the valid values of the original variables and dividing by the number of variables
with valid values. For example, if six of the seven original variables had valid values, then we would
divide the sum by six. We can also tell SPSS to create this new variable only if at least four (or whatever
number we choose) of the original variables have valid values. If fewer than four of the original
variables have valid values, SPSS will assign it a system-missing value.
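The rule just described (average the valid values, but only when enough of them are present) can be sketched in Python. The function name mean_min_valid and the use of None for a missing value are assumptions made for this illustration; the threshold of four comes from the text.

```python
def mean_min_valid(values, min_valid=4):
    """Mean of the non-missing values, or None if fewer than min_valid are valid."""
    valid = [v for v in values if v is not None]
    if len(valid) < min_valid:
        return None  # too few valid answers: system-missing
    return sum(valid) / len(valid)


six_valid = [1, 1, 1, 2, 2, 2, None]              # sum is 9, divided by 6
three_valid = [1, 2, None, None, None, None, 1]   # only three valid answers

print(mean_min_valid(six_valid))    # 1.5
print(mean_min_valid(three_valid))  # None
```

Because the result is a mean rather than a sum, it ranges from 1 (yes to every question answered) to 2 (no to every question answered), regardless of how many questions had valid answers.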
We can do this by clicking on TRANSFORM and then on COMPUTE and entering the new variable name
in the TARGET VARIABLE box. Let’s call this variable abort. In the FUNCTION GROUP box, scroll down
and click on STATISTICAL. This will list the statistical functions in the FUNCTIONS AND SPECIAL
VARIABLES box. Double-click on Mean. Your screen should look like Figure 3-15.
Figure 3-15
Notice that MEAN(?,?) has been inserted in the NUMERIC EXPRESSION box. What you want to do is to
replace the (?,?) with the list of the seven original variables. It should now read (abany, abdefect,
abhlth, abnomore, abpoor, abrape, absingle). Be sure to separate the variable names with commas. All
that is left is to tell SPSS that you want to create this new variable only if at least four of the original
variables have valid values. Do this by entering “.4” following MEAN so the expression reads “MEAN.4
(abany, abdefect, abhlth, abnomore, abpoor, abrape, absingle)”. Your screen should look like
Figure 3-16.
Figure 3-16
Click on OK and run a frequency distribution to see what your new variable looks like. Your screen
should look like Figure 3-17.
Figure 3-17
Try creating another variable. Two of the variables in the data set are the number of years of education
of the respondent's father (paeduc) and of the respondent's mother (maeduc). If we divide paeduc by
maeduc we will get the ratio of the father's education to the mother's education. Any value greater than
one will mean that the father has more education than the mother and any value less than 1 means the
mother has more education than the father. Any value close to 1 means that the father and mother
have about the same education.
We have a small problem though. If the mother's education is zero, then we will be dividing by zero,
which is mathematically undefined. Let's recode any value of zero for maeduc so it becomes a one. This
will avoid dividing by zero and still give us a useful ratio of father's to mother's education. Click on
TRANSFORM, and then click on RECODE INTO SAME VARIABLES. (You may need to click on RESET to get
rid of the recoding instructions used earlier.) Move maeduc into the VARIABLES box by highlighting it in
the list of variables on the left and clicking on the arrow to the right of this list. Click on OLD AND NEW
VALUES and type “0” into the VALUE box under OLD VALUE and then click in the VALUE box under NEW
VALUE. Type “1” in this box and click on ADD. Your screen should look like Figure 3-18.
Figure 3-18
Now click on CONTINUE and then on OK in the RECODE INTO SAME VARIABLES box. Now we have changed
each 0 for maeduc into a 1. There is one more thing you need to do, which is to change the value label
for 0 so it reads 0-1.
To create our new variable, click on TRANSFORM and then on COMPUTE. (If necessary, click on RESET to
get rid of the formula for the abort variable you just created.) Call this new variable ratio. So, type
“ratio” in the TARGET VARIABLE box. Now we want to write the formula in the Numeric Expression box.
Click in the list of variables on the left and scroll down until you see paeduc. Click on it to highlight it and
click on the arrow to the right of the list to move it into the NUMERIC EXPRESSION box.
SPSS uses the slash (/) to indicate division, so click on the / in the box in the center of the window. Click
on the list of variables again and scroll up until you see maeduc and click on it to highlight it. Move it to
the NUMERIC EXPRESSION box by clicking on the arrow. Your screen should look like Figure 3-19.
Figure 3-19
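The two steps above (recode a mother's education of 0 to 1, then divide) can be sketched in Python. This is an illustration of the logic only; the function name education_ratio and the use of None for a missing value are assumptions made for the example.

```python
def education_ratio(paeduc, maeduc):
    """Ratio of father's to mother's years of education; None if either is missing."""
    if paeduc is None or maeduc is None:
        return None
    if maeduc == 0:
        maeduc = 1  # same effect as recoding 0 to 1 before computing
    return paeduc / maeduc


print(education_ratio(12, 12))    # 1.0, same education
print(education_ratio(8, 16))     # 0.5, mother has more education
print(education_ratio(16, 0))     # 16.0, no division by zero
print(education_ratio(None, 12))  # None
```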
Click on OK and SPSS will create your new variable. Use the scroll bar to scroll to the right in the data
matrix until you can see the new variable you called ratio. Scroll up and down so you can see what the
values of this variable look like. You may want to do a frequencies distribution as a check to make sure
the new variable was computed correctly.
After looking at the frequency distribution, it is obvious that it would be easier to understand if we
grouped some of the scores together, so create a new variable by recoding it into a different variable.
Click on TRANSFORM and then click on RECODE INTO DIFFERENT VARIABLES. Find the variable ratio in
the list of variables on the left and click on it to highlight it. (Again, you may have to click RESET if there
is old information still in the boxes.) Click on the arrow to the right of this list to move it into the box in
the middle of the window. Type “ratio1” in the NAME box under OUTPUT VARIABLE and type “Recoded
Ratio” in the LABEL box. Then click on CHANGE.
Click on OLD AND NEW VALUES to open the box. Click on the fifth bubble from the top under OLD VALUE
and then type “0.89” in the box to indicate that you want to recode the lowest value through 0.89. Click
on the Value box under New Value and type “1” in that box, and then click on ADD. Click on the fourth
bubble from the top under OLD VALUE and type “0.90” in the box above THROUGH and “1.10” in the
box below. Then type “2” in the VALUE box under NEW VALUE and click on ADD. Finally, click on the
sixth bubble from the top under OLD VALUE and type “1.11” in the box. Type “3” in the VALUE box
under NEW VALUE and click on ADD. Your screen should look like Figure 3-20. Click on CONTINUE and
then on OK in the RECODE INTO DIFFERENT VARIABLES box.
Figure 3-20
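A Python sketch of the three ranges as typed above, assuming that both endpoints of each range are included. The function name recode_ratio and the use of None for a system-missing value are assumptions made for this illustration. Under these assumptions, a ratio that falls strictly between 0.89 and 0.90, or between 1.10 and 1.11, matches none of the ranges, and since this is a recode into a different variable it would be left as system-missing. It is worth looking for such cases in the frequency distribution.

```python
def recode_ratio(ratio):
    """Recode the ratio into 1, 2, or 3 using the ranges as typed in the dialog."""
    if ratio is None:
        return None
    if ratio <= 0.89:            # lowest through 0.89
        return 1
    if 0.90 <= ratio <= 1.10:    # 0.90 through 1.10
        return 2
    if ratio >= 1.11:            # 1.11 through highest
        return 3
    return None                  # falls in a gap between the ranges


print(recode_ratio(8 / 16))    # 1
print(recode_ratio(12 / 12))   # 2
print(recode_ratio(16 / 12))   # 3
print(recode_ratio(17 / 19))   # about 0.895, matches no range
```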
Let’s add value labels to the new values. Find the variable ratio1 in the data matrix and double click on
the variable name, ratio1. This will open the VARIABLE VIEW tab in the DATA EDITOR. Click the VALUES
box and then click in the small blue box and enter the labels. Type “1” in the VALUE box and “under
0.90” in the VALUE LABEL box and then click on ADD. Do this twice more to add the label “0.90 through
1.10” to the value 2 and “over 1.10” to the value 3.
Click on OK in the VALUE LABELS box. Run a frequencies distribution on the new variable to double-
check your work. Your screen should look like Figure 3-21.
Figure 3-21
The first category (under 0.90) means that the father's education was less than 90% of the mother's
education. The second category (0.90 through 1.10) means that the father's and mother's education were
about the same, while the third category (over 1.10) means that the father's education was more than 110% of
the mother's education. You can see that about 48% of the respondents have fathers and mothers with
similar education, while about 27% have fathers with substantially less education than the mother and
another 26% have fathers with substantially more education than the mother.
You have already seen that SPSS uses + for addition and / for division. It also uses - for subtraction, * for
multiplication, and ** for exponentiation. There are other arithmetic operators and a large number of
functions (e.g., square root) that can be used in compute statements.
The IF command is another way to create new variables out of old variables. Perhaps we want to
compare the level of education of each respondent's father to that of his or her mother. Now, however,
we're not interested in the precise ratio, but just want to know if the father had more education than
the mother, the same amount, or less. We'll create a new variable that will have the value 1 when the
father has more education than the mother, 2 when both have the same amount of education and 3
when the mother has more education.
Click on TRANSFORM and then click on COMPUTE. (You may need to click on RESET to get rid of the
instructions for creating ratio.) Type the name of the new variable, compeduc, in the TARGET VARIABLE
box. Then click on the NUMERIC EXPRESSION box and enter “1”. So far, this is similar to what you did in
the previous section. Your screen should look like Figure 3-22.
Figure 3-22
Click on IF and then click on INCLUDE IF CASE SATISFIES CONDITION. Find paeduc in the list of variables
on the left and click on it to highlight it. Then click on the arrow to the right of this list. This will move
paeduc into the box to the right of the arrow. Now click on > (greater than). Find maeduc in the list of
variables on the left, click on it, and click on the arrow to add maeduc to the formula. (Alternatively, you
could click on the box to the right of the arrow and directly enter the formula, paeduc > maeduc). Your
screen should look like Figure 3-23.
Figure 3-23
Click on CONTINUE and then click on OK. Now repeat the same procedures as above, but this time
setting the value of compeduc to 2 (instead of 1) and the formula to paeduc = maeduc. When you are
asked if you want to CHANGE EXISTING VARIABLE, click on OK. Now repeat the procedures a third time
but change the value of compeduc to 3 and the formula to paeduc < maeduc.
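The three conditional assignments above amount to the following logic, sketched here in Python. The function name compeduc mirrors the variable name in the text; treating a missing value as None and returning None when either parent's education is missing is an assumption made for this illustration.

```python
def compeduc(paeduc, maeduc):
    """1 if the father has more education, 2 if the same, 3 if the mother has more."""
    if paeduc is None or maeduc is None:
        return None  # assumed: no value is assigned when either input is missing
    if paeduc > maeduc:
        return 1
    if paeduc == maeduc:
        return 2
    return 3  # paeduc < maeduc


print(compeduc(16, 12))  # 1
print(compeduc(12, 12))  # 2
print(compeduc(10, 14))  # 3
```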
You can add variable and value labels to this variable, just as you did earlier in this chapter and in
Chapter 2. To do this, point your mouse at the variable name at the top of the column (compeduc) and
double click. This will open the VARIABLE VIEW tab in the DATA EDITOR. Click in the VALUES box and
then on the small blue button on the right-hand side of the box.
This will open the VALUE LABELS box. Click in the box next to VALUE and type “1”. Click on the box next
to VALUE LABEL (or press the Tab key) and type “Dad More”. Now click on ADD. Repeat this procedure
for values 2 and 3, labeling them “Same” and “Mom More” respectively. Click on CONTINUE, then on
OK. Now run Frequencies for your new variable to double-check your work.
SPSS can also select subsets of cases for further analysis. One of the variables in the data set is the
respondent's religious preference (relig). The categories include Protestant (value 1), Roman Catholic
(2), Jewish (3), none (4), Christian unspecified (5), and other (6). The missing values are 98 (don't know)
and 99 (no answer). We might want to select only those respondents who have a religious preference
for analysis. We can do this by using the SELECT CASES option in SPSS.
Click on DATA and then on SELECT CASES. This will open the SELECT CASES box. Your screen should look
like Figure 3-24. Notice that ALL CASES is currently selected. (The circle to the left of ALL CASES is filled
in to indicate that it is selected.) We want to select a subset of these cases so click on the circle to the
left of IF CONDITION IS SATISFIED to select it. At the bottom of the window, the current status reads DO
NOT FILTER CASES, which means that no selection is in effect yet. In the OUTPUT box, FILTER OUT
UNSELECTED CASES should be selected. With this option, the cases you do not select stay in the data file
but are left out of the analysis. If you had selected DELETE UNSELECTED CASES, the unselected cases would
be deleted and could not be used later. You should be very careful about saving a file after you have
deleted cases because they are gone forever in that file. (You could, of course, get another copy of the
data file by clicking on FILE and on OPEN.)
Figure 3-24
Select IF CONDITION IS SATISFIED by clicking in the circle to the left of it. Now click on IF (below the button
that says IF CONDITION IS SATISFIED) and this will open the SELECT CASES: IF box. Scroll down the list of
variables on the left until you come to relig and then click on it to highlight it. Click on the arrow to the
right of this list to move relig into the box in the middle of the window. We want to select all cases that
are not equal to 4 so click on the ~= sign. This symbol means “not equal to.” Now click on 4 and the
expression in the box will read relig ~= 4 which means that the variable relig does not equal 4 (the code
for no religious preference). Your screen should look like Figure 3-25. Click on CONTINUE and then on
OK in the SELECT CASES box.
Figure 3-25
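The condition relig ~= 4 can be sketched in Python, where "not equal to" is written !=. This is an illustration of the logic only; the sample values and the names keep and selected are hypothetical. The list of True and False flags is kept separate from the data to show that the unselected cases are set aside, not removed.

```python
relig = [1, 4, 2, 3, 4, 6]  # hypothetical religious preference codes

# One flag per case: True means the case is used in the analysis.
keep = [r != 4 for r in relig]

# Only the flagged cases enter the analysis; the original list is unchanged.
selected = [r for r, k in zip(relig, keep) if k]

print(keep)        # [True, False, True, True, False, True]
print(selected)    # [1, 2, 3, 6]
print(len(relig))  # still 6 cases in the data
```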
Run a frequencies distribution for relig and check that it gives you the range of values that you
want. Your screen should look like Figure 3-26.
Figure 3-26
There are no respondents with a religious preference of 4 (none) in this table because you selected only
those cases with values not equal to four. Click on WINDOW in the menu bar and then click on GSS18A.
Notice that all the cases that were not selected are lined out.
What if we wanted to analyze only Protestants and Catholics? Click on DATA and then on SELECT CASES.
Select ALL CASES and click on OK. This will cancel the last selection and will make all cases active. Now
click on RESET to eliminate what you had entered previously. Click on IF CONDITION IS SATISFIED and
then on IF. Scroll down the list of variables and click on relig and then click on the arrow to the right of
the list to move it into the box. Click on = and then on 1 so the expression in the box reads relig = 1.
SPSS uses the symbol & for “and” and the symbol | for “or.” We want all cases for which relig is 1 or 2. Now
click on |. Click on relig in the list of variables again and then on the arrow to move it into the box. Then
click on = and then on 2 so the expression in the box reads relig = 1 | relig = 2 which means that relig will
equal 1 or 2. Your screen should look like Figure 3-27.
Figure 3-27
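The same "or" condition can be sketched in Python, which spells the operators out as the words and and or. The sample values are hypothetical. The second version, using a membership test, is an equivalent shorthand in Python and is shown only for comparison.

```python
relig = [1, 4, 2, 3, 2, 6, 1]  # hypothetical religious preference codes

# relig = 1 | relig = 2
selected = [r for r in relig if r == 1 or r == 2]

# Equivalent shorthand: the value is one of 1 or 2.
same = [r for r in relig if r in (1, 2)]

print(selected)  # [1, 2, 2, 1]
```

Note that the condition repeats the variable name on both sides of the "or". Writing relig = 1 | 2 would not express the same thing.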
Click on CONTINUE and on OK in the SELECT CASES box. Run a frequencies distribution for relig
to see what it looks like. Your screen should look like Figure 3-28. You will only have
Protestants (1) and Catholics (2) in your table because you selected only those cases with values one and
two on relig.
Figure 3-28
After you have selected cases for analysis, you might want to continue your analysis with all the cases.
To do this, remember to click on DATA, then on SELECT CASES, and then click on the circle to the left of
ALL CASES. Click on OK and SPSS will select all the cases in the data file. This is very important. If you
don't do this, you will continue to work with just the cases you have selected. This will work only if the
unselected cases were filtered out (FILTER OUT UNSELECTED CASES) and not deleted. If you selected
DELETE UNSELECTED CASES and saved the file, you will have to get another copy of the data file by
clicking on FILE and then on OPEN.
Using Count
The COUNT command counts the number of times a particular value or values occur in a set of
variables. There are six questions in which respondents are asked if they would allow various
categories of people to give a speech in their community. The value of 1 means that they would
allow that person to speak and 2 means that they would not allow it. Let’s count the number
of times that respondents would allow a speech.
But first we have to think about missing values. The missing values for the six speech variables
are 0, 8, and 9. We need to eliminate all cases with missing values for any of these six variables
from our analysis. We’re going to do this by recoding these three values into a single value (let’s
use 9) for each of the six variables. Then we’re going to use SELECT CASES to select those
cases for which each of the six variables is something other than 9.
Click on TRANSFORM and then on RECODE INTO DIFFERENT VARIABLES. We’ll start
with spkath. Give your new variable the name of spkath1. Click on CHANGE and then on OLD
AND NEW VALUES. Select the third option from the top and change all the system or user
missing values into a 9. Then use the last option to copy all other values into their old values.
Your screen should look like Figure 3-29.
Figure 3-29
Repeat this recoding process for each of the six speech variables. Run a frequency distribution
for the original six variables and the recoded variables to make sure you did it correctly. Don’t
bother adding value labels for these recoded values.
Now we’re going to use SELECT CASES to select those cases for which each of the six
variables is not equal to 9. Click on DATA and then on SELECT CASES. Click on RESET to
erase what you had previously entered in the dialog box. Select IF CONDITION IS SATISFIED and
then click on IF. Enter the following expression: “spkath1 ~= 9 & spkcom1 ~= 9
& spkhomo1 ~= 9 & spkmil1 ~= 9 & spkmslm1 ~= 9 & spkrac1 ~= 9”. (Don’t enter the
quotation marks.) Your screen should look like Figure 3-30.
Figure 3-30
Click on CONTINUE and then on OK. Run frequency distributions for the six recoded speech
variables. The only values that should show on the distributions are 1 and 2. There should be 1,451
cases in each distribution. These are the cases for which there are no missing values for any of
the six variables.
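The selection expression above requires every one of the six recoded variables to differ from 9. A Python sketch of that test follows. The variable names come from the text; the two sample cases and the function name has_no_missing are hypothetical.

```python
SPEECH_VARS = ["spkath1", "spkcom1", "spkhomo1", "spkmil1", "spkmslm1", "spkrac1"]


def has_no_missing(case):
    """True only if none of the six recoded speech variables equals 9."""
    return all(case[name] != 9 for name in SPEECH_VARS)


cases = [
    {"spkath1": 1, "spkcom1": 2, "spkhomo1": 1, "spkmil1": 1, "spkmslm1": 2, "spkrac1": 1},
    {"spkath1": 1, "spkcom1": 9, "spkhomo1": 1, "spkmil1": 1, "spkmslm1": 2, "spkrac1": 1},
]

selected = [c for c in cases if has_no_missing(c)]
print(len(selected))  # 1, because the second case has a 9 on spkcom1
```

Joining the six comparisons with & means a single 9 is enough to drop the case, which is why all() is the matching test here.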
Now we’re ready to use COUNT. Click on TRANSFORM and then on COUNT VALUES
WITHIN CASES. Your screen should look like Figure 3-31.
Figure 3-31
Enter the name of the variable you’re going to create in the TARGET VARIABLE box. Let’s
call this variable spk. You can add a variable label by putting the label in the TARGET LABEL
box. Let’s label this “number of speak variables answered yes.” In the VARIABLES box put
the variables that you want to include in the count. In this example it would be the variables
spkath1, spkcom1, spkhomo1, spkmil1, spkmslm1, and spkrac1. Your screen should look like
Figure 3-32.
Figure 3-32
Click on the DEFINE VALUES box and enter the values that you want to count. In our example
this would be the value 1. Notice that you can add as many values as you want. Enter the value
“1” in the VALUE box and then click on ADD. Your screen should look like Figure 3-33.
Figure 3-33
Now click on CONTINUE and then on OK. Run FREQUENCIES for the variable spk and your
screen should look like Figure 3-34.
Figure 3-34
The output tells us that 48 respondents did not want to allow any of the people in these groups to
speak in their community and 510 thought all should be allowed to speak.
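What the count does for each respondent can be sketched in Python. This illustrates the logic only and is not SPSS syntax; the function name count_value and the sample case are hypothetical, while the variable names and the counted value 1 come from the text.

```python
SPEECH_VARS = ["spkath1", "spkcom1", "spkhomo1", "spkmil1", "spkmslm1", "spkrac1"]


def count_value(case, variables, targets=(1,)):
    """Number of the listed variables whose value is one of the target values."""
    return sum(1 for name in variables if case[name] in targets)


# Hypothetical respondent who would allow four of the six speakers.
case = {"spkath1": 1, "spkcom1": 2, "spkhomo1": 1, "spkmil1": 1, "spkmslm1": 2, "spkrac1": 1}

spk = count_value(case, SPEECH_VARS)
print(spk)  # 4
```

The targets argument accepts more than one value, in the same way that the dialog box lets you add as many values to count as you want.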
Weighting
Sometimes you want to weight the cases so that they better represent the population from which you
selected your sample. For example, if your sample has more females and fewer males than the
population, you would want to weight on the variable sex. The General Social Survey provides us with a
weight variable called wtss.
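The effect of weighting on a frequency count can be sketched in Python: each case contributes its weight to its category's total, where an unweighted count would add one. The function name weighted_freq and the sample values and weights are hypothetical; only the variable names sex and wtss come from the text.

```python
from collections import defaultdict


def weighted_freq(values, weights):
    """Sum of the case weights within each category."""
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight
    return dict(totals)


sex = [1, 2, 2, 2]             # hypothetical sample: one male, three females
wtss = [1.5, 0.5, 1.0, 1.0]    # hypothetical weights

print(weighted_freq(sex, wtss))  # {1: 1.5, 2: 2.5}
```

In this made-up sample the unweighted counts are 1 and 3, while the weighted counts are 1.5 and 2.5, so the weights pull the sample closer to an even split.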
To weight the cases in your sample, click on DATA and then on WEIGHT CASES. Your screen should look
like Figure 3-35.
Figure 3-35
To use wtss as your weight variable, click on the circle to the left of WEIGHT CASES BY, scroll down until
you find wtss, click on it to highlight it and click on the arrow to the left of WEIGHT CASES BY. Your
screen should look like Figure 3-36. Now click on OK.
Figure 3-36
The data set that you are using in this tutorial has already been weighted by wtss. In the lower-right
hand corner of the DATA EDITOR screen, you will see WEIGHT ON which tells you that the data set has
already been weighted.
RECODE Exercises
There are two variables that refer to the highest year of school completed by the respondent's mother
and father (maeduc and paeduc). Do a frequency distribution for each of these variables. Now recode
each of them (into a different variable) into three categories: under 12 years of school, 12 years, and
over 12 years. Create new value labels for the recoded categories. Do a frequency distribution again to
make sure that you recoded correctly.
Income16 is the total family income for the previous year (2018). Do a frequency distribution to see
what the variable looks like before recoding. Recode (into a different variable) into eight categories:
under $10,000, $10,000 to $19,999, $20,000 to $29,999, $30,000 to $39,999, $40,000 to $49,999,
$50,000 to $59,999, $60,000 to $74,999, and $75,000 and over. Be very careful that you recode the
values, not the labels associated with the values. Call the new variable inc1. Create new value labels for
the recoded categories. Do another frequency distribution to make sure you recoded correctly.
Now recode income16 again (into a different variable). This time use only four categories: under
$20,000, $20,000 to $39,999, $40,000 to $59,999, and $60,000 and over. Call the new variable inc2.
Create new value labels for the recoded categories. Do another frequency distribution to make sure you
recoded correctly.
COMPUTE Exercises
In this chapter we created a new variable called abortion, which was the sum of the seven abortion
variables in the data set. Create a new variable called ab1, which is the sum of abdefect, abhlth, and
abrape. Do a frequency distribution for this new variable to see what it looks like. How is this
distribution different from the distribution for the abortion variable based on all seven variables?
There are six variables that measure tolerance for letting someone speak in your community who may
have very different views than your own (spkath, spkcom, spkhomo, spkmil, spkmslm, and spkrac). For
each of these variables, 1 means that they would allow such a person to speak and 2 means that they
would not allow it. Create a new variable (call it speak), which is the sum of these six variables. This
new variable would have a range from 6 (would allow a person to speak in each of the six scenarios) to
12 (would not allow a person to speak in any of the six scenarios). Do a frequency distribution for this
new variable to see what it looks like.
Repeat the exercise above on letting someone speak in your community but this time compute the
mean score for all six variables. If respondents answered fewer than three of the six questions, tell SPSS to
assign them a system missing value. Call this new variable spkmean. Run a frequency distribution for
this new variable.
Select males and run a frequency distribution for fear. Now select females and run a frequency
distribution for fear. Were males or females more fearful of walking alone at night in their
neighborhood?
IF Exercises
There are two variables that describe the highest educational degree of the respondent's father and
mother (padeg and madeg). Create a new variable (call it mapaeduc) that indicates if the father and
mother have a college education. This variable should equal 1 if both parents have a college education,
2 if only the father has a college education, 3 if only the mother has a college education, and 4 if neither
parent has a college education. Create new value labels for the recoded categories. Do a frequency
distribution for this new variable to see what it looks like.
One variable indicates how often the respondent prays (pray) and another variable indicates if the
respondent approves or disapproves of the Supreme Court's decision regarding prayer in the public
schools (prayer). Create a new variable (call it pry) that is a combination of these two variables. This
variable should equal 1 if the respondent prays a lot (once a day or several times a day) and approves of
the Supreme Court's decision, 2 if the respondent prays a lot (once a day or several times a day) and
disapproves of the Supreme Court's decision, 3 if the respondent doesn't pray a lot and approves of the
Supreme Court's decision, and 4 if the respondent doesn't pray a lot and disapproves of the Supreme
Court's decision. Do a frequency distribution for this new variable to see what it looks like.
Count Exercises
Use the COUNT command to create a new variable that is the count of the number of times that
respondents said they would allow (value 1) people to speak in their community. Use the variables
spkath, spkcom, spkhomo, spkmil, spkmslm, and spkrac that you used in one of the exercises above. Call
this new variable spkcount. Run a frequency distribution for this new variable.
SELECT IF Exercises
Select all males (1 on the variable sex) and do a frequency distribution for the variable fear (afraid to
walk alone at night in the neighborhood). Then select all females (2 on the variable sex) and do a
frequency distribution for fear. Are males or females more fearful of walking alone at night?
Select all whites (1 on the variable race) and do a frequency distribution for the variable pres12. Were
they more likely to vote for Obama or Romney in 2012? Then select all blacks (2 on the variable race)
and do a frequency distribution for pres12. Were whites or blacks more likely to vote for Obama or
Romney?
Next Chapter
In this chapter you learned how to recode, create new variables using compute, if, and count, how to
select particular cases for analysis and how to weight the data. You can do more complicated things
with these commands than we have shown you, but these are the basics. In the rest of this book, we
will show you some of the statistical procedures that SPSS can carry out for you. In Chapter 4 we’ll focus
on describing variables one at a time, which is typically referred to as univariate analysis.
This chapter explains how to analyze variables one at a time. We’ll look at three different SPSS
commands.
FREQUENCIES
DESCRIPTIVES
EXPLORE
Frequencies
Frequency distributions show you the number of cases for each category of your variable. They also
convert these frequencies to percents and tell you how many cases had missing information. Click on
ANALYZE, then on DESCRIPTIVE STATISTICS, and finally on FREQUENCIES and you should see Figure 4-1.
The list of variables will be on the far left. If SPSS displays the variable labels instead of the variable
names, click on EDIT in the menu bar and then on OPTIONS. Click on DISPLAY NAMES and on SORT BY
NAME. Now it will display the variable names in alphabetical order. If you are using your own
computer, SPSS will remember your choices and you won’t have to do this each time you open SPSS.
However, if you are working in a computer lab, you may have to do it each time you open SPSS.
Figure 4-1
Now that you have adjusted the list of variables to make it easier to find a particular variable, select the
variable(s) for which you want to get a frequency distribution by clicking on them and then clicking on
the arrow pointing to the right to move them to the VARIABLES box. We’re going to use educ in this
example so move educ over to the VARIABLES box and you should see Figure 4-2.
Figure 4-2
To get the frequency distribution click on OK and you should see Figure 4-3. SPSS has five columns of
information.
The first column shows the responses to this question. Valid responses are cases where the
respondents answered the question by telling us their years of education. Missing data are cases where the
respondent did not answer the question but rather said they didn’t know (value 98) or refused
to answer (value 99).
The second column shows the number or frequency of respondents that gave specific
responses.
The third column converts the frequencies to percents. Note that these percents are computed
by dividing each frequency by the total number of cases in the sample (2,348).
The fourth column converts the frequencies to percents which are computed by dividing the
frequencies by the number of valid responses (2,345). The number of valid responses is the
total number of cases in the sample minus the number of cases with missing information. These
are called valid percents and typically are the percents we want to use in describing the data. In
this variable, the percents and valid percents are basically the same because there are so few
cases (3) with missing information. When there are more cases with missing information these
percents can be quite different.
The fifth column shows the cumulative percents which cumulate the valid percents. Each
cumulative percent is the valid percent for that category plus the valid percents for all categories listed above it.
Figure 4-3
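The arithmetic behind the percent, valid percent, and cumulative percent columns can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the responses list is made-up data rather than the GSS file, None stands in for a missing answer, and frequency_table is a hypothetical helper name, not an SPSS feature.

```python
from collections import Counter

# Hypothetical responses (years of school); None marks a missing answer.
responses = [12, 12, 16, 14, None, 12, 16, 18, 14, 12]


def frequency_table(values):
    """Return rows of (value, count, percent, valid_percent, cumulative_percent)."""
    total = len(values)                             # all cases, including missing
    valid = [v for v in values if v is not None]    # cases with a usable answer
    n_valid = len(valid)
    counts = Counter(valid)
    rows = []
    cumulative = 0.0
    for value in sorted(counts):
        count = counts[value]
        percent = 100 * count / total               # base: the whole sample
        valid_percent = 100 * count / n_valid       # base: valid cases only
        cumulative += valid_percent                 # running sum of valid percents
        rows.append((value, count, percent, valid_percent, cumulative))
    return rows


table = frequency_table(responses)
for row in table:
    print("%5s %5d %7.1f %7.1f %7.1f" % row)
```

With one missing case out of ten, the percent and valid percent columns differ noticeably, which is the situation the text warns about when many cases have missing information.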
Statistics
SPSS will compute various statistics. Click on the STATISTICS button in the upper right of your screen.
These statistics include the following:
percentiles,
measures of central tendency (mode, median, mean),
measures of dispersion (range, standard deviation, variance), and
measures of skewness and kurtosis.
The statistics you choose are partially dictated by the level of measurement of the variables. Levels are
often classified as nominal, ordinal, interval, and ratio.4
A nominal measure is one in which respondents are sorted into a set of categories which are
qualitatively different from each other. The categories in a nominal level measure have no
inherent order to them. This means that it wouldn’t matter how we ordered the categories.
They could be arranged in any number of different ways. In our data file, marital is an example
of a nominal measure.
An ordinal measure is a nominal measure in which the categories are ordered from low to high
or from high to low. In our data file, class is an example of an ordinal measure. But notice that
while the categories are ordered they lack an equal unit of measurement. That means that the
differences between categories are not necessarily equal. For example, the class difference
between upper class (1) and middle class (2) is probably not the same as the difference between
middle class (2) and lower class (3).
4
See Dan Osherson and David M. Lane, “Levels of Measurement”,
http://onlinestatbook.com/2/introduction/levels_of_measurement.html. See also S. S. Stevens, “On the Theory of
Scales of Measurement”, 1946, Science, volume 103, pp. 677-80.
An interval measure is an ordinal measure with equal units of measurement. Temperature
measured in degrees Fahrenheit would be an example of an interval measure. The difference
between 20 degrees and 40 degrees is the same as the difference between 70 degrees and 90
degrees. Now these numbers have the properties of real numbers and we can add them and
subtract them. But notice one thing about the Fahrenheit scale. There is no absolute zero
point. There can be both positive and negative temperatures. That means that we can’t
compare values by taking their ratios. For example, we can’t divide 80 degrees Fahrenheit by 40
degrees and conclude that 80 is twice as hot as 40. To do this we would need a measure with an
absolute zero.5
A ratio measure is an interval measure with an absolute zero point. The variable educ is an
example of a ratio measure. Notice that it has an absolute zero point; you can’t have less than
zero years of school.
Since educ is a ratio variable, we could use the mean, median, and mode as our measures of central
tendency and the standard deviation and variance as our measures of variability. If our variable was
class (i.e., ordinal), then we couldn’t use the mean but could use the median and the mode as our
measures of central tendency and the standard deviation and variance wouldn’t be appropriate
measures of variability. If our variable was marital (i.e., nominal), then we could only use the mode as
our measure of central tendency.
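The rule of thumb in the paragraph above can be written down as a small lookup table. This is a sketch, not part of SPSS: the dictionary and the can_use function are hypothetical names, and treating interval measures the same way as ratio measures is our assumption (the text only works through the ratio, ordinal, and nominal cases).

```python
# Statistics treated as appropriate at each level of measurement.
# The "interval" entry mirrors "ratio"; that pairing is an assumption.
APPROPRIATE = {
    "nominal": {"mode"},
    "ordinal": {"mode", "median"},
    "interval": {"mode", "median", "mean", "standard deviation", "variance"},
    "ratio": {"mode", "median", "mean", "standard deviation", "variance"},
}


def can_use(statistic, level):
    """Return True if the statistic is listed as appropriate for the level."""
    return statistic in APPROPRIATE[level]


print(can_use("mean", "ordinal"))    # class is ordinal, so the mean is not listed
print(can_use("median", "ordinal"))
print(can_use("mode", "nominal"))    # marital is nominal, so only the mode is listed
```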
So, for educ, we’re going to ask for the mean, median, mode, minimum value, maximum value, standard
deviation, and variance. We’re also going to ask for quartiles. The first quartile is the 25th percentile;
the second quartile is the 50th percentile, and the third quartile is the 75th percentile. The second
quartile and the 50th percentile are also the same as the median. Figure 4-4 shows the SPSS output for
these choices.
5
You might wonder why we didn’t use an example from the GSS data file. There isn’t one. Interval measures don’t occur in
social science research very often. There are examples from the field of business. Think about profit for
businesses over a fiscal year. There is no absolute zero. Profit could be positive or negative.
Figure 4-4
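If you want to check statistics like these by hand, Python's standard statistics module computes the same kinds of quantities. The values below are made up for illustration and are not the GSS educ data. Quartile definitions differ between packages, so the quartiles from statistics.quantiles may not match SPSS exactly on every data set; treat this as a sketch.

```python
import statistics

# Hypothetical years of school for eleven respondents (not GSS data).
educ = [8, 10, 12, 12, 12, 13, 14, 16, 16, 18, 20]

mean = statistics.mean(educ)
median = statistics.median(educ)
mode = statistics.mode(educ)
minimum, maximum = min(educ), max(educ)
sd = statistics.stdev(educ)            # sample standard deviation (divides by n - 1)
variance = statistics.variance(educ)   # sample variance, the square of sd

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(educ, n=4)

print("mean", round(mean, 2), "median", median, "mode", mode)
print("min", minimum, "max", maximum)
print("sd", round(sd, 2), "variance", round(variance, 2))
print("quartiles", q1, q2, q3)
```

Note that the second quartile returned here equals the median, as the text describes.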
There are two other options in Figure 4-2 which we should mention.
Unchecking the box for DISPLAY FREQUENCY TABLES tells SPSS not to show the frequency
distribution.
You can also select CHARTS. We’ll discuss charts and graphs next.
In this chapter, we’ll show you how to construct basic pie charts, bar charts, and histograms as
byproducts of the FREQUENCIES procedure, and basic boxplots as a byproduct of EXPLORE. We’ll
provide a fuller explanation of these graphics in Chapter 9. Scatterplots, used to describe relationships
between interval or ratio variables, will be covered in Chapter 7.6
A pie chart is a chart that shows the frequencies or percents of a variable with a small number of
categories. It is presented as a circle divided into a series of slices.
The area of each slice is proportional to the number of cases or the percent of cases in each category. It
is normally used with nominal or ordinal variables but can be used with interval or ratio variables which
have a small number of categories. Let’s run the pie chart for class. Figure 4-5 is the result.
6
You can also use CROSSTABS to produce clustered bar charts, but we won’t be covering this.
Figure 4-5
A bar chart is a chart that shows the frequencies or percents of a variable and is presented as a series of
vertical bars that do not touch each other. The height of each bar is proportional to the number of cases
or the percent of cases in each category. It is normally used with nominal or ordinal variables.
Figure 4-6 is a bar chart of this same variable (class).
Figure 4-6
A histogram is a graph that shows the frequencies or percents of a variable with a larger number of
categories. It is presented as a series of vertical bars that touch each other. The height of each bar is
proportional to the number of cases or the percent of cases in each category. It is used with interval or
ratio variables. Figure 4-7 is a histogram of educ.
Figure 4-7
To get a chart from SPSS, click on the CHARTS button and check the box for the type of chart you want.
If you don’t want to get the frequency distribution, uncheck the box for DISPLAY FREQUENCY TABLES.
We’ll discuss editing these charts to make them more useful in Chapter 9.
Descriptives
The DESCRIPTIVES procedure is similar to FREQUENCIES except that it does not produce frequency
distributions. It should be used when you only want the statistics. Click on ANALYZE, then on
DESCRIPTIVE STATISTICS, and finally on DESCRIPTIVES. You should see Figure 4-8.
Figure 4-8
Use DESCRIPTIVES the same way you used FREQUENCIES. Move the variables you want to use (age in our example)
into the VARIABLES box. Click on OPTIONS, select the statistics you want, and click
on OK. This time we’re going to use the default statistics (i.e., mean, standard deviation, minimum,
maximum). Your output should look like Figure 4-9.
Figure 4-9
Explore
Click on ANALYZE, then on DESCRIPTIVE STATISTICS, and finally on EXPLORE. You should see Figure 4-10.
Figure 4-10
Move the variables that you want to describe or explore into the DEPENDENT LIST box. For this chapter,
ignore why it calls them dependent variables. We’ll come back to that question in Chapter 5 on cross
tabulation. Let’s focus on age so move the variable age into the DEPENDENT LIST box.
SPSS computes several sets of statistics to describe the variables you chose. Click on STATISTICS in the
upper right of the dialog box and check the boxes for DESCRIPTIVES, OUTLIERS, and PERCENTILES.
In Figure 4-11 DESCRIPTIVES computes a wide array of different ways of describing central
tendency, variability, skewness, and kurtosis.
Figure 4-11
EXTREMES (see Figure 4-12) shows you the five largest and five lowest values in your variables.
Figure 4-12
Figure 4-13
A boxplot is a graph that shows you quite a bit of information about a variable. Figure 4-14 is the boxplot for age.
o The top line in the blue box is the 75th percentile and the bottom line is the 25th
percentile.
o The middle line is the median.
o The height of the box is called the Inter-Quartile Range (IQR) which is the difference
between the 75th and the 25th percentiles.
o The vertical lines give you a visual picture of the amount of dispersion. They extend from
the top and bottom of the box to the most extreme values that lie within 1.5 times the IQR of the box.
o If there were values that extended beyond the end of the vertical lines, they would be
displayed as circles and would represent outliers. (SPSS marks values more than 3 times the
IQR from the box with asterisks as extreme values.) In this example, there aren’t
any such values.
Figure 4-14
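The pieces of a boxplot can be computed directly, which may help in reading one. The sketch below uses made-up ages and a hypothetical helper named boxplot_summary. It takes the quartiles from Python's statistics.quantiles, whereas SPSS draws its boxes from a slightly different quartile definition, so the numbers may not agree exactly with SPSS output.

```python
import statistics


def boxplot_summary(values):
    """Return the quartiles, IQR, whisker ends, and outliers for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr    # values below this count as outliers
    upper_fence = q3 + 1.5 * iqr    # values above this count as outliers
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),   # whiskers stop at the last value inside the fences
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical ages; 89 sits far above the rest.
ages = [30, 32, 33, 35, 36, 38, 40, 41, 43, 45, 89]
summary = boxplot_summary(ages)
print(summary)
```

In this made-up sample the upper whisker stops at 45 and the value 89 would be drawn as a separate point.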
Factors
Let’s say you want to explore the distributions separately for men and for women. In this case, you
would enter the variable sex in the FACTOR LIST box, so go ahead and move sex into it and click on OK.
Now SPSS will display your output once for males and a second time for females.
Frequencies Exercises
Run FREQUENCIES for hrs1 (number of hours worked last week). Ask for the following statistics: mode,
median, mean, minimum, maximum, range, variance, standard deviation, quartiles. Tell SPSS to
construct a histogram. What do these statistics and the graph tell you about this variable?
Run FREQUENCIES for attend (how often respondent attends religious services). Ask for the following
statistics: mode and median. Tell SPSS to construct a bar graph and a pie chart. What do these statistics
and the graphs tell you about this variable?
Descriptives Exercises
Run DESCRIPTIVES for maeduc (education for respondent’s mother). Ask for the default statistics.
Run DESCRIPTIVES for paeduc (education for respondent’s father). Ask for the default statistics.
Is there much of a difference between respondents’ mothers and fathers in terms of education?
Explore Exercises
Run EXPLORE for hrs1. Ask for the following statistics: DESCRIPTIVES, OUTLIERS, PERCENTILES. What
do these statistics and the box plot tell you about this variable?
Run EXPLORE for hrs1 but this time add sex to the FACTOR LIST box which will allow you to compare males
and females. What are the differences in the two boxplots for males and females?
NEXT CHAPTER
In Chapter 5 we’re going to start looking at bivariate analysis which involves focusing on the relationship
between pairs of variables. One way to do that for interval and ratio variables is to compare means.
SPSS offers several different ways of comparing means.
To make it easier to follow the instructions in this chapter, we recommend that you set certain options
in SPSS in the same way that we have. First, click on EDIT in the menu bar, then on OPTIONS, and
GENERAL. Under VARIABLE LISTS, click on DISPLAY NAMES, and ALPHABETICAL. Now variables will be
listed by their variable names in alphabetical order.
Crosstabs
Crosstabs are particularly useful for exploring the relationship between variables. Open the GSS18A
data file and click on ANALYZE, DESCRIPTIVE STATISTICS, and CROSSTABS. This will open the dialog box
shown in Figure 5-1.
Figure 5-1
We’re going to try to explain why some people think that abortion should be legal and others feel that it
should be illegal. Dependent variables are the variables that you want to explain. So, in this example,
the abortion variables will be our dependent variables.
Independent variables are the variables that you think will help explain the variation in the dependent
variables. If you think that the variable sex might account for this variation, then sex will be your
independent variable. Another way to say this is that the independent variable is the possible causal
In Appendix A, you will see that there are seven variables that deal with opinions about abortion. Let’s
choose abhlth (abortion if the woman’s health is seriously endangered) as our dependent variable and
sex as our independent variable. We’re going to follow the convention of putting our independent
variable in the columns and our dependent variable in the rows. To do this, select abhlth from the list
on the left by clicking on it, then use the arrow button to the right of the list box to move the variable into
the ROW box. Now move sex into the COLUMN box. For now, ignore the bottom box – more about it in
Chapter 8. If you’ve done everything correctly, your screen will look like Figure 5-2.
Figure 5-2
Now click on CELLS. The OBSERVED box should already be selected—it shows the actual number of
cases in each cell. This is the default. We need to get percentages so we can compare columns with
varying numbers of cases. An easy rule to follow is if your independent variable is in the columns, then
use the column percents and if it is in the rows, then use the row percents. Since we decided to put the
independent variable in the columns, you should select the column percents. So, check the box for
COLUMNS as in Figure 5-3.
Figure 5-3
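Column percents are simply each cell count divided by its column total. The sketch below shows the calculation on a made-up table. The counts are hypothetical, not GSS results, and column_percents is an illustrative helper name.

```python
# Hypothetical cell counts: rows are the dependent variable (answer),
# columns are the independent variable (sex).
counts = {
    "yes": {"male": 90, "female": 110},
    "no": {"male": 10, "female": 15},
}


def column_percents(table):
    """Convert cell counts to percents of each column total."""
    columns = list(next(iter(table.values())))
    col_totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {
        r: {c: 100 * row[c] / col_totals[c] for c in columns}
        for r, row in table.items()
    }


percents = column_percents(counts)
for answer, row in percents.items():
    print(answer, {sex: round(p, 1) for sex, p in row.items()})
```

Because each column sums to 100, the two columns can be compared even though they contain different numbers of cases.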
Now click on CONTINUE to get back to the CROSSTABS dialog box. Once you are back there, click OK.
SPSS will now open the OUTPUT WINDOW, which will display your table (see Figure 5-4).
Figure 5-4
The CASE PROCESSING SUMMARY shows the valid, missing, and total cases. The high percent of missing
cases here reflects the people who were not asked this particular question in the survey. Only the valid
cases appear in the crosstab.
The crosstab shows the percent of men and women who said that abortion should be legal and not legal
in the case of a woman’s health being seriously endangered. We see that 90.7% of men and 89.0% of
Your initial conclusion here might be that on abortion issues, there’s virtually no difference between
men and women in their responses. Is this correct or did you stop your analysis a little too soon? Let’s
look at a different abortion variable. Repeat the steps above but use abnomore as your dependent
variable this time. Your results should look like Figure 5-5.
Figure 5-5
Now we see that 55.5% of the men and 46.0% of the women said Yes to “Abortion if a woman is married
and wants no more children,” a percentage point difference of 9.5. When we compare Figure 5-4 with
Figure 5-5, we see that there is a larger percent difference for abnomore than there is for abhlth. We
also see that a much larger percent of all respondents (both men and women) think that abortion
should be legal in the case of a woman’s health being seriously endangered than in the case of a woman
who is married and doesn’t want any more children.
We might also want to know if the relationship in Figure 5-5 is statistically significant. To answer that
question, we need to use Chi-Square as our test of significance. We might also want to get a measure of
how strong the relationship between the two variables is. Here we need a measure of association.
Let’s run the crosstab again and get Chi-Square and a measure of association. Click on ANALYZE,
DESCRIPTIVE STATISTICS, and CROSSTABS. In the CROSSTABS dialog box place abnomore as the row
variable and sex as the column variable. Now click on the STATISTICS button, then click on Chi Square to
obtain a test of statistical significance, and on Phi and Cramer’s V, which are measures of the strength of
association we could use when the two variables are both at the nominal level of measurement. Phi is
appropriate for tables with two rows and two columns, while Cramer’s V is appropriate otherwise. Your
dialog box should look like Figure 5-6.
Figure 5-6
Click on CONTINUE, then OK. The table in Figure 5-5 reappears, but with some additional information
(you might have to scroll down to see it)—look for “Chi-Square Tests” (Figure 5-7).
Figure 5-7
The Pearson Chi-Square test indicates that the relationship is statistically significant. A relationship this strong would occur by
chance less than 5 times out of 10,000 if the two variables were unrelated in the population.7 The Cramer’s V of .095 in Figure 5-8 indicates that this is
at most a very weak relationship.
7
The significance value is a rounded value. So, .000 means that it is less than .0005 or less than five out of ten
thousand.
Figure 5-8
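To see where Chi-Square and Cramer’s V come from, here is a minimal sketch of both calculations on a table of observed counts. The counts are invented for illustration, and the function names are ours. The sketch stops at the Chi-Square value and its degrees of freedom; it does not compute the significance value that SPSS reports.

```python
import math


def chi_square(observed):
    """Return (chi-square, degrees of freedom, n) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if rows and columns were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, n


def cramers_v(observed):
    """Cramer's V: chi-square rescaled to run from 0 to 1."""
    chi2, _, n = chi_square(observed)
    k = min(len(observed), len(observed[0]))   # smaller of rows and columns
    return math.sqrt(chi2 / (n * (k - 1)))


# Hypothetical 2 x 2 table of counts.
table = [[30, 20],
         [20, 30]]
chi2, df, n = chi_square(table)
v = cramers_v(table)
print(round(chi2, 3), df, round(v, 3))
```

For a table with two rows and two columns, Cramer’s V has the same magnitude as Phi.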
Let’s look at a somewhat different table. We’re going to consider the relationship between education
and political views. Click on ANALYZE, DESCRIPTIVE STATISTICS, and CROSSTABS. If the variables you
used before are still there, click on the RESET button, then move polviews to the ROW box and degree to
the COLUMN box. Since both of these variables are ordinal, we’ll want to obtain different statistics to
measure their relationship. Click on STATISTICS and then on Chi-square and Kendall’s tau c. (Tau c is a
measure of association that is appropriate when both variables are ordinal and do not have the same
number of categories.)
Now click on CONTINUE and then on CELLS and then on COLUMN PERCENTS. Click on CONTINUE and
then click on OK. What do the results show? While the Chi-square statistic is statistically significant, the
value of Kendall’s tau c is quite low indicating that there is virtually no relationship between these two
variables. The pattern to the percents shows the same lack of relationship.
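Kendall’s tau c is built from counts of concordant and discordant pairs of cases. The sketch below uses the usual textbook formula, 2m(P - Q) / (n²(m - 1)), where P and Q are the numbers of concordant and discordant pairs, n is the number of cases, and m is the smaller of the number of rows and columns. The table is made up, kendall_tau_c is a hypothetical helper, and both rows and columns are assumed to be listed from low to high.

```python
def kendall_tau_c(table):
    """Kendall's tau c for a table of counts with ordered rows and columns."""
    n_rows, n_cols = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    concordant = 0
    discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            # Pair cell (i, j) with every cell in a later row.
            for k in range(i + 1, n_rows):
                for l in range(n_cols):
                    if l > j:      # higher on both variables: concordant
                        concordant += table[i][j] * table[k][l]
                    elif l < j:    # higher on one, lower on the other: discordant
                        discordant += table[i][j] * table[k][l]
    m = min(n_rows, n_cols)
    return 2 * m * (concordant - discordant) / (n ** 2 * (m - 1))


# Hypothetical table with 2 ordered rows and 3 ordered columns.
example = [[10, 5, 0],
           [0, 5, 10]]
tau_c = kendall_tau_c(example)
print(round(tau_c, 3))
```

A value near zero, like the one the text reports for polviews and degree, means concordant and discordant pairs nearly cancel out.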
Chapter 5 Exercises
Suppose we measure class by what people perceive their social class to be (using the variable named
class). How closely is this measure related to a person’s self-identified political views (polviews)? Note:
before running this crosstab, look at the frequency distribution for class. (See Chapter 4 on univariate
statistics.) You may want to recode this variable before proceeding. (See Chapter 3 on transforming
data.) Describe the relationship in the table. Be sure to use Chi Square and an appropriate measure of
association.
Consult the codebook in Appendix A describing this dataset. Other than education and self-perceived
class, what other background variables (such as age, marital status, religion, sex, race, or income) might
help explain a person’s political views? Run CROSSTABS to see the tables. (Here as well, you may need
to recode some variables before proceeding.) Describe the relationship in the tables. Be sure to use Chi
Square and an appropriate measure of association.
Is trust related to race? Run CROSSTABS for trust (Can people be trusted?) with race and see what you
find. Describe the relationship in the table. Be sure to use Chi Square and an appropriate measure of
association.
Is ideology a general characteristic or is it issue specific? That is, are people who are liberal (or
conservative) on one issue (such as capital punishment) also liberal (or conservative) on other issues
(such as gun control or legalizing marijuana)? Run CROSSTABS to see the tables. Describe the
relationship in the table. Be sure to use Chi Square and an appropriate measure of association.
Next Chapter
This chapter has focused on exploring the relationship between two nominal and/or ordinal variables.
In the next chapter we’ll look at describing the relationship between two interval and/or ratio variables.
Cross tabulation is a useful way of exploring the relationship between variables that contain only a few
categories. For example, we could compare how men and women feel about abortion. Here our
dependent variable consists of only two categories—approve or disapprove. But what if we wanted to
find out if the average age at birth of first child is younger for women than for men? Here our
dependent variable is a continuous variable consisting of many values. We could recode it so that it only
had a few categories (e.g., under 20, 20 to 24, 25 to 29, 30 to 34, 35 to 39, 40 and older), but that would
result in the loss of a lot of information. A better way to do this would be to compare the mean age at
birth of first child for men and women.
Open GSS18A and click on ANALYZE, point your mouse at COMPARE MEANS, and then click on MEANS.
We want to put age at birth of first child (agekdbrn) in the DEPENDENT LIST and the variable sex in the
INDEPENDENT LIST. Highlight agekdbrn in the list of variables on the left of your screen, and then click
on the arrow next to the DEPENDENT LIST box. Now click on the list of variables on the left and use the
scroll bar to find the variable sex. Click on it to highlight it and then click on the arrow next to the
INDEPENDENT LIST box. Your screen should look like Figure 6-1.
Figure 6-1
Click on OK and the OUTPUT Window should look like Figure 6-2. On the average, women are a little
less than 2 years younger than men at the birth of first child.
Figure 6-2
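What MEANS does here can be sketched as grouping the cases by the independent variable and averaging the dependent variable within each group. The cases below are invented for illustration (they are not GSS values), and group_means is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (sex, age at birth of first child) pairs; None marks missing.
cases = [
    ("male", 27), ("female", 24), ("male", 25), ("female", 22),
    ("female", None), ("male", 26), ("female", 26),
]


def group_means(pairs):
    """Return {group: (mean, number of valid cases)}, skipping missing values."""
    groups = defaultdict(list)
    for group, value in pairs:
        if value is not None:
            groups[group].append(value)
    return {g: (mean(vals), len(vals)) for g, vals in groups.items()}


result = group_means(cases)
for group, (avg, n) in result.items():
    print(group, avg, n)
```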
Independent-Samples T Test
If women are, on average, a little less than 2 years younger than men at birth of first child, can we
conclude that this is also true in our population? Can we make an inference about the population (all
adults in the U.S.) from our sample (about 2,300 people selected from the population)? To answer this
question, we need to do a t test. This will test the hypothesis that men and women in the population do
not differ in terms of their mean age at birth of first child. By the way, this is called a null hypothesis.
The particular version of the t test that we will be using is called the independent-samples t test since
our two samples are completely independent of each other. In other words, the selection of cases in
one of the samples does not influence the selection of cases in the other sample. We’ll look later at a
situation where this is not true.
We want to compare our sample of men with our sample of women and then use this information to
make an inference about the population. Click on ANALYZE, then point your mouse at COMPARE
MEANS and then click on INDEPENDENT-SAMPLES T TEST. Find agekdbrn in the list of variables on the
left and click on it to highlight it, then click on the arrow to the left of the TEST VARIABLE box. This is the
variable we want to test so it will go in the TEST VARIABLE box. Now click on the list of variables on the
left and use the scroll bar to find the variable sex. Click on it to highlight it and then click on the arrow to
the left of the GROUPING VARIABLE box. Sex defines the two groups we want to compare so it will go in
the GROUPING VARIABLE box. Your screen should look like Figure 6-3.
Figure 6-3
Now we want to define the groups so click on the DEFINE GROUPS button. This will open the DEFINE
GROUPS box. Since males are coded 1 and females 2, type “1” in the GROUP 1 box and “2” in the GROUP
2 box. (You will have to click in each box before typing the value.) This tells SPSS what the two groups
are that we want to compare.8 Now click on CONTINUE and on OK in the INDEPENDENT-SAMPLES T
TEST box. Your screen should look like Figure 6-4.
8
If you don’t know how males and females are coded, click on UTILITIES in the menu bar, then on VARIABLES and
scroll down until you find the variable sex and click on it. The box to the right will tell you the values for males and
females. Be sure to close this box.
Figure 6-4
This table shows you the mean age at birth of first child for men (25.62) and women (23.70), which is a
mean difference of 1.92. It also shows you the results of the two t tests. Remember that this tests the
null hypothesis that men and women have the same mean age at birth of first child in the population.
There are two versions of this test. One assumes that the populations of men and women have equal
variances (for agekdbrn), while the other doesn’t make any assumption about the variances of the
populations. The table also gives you the values for the degrees of freedom and the observed
significance level. The significance value is .000 for both versions of the t test. Actually, this means less
than .0005 since SPSS rounds to three decimal places. This significance value is the
probability that the t value would be this big or bigger simply by chance if the null hypothesis were true.
Since this probability is so small (less than five in 10,000), we will reject the null hypothesis and conclude
that there probably is a difference between men and women in terms of average age at birth of first
child in the population. Notice that this is a two-tailed significance value. If you wanted the one-tailed
significance value, just divide the two-tailed value in half (provided the difference is in the direction you predicted).
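The two versions of the t test differ only in how they estimate the standard error of the difference between means. The sketch below computes the t value and degrees of freedom for each version from made-up data. The function names are ours. It does not compute the significance value, because Python's standard library has no t distribution, so treat it as an illustration of the formulas rather than a replacement for the SPSS output.

```python
import math
from statistics import mean, variance


def pooled_t(a, b):
    """t test assuming equal population variances. Returns (t, df)."""
    n1, n2 = len(a), len(b)
    # Pooled variance: a weighted average of the two sample variances.
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def unequal_variance_t(a, b):
    """t test without the equal-variance assumption (Welch). Returns (t, df)."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a) / n1, variance(b) / n2
    t = (mean(a) - mean(b)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation; usually not a whole number.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical ages at birth of first child for two small groups.
men = [22, 24, 26, 28]
women = [21, 23, 25, 27]
t_equal, df_equal = pooled_t(men, women)
t_unequal, df_unequal = unequal_variance_t(men, women)
print(round(t_equal, 3), df_equal)
print(round(t_unequal, 3), round(df_unequal, 1))
```

When the two groups have the same size and the same variance, as in this made-up example, the two versions give the same t value and the same degrees of freedom.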
Let’s work another example. This time we will compare males and females in terms of average years of
school completed (educ). Click on ANALYZE, point your mouse at COMPARE MEANS, and click on
INDEPENDENT-SAMPLES T TEST. Click on RESET to get rid of the information you entered previously.
Move educ into the TEST VARIABLE box and sex into the GROUPING VARIABLE box. Click on DEFINE
GROUPS and define males and females as you did before. Click on CONTINUE and then on OK to get the
output window. Your screen should look like Figure 6-5.
Figure 6-5
There isn't much of a difference between men and women in terms of years of school completed. This
time we do not reject the null hypothesis since the observed significance level is greater than .05.
Paired-Samples T Test
We said we would look at an example where the samples are not independent. (SPSS calls these paired
samples. Sometimes they are called matched samples.) Let’s say we wanted to compare the
educational level of the respondent’s father and mother. Paeduc is the years of school completed by
the father and maeduc is years of school for the mother. Clearly our samples of fathers and mothers are
not independent of each other. If the respondent’s father is in one sample, then his or her mother will
be in the other sample. One sample determines the other sample. Another example of paired samples
is before and after measurements. We might have a person’s weight before they started to exercise and
their weight after exercising for two months. Since both measures are for the same person, we clearly
do not have independent samples. This requires a different type of t test for paired samples.
Click on ANALYZE, then point your mouse at COMPARE MEANS, and then click on PAIRED-SAMPLES T
TEST. Scroll down to maeduc in the list of variables on the left and click on it and click on the arrow to
the left of the PAIRED VARIABLES box to move it to variable 1 in the PAIRED VARIABLES box. Now click on
paeduc in the list of variables on the left and click on the arrow to the left of the PAIRED VARIABLES box
to move it to variable 2 in the PAIRED VARIABLES box. Your screen should look like Figure 6-6.
Figure 6-6
Click on OK and your screen should look like Figure 6-7. This table shows the mean years of school
completed by mothers (12.02) and by fathers (11.98), as well as the standard deviations. The t-value for
the paired samples t test is 0.546 and the 2-tailed significance value is 0.585. (You may have to scroll
down to see these values.) This is the probability of getting a t-value this large or larger just by chance if
the null hypothesis is true. Since this probability is more than .05, we do not reject the null hypothesis.
This tells us that there probably isn't a difference between respondents’ mothers and fathers in terms of years of school
completed in the population. Notice that if we were using a one-tailed test, then we would divide the
two-tailed significance value of .585 by 2 which would be .2925. For a one-tailed test, we would also not
reject the null hypothesis since the one-tailed significance value is more than .05.9
9
We’ve deleted the paired-samples correlation since we haven’t discussed correlation yet.
Figure 6-7
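The paired-samples t test works on the difference within each pair rather than on the two groups separately. Here is a minimal sketch with invented values for mother's and father's years of school. The paired_t function is a hypothetical helper, pairs with a missing value are dropped, and the significance value is not computed.

```python
import math
from statistics import mean, stdev


def paired_t(x, y):
    """Paired-samples t. Returns (t, df); pairs with a missing value are dropped."""
    diffs = [a - b for a, b in zip(x, y) if a is not None and b is not None]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)   # standard error of the mean difference
    return mean(diffs) / standard_error, n - 1


# Hypothetical years of school for the mother and father of each respondent.
mother = [12, 14, 10, 16, None]
father = [11, 12, 11, 13, 12]
t_value, df = paired_t(mother, father)
print(round(t_value, 3), df)
```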
In this chapter we have compared two groups (males and females). What if we wanted to compare
more than two groups? For example, we might want to see if age at birth of first child (agekdbrn) varies
by educational level. This time let’s use the respondent’s highest degree (degree) as our measure of
education. To do this we will use One-Way Analysis of Variance (often abbreviated ANOVA). Click on
ANALYZE, then point your mouse at COMPARE MEANS, and then click on MEANS. Click on RESET to get
rid of what is already in the box. Click on agekdbrn to highlight it and then move it to the DEPENDENT
LIST box by clicking on the arrow to the left of the box. Then scroll down the list of variables on the left
and find degree. Click on it to highlight it and move it to the INDEPENDENT LIST box by clicking on the
arrow to the left of this box. Your screen should look like Figure 6-8.
Figure 6-8
Click on the OPTIONS button and this will open the MEANS: OPTIONS box. Click in the box labeled
ANOVA TABLE AND ETA. This should put a check mark in the box indicating that you want SPSS to do a
One-Way Analysis of Variance. Your screen should look like Figure 6-9.
Figure 6-9
Click on CONTINUE and then on OK in the MEANS box and your screen should look like Figure 6-10.
Figure 6-10
In this example, the independent variable has five categories: less than high school, high school, junior
college, bachelor, and graduate. Figure 6-10 shows the mean age at birth of first child for each of these
groups and their standard deviations, as well as the Analysis of Variance table including the sum of
squares, degrees of freedom, mean squares, the F-value and the observed significance value. (You will
have to scroll down to see the Analysis of Variance table.) The significance value for this example is the
probability of getting an F-value of 86.898 or higher if the null hypothesis is true. Here the null hypothesis
is that the mean age at birth of first child is the same for all five population groups. In other words, the
mean age at birth of first child for all people with less than a high school degree is equal to the mean age
for all with a high school degree and all those with a junior college degree and all those with a bachelor’s
degree and all those with a graduate degree. Since this probability is so low (<.0005 or less than 5 out of
10,000), we would reject the null hypothesis and conclude that these population means are probably
not all the same.10
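The F-value in the Analysis of Variance table is the ratio of the between-groups mean square to the within-groups mean square. The sketch below computes it, along with eta squared, for three made-up groups. The data and the one_way_anova name are hypothetical, and the significance value is not computed.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within, eta_squared) for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Between groups: how far each group mean lies from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within groups: how far each case lies from its own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_value, df_between, df_within, eta_squared


# Three hypothetical groups (for example, three degree categories).
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
f_value, df_b, df_w, eta_sq = one_way_anova(groups)
print(f_value, df_b, df_w, round(eta_sq, 3))
```

A large F-value means the group means differ by much more than the spread of cases within the groups would lead us to expect.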
There is another procedure in SPSS that does One-Way Analysis of Variance and this is called ONE-WAY
ANOVA. This procedure allows you to use several multiple comparison procedures that can be used to
determine which groups have means that are significantly different.
1. Compute the mean age (age) of respondents who voted for Clinton or Trump (pres16) in 2016.
Which group had the youngest mean age and which had the oldest mean age?
2. Use the independent-samples t test to compare the mean family income (income16) of men and
women (the variable sex). Note that this variable isn’t actually interval in measurement but in this
exercise we’re going to treat it as interval and compute means. Which group had the highest mean
income? Was the difference statistically significant (i.e., was the significance value less than .05)?
3. Use the independent-samples t test to compare the mean age (age) of respondents who believe and
do not believe in life after death (postlife). Which group had the highest mean age? Was the
difference statistically significant (i.e., was the significance value less than .05)?
4. Use the paired samples t test to compare the mean socioeconomic status of mothers (masei10) and
fathers (pasei10). Which group had the highest mean socioeconomic status? Was the difference
statistically significant (i.e., was the significance value less than .05)?
5. Use One-Way Analysis of Variance to compare the mean years of school completed (educ) for
liberals, moderates, and conservatives (polviews). Which group had the most education and which
had the least education? Was the F-value statistically significant (i.e., was the significance value less
than .05)?
Next Chapter
This chapter has explored ways to compare the means of two or more groups and statistical tests to
determine if these means differ significantly. These procedures would be useful if your dependent
variable was continuous and your independent variable contained a few categories. Chapter 7 looks at
ways to explore the relationship between pairs of variables that are both continuous.
[10] We’ve deleted the measures of association table.
To illustrate these techniques, we’ll use the COUNTRIES file, derived from several sources and containing
data on the countries of the world. See Appendix B for a codebook with information on the variables
included in this file.
We’ll begin by considering the relationship between perceived lack of government corruption (in other
words, perceived honesty in government) and Internet freedom. Our hypothesis will be that in
countries where Internet freedom is high, people will have a greater sense that they can hold
government accountable (the technical term for this sense is “political efficacy”) and they will
tend to regard their system as less corrupt. We’ll also add a measure of political rights (rights to
“participate freely in the political process”) to the mix, primarily for consideration in Chapter 8.
Correlation
How close are the relationships among Internet freedom, political rights, and perceived honesty in
government? Open SPSS and then click on ANALYZE, CORRELATE, and finally on BIVARIATE. A dialog
box will appear on your screen. Click on honestgov and then click the arrow to move it into the box. Do
the same with ifreedom and polrights. The dialog box should look like Figure 7–1.
[11] To make using them more intuitive (and more consistent with honestgov) ifreedom, polrights, and civillib have
been recoded from the original sources so that the higher the number, the higher the level of perceived honesty in
government, Internet freedom, political rights, and civil liberties.
Figure 7-1
The most widely used bivariate test is the Pearson’s r correlation coefficient. It is intended to be used
when both variables are measured at either the interval or ratio level and each variable is normally
distributed. However, sometimes we violate these assumptions. If you do histograms of our three
variables (see Chapters 4 and 9), you will notice that none are actually normally distributed.
Furthermore, variables we are using could arguably be considered ordinal, not interval, measures. We’ll
use the Pearson’s r, but will need to proceed with caution. SPSS includes another correlation test,
Spearman’s rho, which is designed to analyze variables that are not normally distributed, or are ranked
(i.e., ordinal rather than interval). We will conduct both tests to see how much the results differ
depending on the test used—in other words, whether those who use Pearson’s r for these variables are
seriously off base.
In the dialog box, click on OPTIONS and, in the resulting box, on EXCLUDE CASES LISTWISE. The result
should look like Figure 7–2. The reason for doing this is that ifreedom is based on many fewer cases
than the other two variables, and we want to be able to make “apples to apples” comparisons.
Figure 7-2
Click on CONTINUE and this will take you back to Figure 7-1. The box next to PEARSON is already
checked, as this is the default. Click in the box next to SPEARMAN. Click the button next to ONE-TAILED
TEST OF SIGNIFICANCE. (This is because we will be testing “directional” hypotheses, that is, not just the
idea that two variables are related but, for example, that the higher the value of the ifreedom index, the
higher the value of the honestgov index.) Therefore, we would expect the correlation to be positive.
Your dialog box should now look like the one in Figure 7–3.
Figure 7-3
Click OK to run the tests. Your output screen will show two tables (called matrices): one for Pearson’s r
and one for Spearman’s rho. The Pearson’s correlation matrix should look like the one in Figure 7–4.
The cells of the table show the Pearson’s r correlation between each variable and each other variable,
the level of statistical significance of the relationship (that is, the likelihood that it could have occurred
by chance), and the number of cases on which the correlation is based.
Figure 7-4
The correlation coefficient may range from -1 to 1, where -1 or 1 indicates a “perfect” relationship. The
further the coefficient is from 0, regardless of whether it is positive or negative, the stronger the
relationship between the two variables. Thus, a coefficient of .467 is exactly as strong as a coefficient of
-.467. Positive coefficients tell us there is a positive relationship: when one variable increases, the other
increases. Negative coefficients tell us that there is an inverse relationship: when one variable increases,
the other decreases. Notice that the Pearson’s r for the relationship between Internet freedom and
perceived honesty in government is .467. This tells us that, just as we predicted, as Internet freedom
increases, perceived honesty in government increases as well. But should we consider the relationship
strong? We’ll revisit this question later in the chapter.
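The properties just described (the range of the coefficient, and the fact that sign and strength are separate things) can be checked with a few lines of code. What follows is only a minimal Python sketch, not SPSS output: the function name pearson_r and the two short lists of scores are made up for illustration and are not taken from the COUNTRIES file.

```python
import math

def pearson_r(x, y):
    """Pearson's r for two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the two means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores, invented for this example
freedom = [20, 35, 50, 65, 80]
honesty = [30, 25, 45, 40, 60]

r_pos = pearson_r(freedom, honesty)
# Reversing the sign of one variable flips the sign of r but not its strength
r_neg = pearson_r(freedom, [-v for v in honesty])

print(round(r_pos, 3), round(r_neg, 3))  # 0.866 -0.866
```

The two coefficients are equally far from 0, so they describe relationships of equal strength; only the direction differs.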
The correlation matrix also gives the probability that the relationship we have found could have
occurred just by chance (labeled Sig. [1-tailed]). The probability value is .001, which is well below
the conventional threshold of p < .05. Thus, our hypothesis is supported. There is a relationship (the
coefficient is not 0), it is in the predicted direction (positive), and is statistically significant.
Recall that we had some concerns about using the Pearson’s r coefficient. Figure 7–5 shows the results
using Spearman’s rho. Notice that the coefficient for the relationship between ifreedom and honestgov
is .414, or about the same as the value of Pearson’s r for this relationship. Similarly, the other values of
Spearman’s rho are similar to those for Pearson’s r. This is reassuring.
Figure 7-5
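Spearman’s rho is commonly defined as Pearson’s r computed on the ranks of the scores rather than on the scores themselves, which is why it is less sensitive to skewed distributions and extreme values. The Python sketch below illustrates that idea under stated assumptions: the helper names (ranks, pearson_r, spearman_rho) and the data are hypothetical, and we make no claim about how SPSS carries out the calculation internally.

```python
import math

def ranks(values):
    """Rank values from 1 (lowest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson_r(x, y):
    """Pearson's r for two equal-length lists of numbers."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical data with one extreme value on x
x = [1, 2, 3, 4, 100]
y = [2, 4, 5, 7, 8]

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(round(r, 3), round(rho, 3))
```

In this made-up example the two variables always rise together, so rho is 1, while the single extreme value pulls Pearson’s r well below 1. When the two coefficients are close, as they are for our three variables, that is a sign that the violations of Pearson’s assumptions are not distorting the results much.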
Regression
Let’s look more closely at the relationship between ifreedom and honestgov graphically by creating a
scatterplot. Click on GRAPHS, CHART BUILDER. This will open up the dialog box shown in Figure 7-6. (If
you get a message telling you to be sure that the measurement levels of each variable have been set
properly, click on OK, since this has already been done for you for the COUNTRIES file.)
Figure 7-6
Next, in the CHOOSE FROM list at the lower left, click on SCATTER/DOT. Then, shift your attention to the
sample graph patterns in the lower part of the window, and click on the first one (upper left). When
your mouse hovers over it, it will say SIMPLE SCATTER. Holding down the mouse button, drag the
sample chart to the large chart preview window in the upper part of the window. Your screen should
look like Figure 7-7.
Figure 7-7
Now, add variables to the chart preview window. From the list of variables, click on ifreedom and drag it
to the box located on the horizontal (X) axis (because it is the independent variable in our hypothesis
and the independent variable belongs on the horizontal axis). Next, click on honestgov and drag it into
the box located on the vertical (Y) axis. Finally, add data labels to make it easier to read. From the
menu in the middle of the CHART BUILDER, click on GROUPS/POINT ID, select POINT ID LABEL and, from
the list of variables, click on name and drag it to the box on the chart called POINT LABEL VARIABLE?
(Note: POINT ID LABELS aren’t a good idea if you have a large number of cases but will work well here.)
Your dialog box should now look like the one in Figure 7–8.
Figure 7-8
Now, click OK. What you see is a plot of perceived honesty in government for each country included in
the chart by each country’s level of Internet freedom. Your scatterplot should look like the one in Figure
7–9.
Figure 7-9
You can edit your graph to make it easier to interpret. First, double-click anywhere in the graph. This
will cause the graph to open in its own window. On the menu bar, click on ELEMENTS, then FIT LINE AT
TOTAL. You will get a dialog box that looks like the one in Figure 7–10.
Figure 7-10
In the FIT LINE section, click on LINEAR (it is the default) and then click on APPLY and close the box. (If
the Apply button is not active, select a different Fit Method, then change back to Linear before clicking
on Apply. If your graph doesn’t show country names, click on ELEMENTS again, then on SHOW DATA
LABELS.) Your graph now looks like the one in Figure 7–11.
Figure 7-11
Notice the line variously known as the “least squares line,” the “line of best fit,” or the “regression
line” (we’ll go with the last of these) that is now drawn on the graph. Regression and correlation
analyze linear relationships between variables, finding the regression line that best fits the data (that is,
keeps the sum of the squared vertical distances of the points from the line to a minimum). Also notice the
formula (y=21.16+0.35*x), called the “regression equation,” superimposed on the line, and the R-square
Linear statistic (.218) to the right of the graph. We’ll return a bit later to the regression equation and
the R-square linear statistic (usually just called “r²”).
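To see what “best fit” means in practice, here is a minimal Python sketch of the usual least-squares formulas for a line with one independent variable. The function names and the five pairs of scores are hypothetical (they are not from the COUNTRIES file), and the sketch is meant only to illustrate the idea, not to reproduce what SPSS does behind the scenes.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_errors(x, y, intercept, slope):
    """Sum of squared vertical distances between the points and a line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Hypothetical scores, invented for this example
x = [20, 35, 50, 65, 80]
y = [30, 25, 45, 40, 60]

a, b = least_squares_line(x, y)
best = sum_squared_errors(x, y, a, b)

# Any other line, for example one with a slightly different slope or
# intercept, produces a larger sum of squared errors
worse_slope = sum_squared_errors(x, y, a, b + 0.1)
worse_intercept = sum_squared_errors(x, y, a + 1, b)

print(a, b, best, worse_slope, worse_intercept)
```

For these made-up scores the fitted line is y = 15 + 0.5*x, and nudging either number away from those values increases the total squared error.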
In general, countries to the right on the graph (that is, those that have freer Internet access) tend also to
be higher on the graph (that is, have more perceived honesty in government). This is just what we
hypothesized. We can now do some “deviant case analysis.” Countries that appear above the
regression line are those with more perceived honesty in government than we would expect given their
level of Internet freedom, while those below the line have less.
Some countries are pretty much where we’d expect (in that they are close to the line), while some
others are well above or below. Can you think of any other factors that might explain the “deviant”
cases? We’ll return to this question in Chapter 8.
Multiplied by 100, r² tells us the percentage of the variation in the dependent variable (honestgov, on
the Y-axis) that is explained by the scores on the independent variable (ifreedom, on the X-axis). Thus,
Internet freedom explains 21.8% of the variation in perceived honesty in government. Recall that the
Pearson’s r coefficient was .467. If you take the square root of .218, you get .467, the same as the value
of r. (If the relationship were negative, you’d take the negative square root.) Though the r statistic is
the one most commonly reported, r² is extremely useful, since it tells us the “proportional reduction in
error” we achieve in “predicting” the value of the dependent variable by knowing that of the
independent variable.
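The “proportional reduction in error” idea can be made concrete with a short calculation. The Python sketch below uses a handful of hypothetical scores (not the COUNTRIES data) to compare the error we make when we guess the mean of Y for every case with the error we make when we use the regression line. The last line simply checks the square root reported in the text.

```python
import math

# Hypothetical scores, invented for illustration
x = [20, 35, 50, 65, 80]   # independent variable
y = [30, 25, 45, 40, 60]   # dependent variable

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Least-squares slope and intercept
slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
         / sum((a - mean_x) ** 2 for a in x))
intercept = mean_y - slope * mean_x

# Error if we guess the mean of Y for every case (total sum of squares)
error_without_x = sum((b - mean_y) ** 2 for b in y)
# Error if we use the regression line instead (residual sum of squares)
error_with_x = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Proportional reduction in error
r_squared = (error_without_x - error_with_x) / error_without_x
print(round(r_squared, 3))          # 0.75 for these made-up scores

# The value reported in the text: r is the square root of r-squared
print(round(math.sqrt(0.218), 3))   # 0.467
```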
How strong a relationship is this? There’s no firm answer to this question. One scholar (Karl Deutsch)
once suggested that, if you can explain at least 10% of the variance of a variable, you have something
worth talking about. If your r² exceeds .5 (that is, it explains over 50% of variance), then your knowledge
exceeds your ignorance! We would probably consider anything between an r² of .1 and .5 (or an r
between about ±.3 and ±.7) to be a moderately strong relationship.
Doing a regression analysis can help us to understand the regression line in more detail. Close the SPSS
CHART EDITOR. Click on ANALYZE, REGRESSION, and LINEAR. This opens up the dialog box shown in
Figure 7-12.
Figure 7-12
Move honestgov to the Dependent box, and ifreedom to the Independent(s) box. Click OK. The results
should look like those shown in Figure 7-13.
Figure 7-13
The first table (which we have not displayed here) just shows the variables that have been included in
the analysis. The second table, “Model Summary,” shows the R-square statistic, which is .218. (Where
have you seen this before? What does it mean?) (Note: the “Adjusted R Square,” .200, is slightly lower
because it takes into account the number of independent variables in the equation.) The third table,
ANOVA, gives you information about the model as a whole. ANOVA is discussed briefly in Chapter 6.
Note that if you take the Regression Sum of Squares (the variance explained by the relationship) and
divide by the Total Sum of Squares, the result is equal to R². The final table, Coefficients, gives the
results of the regression analysis that are not available using only correlation techniques. Look at the
“Unstandardized Coefficients” column. SPSS provides two versions of the regression equation; both are
important.
The unstandardized equation shows the relationship between the dependent variable and the
independent variable using the original units of measurement. For our purposes, we’ll skip over the Std
(standard) Error.
There are two statistics reported under B, one for the “(Constant),” the other for the “ifreedom Internet
Freedom Index”. The first number (21.164) is the Y-intercept, that is, the value of the dependent
variable when the independent variable is equal to zero. The second number (.348) is the regression
coefficient, which is the slope of the line that you saw on the scatterplot. A more standard way to
present this information (which you may have learned in an introductory statistics course) is Ŷ = 21.164
+ .348X. This indicates the predicted value of Y for a given value of X. In other words, for an increase of
1 unit on the ifreedom scale, we would, all else being equal, predict an increase of .348 units on the
honestgov scale.
We know that the linear relationship between X and Y (ifreedom and honestgov) is not perfect. The
correlation coefficient was not 1 (or –1), and the scatterplot showed plenty of cases that did not fall
directly on the line. Thus, it is clear to us that knowing a country’s level of Internet freedom will not tell
us without fail its level of perceived honesty in government. It is clear that there is some error built into
our findings.
What can we do with this formula? One thing we can do is make predictions about particular values of
the dependent variable, using just a little arithmetic. All we have to do is plug the values from our
output into the formula for a line. Plugging the numbers from Figure 7–13 into the formula for a straight
line, we obtain Ŷ=21.164+.348*X, the same equation we saw earlier in Figure 7–11, except that, here,
numbers have been carried out to three decimal places. We can then plug in the value of X (ifreedom)
for any given country, multiply by .348, and add that to 21.164. The result will be the predicted value of
the honestgov variable for that country.
For example, looking at the file in DATA VIEW mode (see Chapter 1), we see that South Africa, the
United Kingdom, and Ukraine all have similar ifreedom scores (74, 75, and 73 respectively). Plugging
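these values into the equation we obtain:
South Africa: Ŷ = 21.164 + .348*74 = 46.916
United Kingdom: Ŷ = 21.164 + .348*75 = 47.264
Ukraine: Ŷ = 21.164 + .348*73 = 46.568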
These numbers represent the predicted values of honestgov for these three countries, that is, what the
values would be if all three countries fell right on the regression line. In other words, we would predict
that, since all three countries have similar ifreedom scores, they will also have similar honestgov scores.
Going back to DATA VIEW, however, we see that the actual scores are 43, 74, and 26 respectively. If we
subtract the predicted scores from the actual scores (Y-Ŷ), we obtain the “residual,” which is a measure
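of the error in our prediction for a given case. In this example, the residuals are:
South Africa: 43 - 46.916 = -3.916
United Kingdom: 74 - 47.264 = 26.736
Ukraine: 26 - 46.568 = -20.568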
In other words, as can be seen in Figure 7-11 above, perceived honesty in government in South Africa is
about what we would expect, whereas it is much higher than predicted in the United Kingdom, and
much lower than predicted in Ukraine.
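If you would rather not do this arithmetic by hand, it can be reproduced with a few lines of code. The Python sketch below uses only numbers given in this chapter: the intercept and slope from Figure 7–13, and the ifreedom and honestgov scores for the three countries. The names predict, countries, and results are our own and are not part of SPSS.

```python
# Coefficients from the Coefficients table (Figure 7-13)
INTERCEPT = 21.164
SLOPE = 0.348

# (ifreedom, actual honestgov) for each country, as given in the text
countries = {
    "South Africa": (74, 43),
    "United Kingdom": (75, 74),
    "Ukraine": (73, 26),
}

def predict(ifreedom):
    """Predicted honestgov for a given ifreedom score."""
    return INTERCEPT + SLOPE * ifreedom

results = {}
for name, (ifreedom, actual) in countries.items():
    predicted = predict(ifreedom)
    residual = actual - predicted   # actual minus predicted (Y - Y-hat)
    results[name] = (round(predicted, 3), round(residual, 3))
    print(f"{name}: predicted {predicted:.3f}, residual {residual:+.3f}")
```

A residual near 0 means the country sits close to the regression line. A large positive residual means it sits well above the line, and a large negative residual means it sits well below.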
We won’t go into it here, but you can, for all cases, add the predicted values of the dependent variable
and the residuals as additional variables in the data file. To do this, click on SAVE in the regression dialog
box, and select UNSTANDARDIZED PREDICTED VALUES and UNSTANDARDIZED RESIDUALS.
The standardized equation shows the relationship between the dependent variable and the
independent variable in which variables have been converted to standardized scores with means of 0
and standard deviations of 1. This allows us to compare the relative importance of different
independent variables that have been measured using different units of measurement. We’ll return to this in
the next chapter.
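To illustrate what standardizing does, the Python sketch below fits a line to some hypothetical scores twice: once in the original units and once after converting both variables to standard scores. The helper names and the data are invented for this example. With a single independent variable, the standardized slope equals Pearson’s r, and the constant is 0.

```python
import statistics

def z_scores(values):
    """Convert values to standard scores (mean 0, standard deviation 1)."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]

def least_squares_line(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical scores, invented for this example
x = [20, 35, 50, 65, 80]
y = [30, 25, 45, 40, 60]

# Unstandardized equation (original units)
a, b = least_squares_line(x, y)

# Standardized slope obtained by rescaling the unstandardized slope
beta_from_b = b * statistics.stdev(x) / statistics.stdev(y)

# Standardized equation obtained by refitting on standard scores
a_std, b_std = least_squares_line(z_scores(x), z_scores(y))

print(round(b, 3), round(beta_from_b, 3), round(b_std, 3), round(a_std, 3))
```

Both routes to the standardized slope give the same answer, which is why a standardized coefficient can be compared across variables measured in different units.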
1. Can you think of any other variables included in the codebook in Appendix B that might help explain
levels of perceived government honesty among countries? Repeat the analysis presented in this
chapter, but substitute your variable for ifreedom.
2. Pick another variable from the codebook (for example, adult obesity rate). Pick another variable
that you think might help explain why some countries have a much higher rate than others. Repeat
the analysis presented in this chapter, but substitute your variables for ifreedom and honestgov.
3. The variables in the General Social Survey are mostly nominal or ordinal, but there are some
exceptions. In this exercise, we’ll use the data set GSS18A and work with two of these variables,
the number of hours per day a respondent reports watching television (tvhours), and the
respondent’s age (age).
a. It is likely that people of different ages watch different amounts of television. How do you think
these may be related? Write a hypothesis that predicts the direction of the relationship between
age and tvhours.
b. Do a Pearson correlation to test your hypothesis. Was your hypothesis supported? Explain.
Remember that whether or not your hypothesis is supported depends on three things: whether or
not the coefficient was 0, whether your prediction of the hypothesized direction of the relationship
(+ or -) was correct, and the significance (the probability that you will be wrong if you generalize
your finding to the population from which the sample was drawn). Be sure to discuss all three in
your explanation.
c. Discuss the strength of the relationship between age and tvhours. Then, speculate about a
second factor that might also influence the amount of television that people watch.
d. How much of the variance in tvhours is explained by age? Tell how you found out.
e. Do a regression analysis of the relationship between age and tvhours. Be sure to place your
variables into their proper boxes (in other words, correctly identify the independent and dependent
variable). If you were writing a scholarly report, how would you describe the relationship between
age and tvhours based on your results? (Note: If it is small, SPSS may have expressed your
regression coefficient in scientific notation in order to save space. If you see something like 2.035E-2
on your SPSS output, that is scientific notation. The E-2 is telling you to move the decimal point two
places to the left. Thus, 2.035E-2 becomes .02035. If you don’t want to move the decimal yourself,
click rapidly several times on the coefficient in the output screen and SPSS will show you the actual
value of the coefficient.)
f. Do the results of the regression analysis suggest that your hypothesis is supported? Be sure to
discuss the magnitude of the regression coefficient, the direction (+ or -), and the probability.
g. How many hours of television does your model predict that people aged 21 tend to watch each
day? People aged 42? Show how you calculated these predicted scores.
4. Repeat exercise 3, but this time use income16 as the dependent variable, and educ as the
independent variable.
Next Chapter
In this chapter we focused on exploring the relationship between two interval or ratio variables.
The next chapter will focus on describing relationships among sets of three variables.
Crosstabs Revisited
Recall from Chapter 5 that the crosstabs procedure is used when variables are nominal (or ordinal).
Simple crosstabs, which examine the influence of one variable on another, should be only the first step
in the analysis of social science data. We might begin this first step by hypothesizing that women are
more strongly religious than men, and that African Americans and Hispanics are more strongly religious
than Anglos.
The 2018 General Social Survey provides data that we can use to test these hypotheses. The measure of
sex (or gender) is relatively straightforward. A variable we can use to measure religiosity (reliten) was
obtained by asking respondents about the strength of their religious affiliation (“strong,” “somewhat
strong,” “not very strong,” or “no religion”). Finally, the variable ethnicity was created by combining a
question asking respondents to identify their race with one asking whether the respondent was Hispanic
(which can be of any race). This yields four categories: White (the term used by the U.S. Census in 2010
was “non-Hispanic whites”), Black (“non-Hispanic blacks”), Hispanic, and Other (“non-Hispanic other”).
Open GSS18A and select all respondents except “Other”[12] for analysis. (Review the procedures described
in Chapter 3 for selecting cases.)[13]
Following the instructions in Chapter 5, crosstabulate reliten with the variable sex and with ethnicity,
selecting column percentages for the cells. You’ll obtain the results shown in Figures 8–1 and 8–2.[14]
(We’ve left out the Case Processing Summary.)
[12] Because there are relatively few cases in this category, and because it combines people who may have little in
common in terms of their ethnicity, we are not including them in this analysis.
[13] It’s important to weight the cases so they better represent the population from which the sample is selected.
Our data set – GSS18A – has already been weighted so you don’t need to weight it again.
[14] Note that, since we have elected to exclude “Other” respondents, they will be excluded from all tables, including
those crosstabulating reliten and sex. This has the advantage of basing all tables on the same respondents, but at
the price of eliminating some we might have wanted to include in our comparison of males and females.
Figure 8-1
Figure 8-2
As the results show, women are more likely than men to report a strong or somewhat strong religious
affiliation, and are less likely to report that they have no religious affiliation. Black respondents are
more likely to report having a strong religious affiliation than other groups. (In the interest of
conserving space, we haven’t carried out measures of association or statistical significance, but you may
wish to do so yourself.)
This one-step method of hypothesis testing is, however, very limited. It does not, for example, tell us
whether Black men differ from Black women in religious intensity, or whether there are differences in this
regard between White men and White women or between Hispanic men and Hispanic women.
To answer these questions, we will do a multivariate cross tabulation, also called an elaboration analysis.
Recall that your original crosstabs procedure produces one contingency table, with as many rows as
there are categories (or values) of the dependent variable, and as many columns as there are categories
of the independent variable. When you start using control (sometimes called test) variables, you will get
as many separate tables as there are categories of the control variable. There are three categories of
the ethnicity variable; thus, we should expect to get three contingency tables, each one showing the
relationship between sex and reliten for Whites, for Blacks, and for Hispanics.
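The logic of producing one table per category of the control variable can be sketched in code. The Python example below is only an illustration: the respondents are invented (they are not GSS18A cases), the dependent variable is collapsed to two categories to keep the example short, and the function name partial_tables is our own. It computes column percentages separately within each category of the control variable, which is what the layered crosstab does.

```python
from collections import Counter, defaultdict

# Hypothetical respondents, invented for illustration:
# (control variable, independent variable, dependent variable)
respondents = (
    [("White", "Male", "Strong")] * 1 + [("White", "Male", "Not strong")] * 3
    + [("White", "Female", "Strong")] * 2 + [("White", "Female", "Not strong")] * 2
    + [("Black", "Male", "Strong")] * 2 + [("Black", "Male", "Not strong")] * 2
    + [("Black", "Female", "Strong")] * 3 + [("Black", "Female", "Not strong")] * 1
    + [("Hispanic", "Male", "Strong")] * 1 + [("Hispanic", "Male", "Not strong")] * 1
    + [("Hispanic", "Female", "Strong")] * 3 + [("Hispanic", "Female", "Not strong")] * 1
)

def partial_tables(records):
    """Return {control: {independent: {dependent: column percentage}}}."""
    counts = defaultdict(lambda: defaultdict(Counter))
    for control, independent, dependent in records:
        counts[control][independent][dependent] += 1
    tables = {}
    for control, columns in counts.items():
        tables[control] = {}
        for independent, cell_counts in columns.items():
            column_total = sum(cell_counts.values())
            # Column percentages: each column sums to 100 within its table
            tables[control][independent] = {
                dependent: 100 * n / column_total
                for dependent, n in cell_counts.items()
            }
    return tables

tables = partial_tables(respondents)
for control, table in tables.items():
    print(control, table)
```

Three categories of the control variable yield three partial tables, each with its own set of column percentages.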
Open up the crosstabulation dialog box you used for Figures 8–1 and 8–2, but this time adding ethnicity
in the third box on the right under “Layer 1 of 1.” To make the table more compact, click on CELLS and
unselect “Count.” The dialog box should now look like Figure 8–3.
Figure 8-3
Click OK and your results should look like the table shown in Figure 8-4.
Figure 8-4
Notice that the relationship between reliten and sex is roughly the same within each ethnic group.
Try other variables as a control (i.e., in place of ethnicity) to see what happens. As a general rule, here is
how to interpret what you find from this elaboration analysis:
If the relationship between the independent and dependent variables shown in the partial
tables is similar to that shown in the zero-order (original bivariate) table, you have replicated
your original findings, which means that in spite of the introduction of a particular control
variable, the original relationship persists. This is indeed the case here: the differences between
men and women shown in the partial tables of Figure 8–4 are similar to those shown in Figure
8–1.
If the differences shown in all the partial tables (the separate tables for each category of the
control variable) are significantly smaller than those found in the original AND IF your control
variable is antecedent (occurs prior in time) to both the other variables, you have found a
spurious relationship and explained away the original. In other words, the original relationship
was due to the influence of that control variable, not the one you first hypothesized.
If the differences you see in the partial tables are less than you saw in the original table AND IF
your control variable is intervening (that is, the control variable occurs in time after the original
independent variable), you have interpreted the relationship. If the time sequence between the
independent and control variable is not determinable (or otherwise unclear), then you don't
know whether you have explanation or interpretation, but you do know that the control
variable is important.
If one or more of the differences shown in the partial tables is stronger than in the original and
one or more is weaker, you have discovered the conditions under which the original relationship
is strongest. This is referred to as specification or the interaction effect.
It’s unlikely that your tables will fit neatly into one and only one of these types. It’s more likely your
tables will approximate them.
Multiple Regression
Another statistical technique for estimating the effects of two or more independent variables on a
dependent variable is multiple regression analysis. This technique is appropriate when your variables
are measured at the interval or ratio level, although researchers sometimes use multiple regression with
ordinal variables as well. Multiple regression also assumes that there is a linear relationship between
each independent variable and the dependent variable, and that the distribution of values in your
variables follows a normal distribution.
Recall from Chapter 7 that we investigated the impact that Internet freedom had on perceived honesty
in government and found evidence consistent with our hypothesis that high levels of Internet freedom
seem to increase people’s sense that they can hold government accountable, thus leading to
perceptions of greater honesty in government. It may be, however, that holding government
accountable requires more than the ability to publicize corrupt activities, but also requires the ability to
exercise political rights, such as the right to vote in contested elections. In recent years, for example,
protesters in some countries have used the Internet to help bring down corrupt regimes, but the
absence of effective means to participate in ordinary political institutions has sometimes led to the
emergence of new leaders as corrupt as those they replaced.
To test this, open the COUNTRIES file and add the variable polrights to the regression equation we ran in
Chapter 7. From the menu, click ANALYZE, REGRESSION, LINEAR. Click on honestgov and move it into
the DEPENDENT box at the top of the dialog box. Click on ifreedom and polrights and move them into
the INDEPENDENT(S) box. The dialog box should look like the one shown in Figure 8-5.
Figure 8-5
Click on OK and your results should look like those shown in Figure 8-6.
Figure 8-6
Looking first at the Model Summary table, you will see that the adjusted R-squared value is .334. This
means that 33.4% of the variation in the dependent variable (perceived honesty in government) is
explained by knowing a country’s level of Internet freedom and political rights. The ANOVA table shows
that the overall model is highly statistically significant. Next, we need to look at the Coefficients table. If
you look at the B coefficient for ifreedom, you will see that it is -.076. How do we interpret this
coefficient? Recall the discussion in Chapter 7: a one-unit change in the independent variable (ifreedom)
is associated with a change in the dependent variable (honestgov) equal to the value of B. So, if we
increase the value of ifreedom by 1, on average, we get a change of -.076 units in perceived honesty in
government. Since the higher the level of the Internet freedom variable, the lower the level of
perceived honesty in government, the results are actually in the opposite direction than we had
hypothesized. However, the regression coefficient is not statistically significant so we cannot conclude
that ifreedom is related to honestgov. On the other hand, the value of B for polrights is 5.110, meaning
that an increase of one unit on the political rights index is associated with an increase of 5.11 points on
perceived honesty in government. The result is in the hypothesized direction.
However, one problem with interpreting the B coefficients is that the units of measurement we are
using are quite different for different variables. Internet freedom and lack of perceived corruption are
measured on scales of 0 to 100, whereas political rights are measured on a scale of 1 to 7. We’re
comparing apples to oranges.
To address this problem, look at the standardized (Beta) coefficients, which we’ve ignored to this point.
Beta coefficients in effect convert all variables to standard scores (with means of 0 and standard
deviations of 1). The Beta coefficient for polrights (.685) has an absolute value almost seven times as
large as that for ifreedom (.101). In other words, when each independent variable is controlled for the
other, an increase of one standard deviation in polrights has an impact on perceived honesty in government
that is much greater than that of the same increase in the ifreedom measure. Finally, note that polrights is highly
statistically significant (p = .003) while ifreedom is not at all statistically significant (p = .644).
If we convert the information in the Coefficients table to standard algebraic notation, we get, for the
unstandardized equation:
Ŷ=24.688-.076*X1+5.110*X2 where
X1=ifreedom and
X2=polrights.
For the standardized equation, we get:
Ŷ=-.101*X1+.685*X2.
The reason why the constant has dropped out of this equation is that, with variables converted to
standard scores, it is equal to zero by definition.
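The relationship between the two equations can be demonstrated with a small calculation. The Python sketch below fits a regression with two independent variables using the standard least-squares formulas, converts each B to a Beta, and then confirms the result by refitting on standard scores. Everything here is an assumption made for illustration: the data are invented (one predictor on a roughly 0 to 100 scale, the other on a 1 to 7 scale), the function names are our own, and the numbers will not match Figure 8–6.

```python
import statistics

def cross_dev(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = statistics.mean(u), statistics.mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def fit_two_predictors(x1, x2, y):
    """Least-squares constant and B coefficients for y on x1 and x2."""
    s11, s22, s12 = cross_dev(x1, x1), cross_dev(x2, x2), cross_dev(x1, x2)
    s1y, s2y = cross_dev(x1, y), cross_dev(x2, y)
    det = s11 * s22 - s12 ** 2   # zero only if x1 and x2 are perfectly collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    constant = (statistics.mean(y) - b1 * statistics.mean(x1)
                - b2 * statistics.mean(x2))
    return constant, b1, b2

def z_scores(values):
    """Convert values to standard scores (mean 0, standard deviation 1)."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical data, invented for this example
x1 = [20, 40, 60, 80, 50, 70]    # measured on a 0-100 scale
x2 = [1, 3, 2, 6, 5, 7]          # measured on a 1-7 scale
y = [25, 35, 30, 60, 50, 65]

# Unstandardized equation
constant, b1, b2 = fit_two_predictors(x1, x2, y)

# Beta = B rescaled by the ratio of standard deviations
beta1 = b1 * statistics.stdev(x1) / statistics.stdev(y)
beta2 = b2 * statistics.stdev(x2) / statistics.stdev(y)

# Refit on standard scores: the slopes are the Betas and the constant is 0
z_constant, z_b1, z_b2 = fit_two_predictors(z_scores(x1), z_scores(x2), z_scores(y))

print(round(beta1, 3), round(z_b1, 3))
print(round(beta2, 3), round(z_b2, 3))
print(round(z_constant, 3))
```

Because the Betas are expressed in standard deviation units, they can be compared directly even though the two predictors are measured on very different scales.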
Finally, note that the model as a whole only explains about a third of the variance among countries in
perceived corruption. Does the dataset include any other variables that you think might explain some of
the rest? Add these variables to the equation and see if they help.
Repeat the crosstabs we ran earlier in this chapter, but this time use race as the independent variable
and sex as the control variable.
How would you hypothesize the relationship between fear (Afraid to walk at night in neighborhood) and
sex?
a. Write out your hypothesis.
b. Run a crosstab to test your hypothesis and report your results.
c. Now, do a second crosstab, this time controlling for class. Report your results.
d. Now run fear and sex but control for trust. Report your results.
Choose three independent variables from the General Social Survey subset that you think influence the
number of hours people watch television (tvhours, the dependent variable).
a. Write up your hypotheses (how and why each independent variable is associated with the
dependent variable).
b. Run a multivariate regression to test your hypotheses and report your results.
Using the unstandardized regression equation for predicting honestgov based on ifreedom and polrights,
calculate the residuals for South Africa, the United Kingdom, and Ukraine. You can either do this
manually or, when running the regression analysis, click on SAVE and save the unstandardized residuals
as an additional variable, then go to DATA VIEW to find the values of this new variable (which SPSS will
call “RES_1”) for these countries. (Note that this new variable is calculated only for those countries for
which there is no missing data for any of the variables in the equation.) Are the residuals for the United
Kingdom and Ukraine less than those we calculated in Chapter 7? Are there other variables that, if
added to the equation, might reduce them further?
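If you choose the manual route, the arithmetic for a residual can be sketched as follows. This is only an illustration: the constant, coefficients, and scores below are placeholders, not values from the COUNTRIES file, and the function name predict is our own. Substitute the numbers from your own regression output and from DATA VIEW.

```python
def predict(constant, coefficients, values):
    """Predicted score: the constant plus each coefficient times its variable's value."""
    return constant + sum(b * x for b, x in zip(coefficients, values))

# Placeholder numbers only; replace them with your own output.
constant = 10.0
coefficients = (0.5, 2.0)   # b1 and b2 from the unstandardized equation
values = (20.0, 3.0)        # one country's scores on X1 and X2
observed = 30.0             # that country's actual score on the dependent variable

predicted = predict(constant, coefficients, values)
residual = observed - predicted   # positive means the model under-predicts

print(predicted, residual)
```

A residual is simply the observed score minus the predicted score, so a country that sits exactly on the regression plane has a residual of zero.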
From Appendix B select three variables that you think might help explain inequality of income
distribution (inequality). Using the COUNTRIES file, run a multiple regression analysis. Which of the
three independent variables is the best predictor of inequality? How much of the variance among
countries in inequality is explained by the model as a whole?
Next Chapter
The next chapter will discuss creating and editing graphs and tables. We’ll look at some graphs
and tables discussed in previous chapters and explore some new ones.
Charts
Some of the statistical procedures in SPSS provide optional graphic as well as tabular output. In Chapter
4, we saw that FREQUENCIES can be used to produce bar charts, pie charts, and histograms, and that the
Explore procedure can yield boxplots. In addition, the CROSSTABS procedure can display clustered bar
charts.
Producing charts as a byproduct of these procedures has some limitations. The variety of charts
available is quite limited, and those that are provided give users limited control over the output.
Fortunately, SPSS also provides a chart builder that is much more powerful when it comes to graphics.
In Chapter 7, we saw one example when we used the chart builder to produce a scatterplot. In this
chapter, we’ll explain how the chart builder works in general, and then provide three examples in
addition to the scatterplots already discussed: bar charts, boxplots (also known as “box and whiskers”
plots), and line charts.
General
We'll use the COUNTRIES file. Open the data file and click on GRAPHS, then CHART BUILDER, and then
on OK. The Chart Builder box is shown in Figure 9-1.
Figure 9-1
First, you might want to familiarize yourself with the items contained in the CHART BUILDER. Making
sure that the GALLERY tab is active (this is the default), click on the various choices of chart types (i.e.,
Bar, Line, Area, etc.) to review the forms that each type takes. Also, notice the other tabs located to the
right of the Gallery tab. We will return to these tabs.
Notice that the text contained in the chart’s upper window indicates that there are two ways to build a
chart (by dragging a GALLERY CHART or by clicking on the BASIC ELEMENTS tab). We will be using the
first method, dragging a GALLERY CHART to use as a starting point.
Bar Charts
We’ll use the COUNTRIES file to show what Gross Domestic Product per capita looks like in the various
regions of the world. From the GALLERY tab, click on BAR from the list of chart types. Click on the first
subtype (SIMPLE BAR) that then appears to the right (you can see the name of each chart as you hold
your mouse over it) and drag it up into the CHART BUILDER window. In addition to seeing the simple
bar chart show up in the CHART BUILDER window, you will see that a second dialog box, called ELEMENT
PROPERTIES, has opened. Figure 9-2 shows what you should be seeing on your screen. For the
moment, ignore the ELEMENT PROPERTIES box.
Figure 9-2
Locate region in the VARIABLE LIST, click on it, and drag it to the box labeled X-AXIS?. Drag gdpcapita
into the box labeled Y-AXIS?. When you do that, your screen should look like the one shown in
Figure 9-3.
Figure 9-3
The final step is to give the chart a title. Click on the TITLES/FOOTNOTES tab, then click the box next to
TITLE 1. In the right-hand pane, select the circle for CUSTOM and type in your title (for example, “Gross
Domestic Product per Capita by Region”). You may have to uncheck the box next to TITLE 1 and then
recheck it again to open the CUSTOM box. You should notice that the dialog boxes now look like Figure
9-4.
Figure 9-4
We are now finished defining what our chart should look like. Now click on OK. Your finished chart
should look like Figure 9-5.
Figure 9-5
If you wish, you may continue to edit your chart from the Output screen. To do this, double-click
anywhere in the chart, and it will open in the CHART EDITOR. Explore the menus in the Chart Editor to
experiment with what you can do. Try this:
Note: the unit of analysis in the data file is the country, not the individual. This means that a small
country contributes as much to the result as does a large one. The means shown in Figure 9-5
are, therefore, for the average country in each region, not the average person. We could obtain the
latter by weighting the per capita GDP in each country by its population, using a Compute
transformation (see Chapter 3, “Creating New Variables Using COMPUTE”). Try it.
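The difference between the two kinds of average can be sketched in a few lines of Python. The two countries and their figures are invented for illustration and do not come from the COUNTRIES file.

```python
# Hypothetical region with one small rich country and one large poor country.
# Each entry is (name, population, per capita GDP); all numbers are made up.
countries = [
    ("Smallland", 1_000_000, 40000.0),
    ("Bigland", 99_000_000, 2000.0),
]

# Average country: every country counts once, whatever its size.
unweighted_mean = sum(gdp for _, _, gdp in countries) / len(countries)

# Average person: each country's per capita GDP is weighted by its population.
total_population = sum(pop for _, pop, _ in countries)
weighted_mean = sum(pop * gdp for _, pop, gdp in countries) / total_population

print(unweighted_mean, weighted_mean)
```

With these made-up numbers the average country has a per capita GDP of 21,000, while the average person lives on 2,380, because almost everyone in the region lives in the larger, poorer country.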
Boxplots
Figure 9-5 shows that there are substantial differences in per capita GDP from one region to another. If
we want to look at differences within as well as between regions, we can do so using boxplots (a.k.a.
“box and whiskers” plots).
As before, click on GRAPHS, then on CHART BUILDER, and then on OK. From the GALLERY tab, click on
BOXPLOT from the list of chart types. Click on the first subtype (SIMPLE BOXPLOT) that then appears to
the right, and drag it up into the CHART BUILDER window. Locate region in the VARIABLE LIST, click on
it, and drag it to the box labeled X-AXIS?. Drag gdpcapita into the box labeled Y-AXIS?. Click on the
TITLES/FOOTNOTES tab, then click the box next to TITLE 1. Click on CUSTOM in the right-hand pane and
then type in your title (for example, “Gross Domestic Product per Capita Between and Within Regions”).
You may have to uncheck the box next to TITLE 1 and then recheck it again to open the CUSTOM
window.
One additional step is needed. Click on the GROUPS/POINT ID tab, then click the box next to POINT ID
LABEL. Notice that a new box (POINT LABEL VARIABLE?) has opened up in the CHART PREVIEW window.
Locate name in the VARIABLE LIST, click on it, and drag it into that box. You should notice that the dialog
boxes now look like Figure 9-6.
Figure 9-6
Click OK. Your finished chart should look like Figure 9-7.
Figure 9-7
1. The box represents the Interquartile Range (IQR), that is, the middle two quartiles of the
distribution for the region, with the top of the box indicating the 75th percentile, and the bottom
representing the 25th percentile. Some regions (e.g., Central America) have small boxes,
indicating that they are relatively homogeneous, while the boxes for others (e.g., Europe) are
larger, indicating that countries within the region vary considerably from one another.
2. The thick line inside each box represents the median, or 50th percentile. Half of the countries in
the region have a per capita GDP this high or higher, and half this low or lower.
3. The lines extending above and below the box are the “whiskers.” These represent countries
above or below the box, but no more than 1.5 times the IQR away from it. The longer the whiskers,
the greater the range within these parts of the distribution.
4. The circles above or below the whiskers are outliers: countries that are outside the box by 1.5 to
3 times the IQR. Gabon, for example, is a poor country, but is relatively well off compared to
other countries in Africa.
5. The asterisks above or below the whiskers are extreme outliers: countries outside the box by
more than 3 times the IQR. While Asia is, in general, relatively poor, there are several Asian
countries that are much wealthier than the region as a whole.
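The rules in the list above can be sketched in Python. This is an illustration under stated assumptions: the data values are invented, the function name classify is our own, and the quartiles come from Python's statistics module, whose method may differ slightly from the one SPSS uses, so borderline cases could be classified differently.

```python
import statistics

def classify(values):
    """Label each value by how far it lies outside the box, in units of the IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    labels = {}
    for v in values:
        # Distance beyond the nearer edge of the box (0 if the value is inside the box).
        distance = max(q1 - v, v - q3, 0)
        if distance <= 1.5 * iqr:
            labels[v] = "box or whiskers"
        elif distance <= 3 * iqr:
            labels[v] = "outlier"          # drawn as a circle
        else:
            labels[v] = "extreme outlier"  # drawn as an asterisk
    return labels

# Made-up per capita GDP figures (in thousands) for one hypothetical region.
gdp = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 40]
for value, label in classify(gdp).items():
    print(value, label)
```

Here the box runs from 3.5 to 8.5, so the IQR is 5. The value 20 lies 11.5 above the box (between 1.5 and 3 times the IQR) and is an outlier, while 40 lies more than 3 times the IQR above the box and is an extreme outlier.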
Line Charts
We’ll illustrate line charts using the GSS18A file, and will look at the relationship between respondents’
political ideology (a seven-point scale called polviews, with values ranging from lowest, “extremely
liberal,” to highest, “extremely conservative”), and party identification (another seven-point scale, this
one called partyid, with values ranging from lowest, “strong Democrat,” to highest, “strong
Republican”). We want to see whether, and to what degree, respondents’ choices of political parties
depend on their political ideology. We’ll then examine whether this pattern is the same or different for
Whites, Blacks, Hispanics, and those in other racial or ethnic categories.
Simple Line Charts: As before, click on GRAPHS, then CHART BUILDER, and then on OK. From the
GALLERY tab, click on LINE from the list of chart types. Click on the first subtype (SIMPLE LINE) that then
appears to the right, and drag it up into the Chart Builder window. Locate polviews in the VARIABLE
LIST, click on it, and drag it to the box labeled X-AXIS?. Drag partyid into the box labeled Y-AXIS?.15 Click
on the TITLES/FOOTNOTES tab, and then click the box next to TITLE 1. Click on CUSTOM in the right-
hand pane and then type in your title (for example, “Party Identification by Political Ideology”). You may
have to uncheck the box next to TITLE 1 and then recheck it again to open the CUSTOM window. In the
ELEMENT PROPERTIES window, under EDIT PROPERTIES OF:, highlight LINE 1, select MEDIAN from the
drop-down menu under STATISTIC:. The result should look like Figure 9-8.
15 When you drag polviews to the X-AXIS box, the name of the Y-AXIS box shifts from Y-AXIS to COUNT.
Figure 9-8
Near the bottom of the Chart Builder dialog box, click OK. Your finished chart should look like
Figure 9-9. The chart shows that, as expected, the more conservative respondents are, the more
Republican they tend to be.
Figure 9-9
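What the simple line chart plots can be sketched in Python: for each value of the x-axis variable, the median of the y-axis variable. The responses below are invented, and the numeric codes are assumptions for illustration; check the value labels in GSS18A for the actual coding.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (polviews, partyid) pairs; higher numbers stand for more
# conservative and more Republican answers. Not real GSS data.
responses = [
    (1, 0), (1, 1), (1, 0),
    (4, 3), (4, 2), (4, 4),
    (7, 6), (7, 5), (7, 6),
]

# Collect the partyid scores given by respondents at each polviews value.
groups = defaultdict(list)
for polviews, partyid in responses:
    groups[polviews].append(partyid)

# One point on the line per polviews value: the median partyid in that group.
line_points = {x: median(ys) for x, ys in sorted(groups.items())}
print(line_points)
```

With these made-up responses the medians rise from 0 to 3 to 6 as polviews increases, which is the upward-sloping pattern a chart like this would show when ideology and party identification are related.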
Multiple Line Charts: Let’s again examine the relationship between party identification and political
ideology, but with a separate chart for respondents in each of several racial/ethnic categories.
As before, click on GRAPHS, then CHART BUILDER, then on OK. Click on RESET to delete what you
previously did. Being sure that the GALLERY tab is highlighted above the lower portion of the dialog box,
click on LINE.
Now select the second image to the right (MULTIPLE LINE), and drag it to the window in the upper right
portion of the dialog box. As before, click on polviews from the list in the upper left of the box, and drag
it to the X-AXIS? box on the right. In a similar way, click on partyid and drag it to the Y-AXIS? box.16 Click
on the TITLES/FOOTNOTES tab, and then click the box next to TITLE 1. Click on CUSTOM in the right-
hand pane and then type in your title (for example, “Party Identification by Political Ideology, Controlling
for Ethnicity”). You may have to uncheck the box next to TITLE 1 and then recheck it again to open the
CUSTOM window. Under EDIT PROPERTIES OF:, highlight LINE 1, select MEDIAN from the drop-down
menu under STATISTIC:.
Now click on the GROUPS/POINT ID tab in the chart builder box. Check COLUMNS PANEL VARIABLE and
uncheck GROUPING/STACKING VARIABLE. From the list in the upper left of the Chart Builder, drag
ethnicity to the PANEL? box to the right. The result should look like Figure 9-10.
Figure 9-10
16 As in the previous example, when you drag polviews to the X-AXIS box, the name of the Y-AXIS box shifts
from Y-AXIS to COUNT.
Near the bottom of the Chart Builder dialog box, click OK. You should now see a chart like the one
shown in Figure 9-11. Note that, among Whites, party identification is heavily influenced by political
ideology. The same is true, but to a lesser degree, among Hispanics. Among Blacks, conservatives are not
much more likely than liberals to identify with the Republican party.
Figure 9-11
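Adding a panel variable amounts to computing the same medians separately within each category of the control variable. A minimal Python sketch, again with invented responses and assumed numeric codes rather than real GSS data:

```python
from collections import defaultdict
from statistics import median

# Hypothetical (ethnicity, polviews, partyid) triples; not real GSS data.
responses = [
    ("White", 1, 0), ("White", 1, 1), ("White", 1, 1),
    ("White", 7, 5), ("White", 7, 6), ("White", 7, 6),
    ("Black", 1, 0), ("Black", 1, 0), ("Black", 1, 1),
    ("Black", 7, 0), ("Black", 7, 1), ("Black", 7, 2),
]

# First split by the panel (control) variable, then by the x-axis variable.
groups = defaultdict(lambda: defaultdict(list))
for ethnicity, polviews, partyid in responses:
    groups[ethnicity][polviews].append(partyid)

# One line per panel: the median partyid at each polviews value.
panels = {
    ethnicity: {x: median(ys) for x, ys in sorted(by_polviews.items())}
    for ethnicity, by_polviews in groups.items()
}
print(panels)
```

In this made-up example the median rises steeply in one panel (from 1 to 6) and hardly at all in the other (from 0 to 1), which is how a relationship that differs across groups shows up when you control for a third variable.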
Tables
Using the GSS18A file, let’s create a cross tabulation of the variables sex and fear. Click on ANALYZE,
then DESCRIPTIVE STATISTICS, then CROSSTABS. Put fear in the ROW box and sex in the COLUMN box
(recall that in cross tabulations, the independent variable goes in the column position). Now click on
CELLS and select COLUMN in the Percentages box, and then click on CONTINUE, then OK. The Output
Window will appear, and your screen should look like Figure 9-12.
Figure 9-12
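The column percentages requested in the CELLS dialog can be sketched in Python. The counts below are invented for illustration and are not the GSS18A results.

```python
from collections import Counter

# Hypothetical cases as (sex, fear) pairs; the counts are made up.
cases = (
    [("Male", "Yes")] * 2 + [("Male", "No")] * 8
    + [("Female", "Yes")] * 5 + [("Female", "No")] * 5
)

cell_counts = Counter(cases)                      # one count per table cell
column_totals = Counter(sex for sex, _ in cases)  # one total per column (sex)

# Column percentage: the cell count divided by its column total, so each
# column (each category of the independent variable) sums to 100.
column_pct = {
    (fear, sex): 100 * n / column_totals[sex]
    for (sex, fear), n in cell_counts.items()
}

for (fear, sex), pct in sorted(column_pct.items()):
    print(f"fear={fear:<3} sex={sex:<6} {pct:.1f}%")
```

Because the percentages are computed within each category of the independent variable, you compare them across the columns: in this made-up table, 20 percent of men and 50 percent of women report being afraid.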
Right click on any part of the table, then click on EDIT CONTENT and on either IN VIEWER or IN
SEPARATE WINDOW. Double click on the part you wish to edit, then type in the changes you wish to
make. Figure 9-13 shows what the table might look like after we’ve changed the title and eliminated
some details that would not normally be included in a published essay.
Figure 9-13
Since you will probably be using a word processing program to prepare the report of your results, it will
be useful to copy your charts and tables from SPSS into your word-processing document. Let’s start
with the table we just created. There are two ways to do this. The simplest way is to click on the table
using the right mouse button. A small menu will appear; click on COPY. Then, go to your word-
processing document, and right-click where you want the table to appear. The small menu will appear
again; click PASTE.
The second way to copy the table is by using the menu commands. Make sure the table you want is
selected (you will see the red arrow pointing to it in the output log on the left side of your screen and
the table will have an outline around it). Click on EDIT on the menu bar, then click on COPY. Switch over
to your word-processing document. Click the mouse where you want to paste your table. Click on EDIT
on the menu bar, then click on PASTE. You might want to paste your graph into a Text box. This will
make your graph easier to move.
You could also copy the table as an image. In that case, click on PASTE SPECIAL instead of PASTE and
choose the format you want for your table. Note: The method for copying and pasting charts is exactly
the same as the method for copying and pasting tables.
Make a bar chart of trust. Then, edit the chart by giving it a proper title. Copy and paste the chart into a
word processing file. Write a few sentences that describe the pattern shown in the chart.
While views on political issues can influence a person’s party identification, some have suggested that,
with American politics having become increasingly partisan in recent years, the reverse may be true as
well. In other words, one’s party (as an important reference group) may shape how one feels about
political issues. Make a multiple line chart similar to that shown above, but use partyid as the
independent (x-axis) variable, and polviews as the dependent (y-axis) variable. As before, add a control
for ethnicity (See Figures 9-10 and 9-11).
Do a cross-tabulation of hapmar and trust. Since hapmar is the independent variable, place it in the
column location, and show column percentages (see Chapter 5 for a review). Be sure that your table is
properly titled. Copy and paste the table into a word processing file. Write a few sentences that discuss
the relationship of the information shown in the table to the information shown in the chart you created
for the first exercise above.
Next Chapter
In the previous chapters we discussed how to use SPSS to analyze your data. We talked about using
SPSS to describe your data, analyzing the relationship between pairs of variables, and extending our
analysis to include sets of three or more variables. Now we need to think about how to write a research
report so that others may read it and learn from our analysis. This report might be for a class you are
taking or it might be a report that you are submitting to a research conference. If you are going to
submit your report to a journal for possible publication, you need to look carefully at the instructions
that all journals provide on preparing a manuscript for publication.
Here's an outline for your report. Don't think that this is the only way you can organize your report, but
this is one way to do it.
Title page including your name, date, and class or institutional affiliation.
Abstract – An abstract is a short summary of what you did in the paper and the major findings of
your analysis. Abstracts are really short, so you need to be succinct. It should be less than 200
words or even shorter depending on the requirements of your professor or the research
conference to which you are submitting your paper.
Table of contents (optional).
Body of the paper.
o An introduction to the paper which explains why you wrote the report and provides an
introduction to the topic of the paper.
o Your review of the literature that summarizes what others discovered about this topic.
Virtually everything you might do has been written about by others. You should review
the relevant literature and summarize what others have found. You don't want to
simply list the relevant literature and consider the articles and books one by one.
Rather you want to summarize what others have done and look for themes around
which you can organize your literature review. If you are having trouble finding relevant
literature, go to the library at your university or a nearby university and talk with a
reference librarian. They are trained in searching for relevant literature and will be able
to help you.
o The methodology of your study.
If you collected your own data, discuss how you chose your sample, how you
measured the concepts, and how you collected your data.
If you used an existing data set, discuss the sampling, measurement, and data
collection used in that study. Studies that are part of data archives such as the
Inter-university Consortium for Political and Social Research at the University of Michigan
and the Roper Center for Public Opinion Research at Cornell University provide
good summaries for all data sets that are housed at their archive.
o Theory and Hypotheses – If you are using a theoretical perspective and/or testing
hypotheses, describe the theory and state the hypotheses you plan to test. Be sure to
cite supporting literature that forms the basis for your theory and hypotheses.
o Empirical findings and interpretation – What are the empirical findings that came out of
your data analysis and what did they tell you? If you are testing hypotheses, did your
analysis support your hypotheses? Remember that you are telling a story. Start simple
and build up. That means starting with looking at variables one at a time (i.e., univariate
analysis), then proceeding to relationships between pairs of variables (i.e., bivariate
analysis), and then looking at sets of three or more variables (i.e., multivariate analysis)
to consider such things as spuriousness.
o Conclusions and summary. This is a little like your abstract but not as short. What did
you do, what did you find in your study and what does it mean?
Tables. You may choose to put your tables in the body of your paper, or you may decide to put
them all at the end of your paper.
References. For every article or book that you cite, you need to provide a full bibliographic
reference at the end of the report.
Tables
There are advantages and disadvantages to putting your tables in the body of the report or at the end of
the report. Putting them in the body of the report keeps them front and center for the reader but they
often are bulky and get in the way of reading the report. Putting them at the end of the report gets
them out of the way and allows the reader to spread them out and look at them as he or she is reading
the paper. Your instructor or the research conference will usually tell you where to put your tables.
If they are placed at the end of the paper, put a note in the body of the report that says something like
"Table 1 about here." That will let the reader know where the table fits into your report.
Constructing a good table is important. Sometimes your instructor will tell you to copy tables from the
program you are using for statistical analysis (e.g., SPSS) into your paper. Other times you will construct
the tables yourself. A good reference on creating tables is The Chicago Guide to Writing About Numbers
by Jane E. Miller.17 Your word processing program (e.g., Word in Microsoft Office) will provide you with
templates that you can choose for your tables.
Footnotes or Endnotes
Often you want the reader to be aware of something, but you don't want to put it in the body of the
paper. It may be a technical issue such as how you recoded a variable or why you chose a particular
statistic. Or you may want to tell the reader that you will discuss something later in the paper. You can
17 Jane E. Miller. 2015. The Chicago Guide to Writing About Numbers. Chicago: University of Chicago Press.
put comments like these in either a footnote or an endnote. A footnote goes at the bottom of the page
and an endnote goes at the end of the paper. Your word processing program will allow you to enter
either footnotes or endnotes in your paper. Which you use is up to you unless your instructor or the
research conference tells you that one or the other is required.
There are many styles such as American Psychological Association (APA) or Modern Language
Association (MLA) that you could use to cite materials that you refer to in your paper. Remember that
anytime you refer to someone else's work, you must acknowledge the source. Your instructor or
research conference will often specify which style you should use.
Plagiarism
Plagiarism is using someone else's words or ideas without acknowledging the source. If you are quoting
from a document, you must cite the source. Even if you are paraphrasing, you must acknowledge the
source. If you are using someone else's ideas, you must also acknowledge the source. There is a good
review of plagiarism written by Earl Babbie that can be found on the internet by clicking here. Click on
the red arrow at the top to go forward or backward in this review of plagiarism.
Proofreading
Be sure to proofread your paper several times before submitting it. Use the spell and grammar checker
in your word processing program. You could also ask a friend to read it and tell you about any errors or
parts that are confusing.
There are many other guides to writing research reports. One that is commonly used in Sociology is the
Guide to Writing Sociology Papers.18 You can find others on the internet by entering "writing research
reports" in the search box.
18 Sociology Writing Group. A Guide to Writing Sociology Papers. 2013 (7th edition). Worth Publishers.
Note: Some variables are defined differently by different countries and data for a given variable
may be from different years.
Additional Notes:
Religion. The Wikipedia essay from which these variables are taken (accessed November 27, 2012) draws on a wide
variety of sources, resulting in inconsistencies of classification both within and between countries. Because of
double-counting and other factors, percentages for different religious categories within a country do not always total
to 100 and in a few cases are well above or below that number. Note also that, where the essay provides a range,
sometimes a very wide one, the midpoint has been used here. France does not include overseas departments, and
Tanzania does not include Zanzibar.
Political Rights Index, 2012 (1 = lowest; 7 = highest). Source: “Freedom House Country Rankings.” Accessed
November 20, 2012. Note: To avoid confusion in analysis, scores have been reversed both from previous versions
of this data subset and from the codes used by Freedom House.
Civil Liberties Index, 2012 (1 = lowest; 7 = highest). Source: Ibid. Note: To avoid confusion in analysis, scores have
been reversed both from previous versions of this data subset and from the codes used by Freedom House.
Internet Freedom Index, 2012 (0 = lowest; 100 = highest). Source: Ibid. Note: To avoid confusion in analysis, scores
have been reversed both from previous versions of this data subset and from the codes used by Freedom House.
Lack of Perceived Corruption, 2012. A measure of the degree to which lack of corruption is perceived to exist
among public officials and politicians. (0 = highly corrupt; 100 = very clean). Source: “Transparency International
Corruption Perceptions Index, 2012.” Accessed December 5, 2012. In previous versions of this data subset, the
variable was called “CORRUPTION.” Note: The name has been changed to avoid confusion in analysis.