Course ONE: Experimental Research
Introduction
Experimental research is one of the most powerful research methodologies that researchers can use. Of
the many types of research that might be used, the experiment is the best way to establish cause-and-
effect relationships among variables. Yet experiments are not always easy to conduct. In this chapter, we
will show you both the power of, and the problems involved in, conducting experiments.
The Uniqueness of Experimental Research
Of all the research methodologies described in this book, experimental research is unique in two very
important respects: It is the only type of research that directly attempts to influence a particular variable,
and when properly applied, it is the best type for testing hypotheses about cause-and-effect relationships.
In an experimental study, researchers look at the effect(s) of at least one independent variable on one or
more dependent variables. The independent variable in experimental research is also frequently referred
to as the experimental, or treatment, variable. The dependent variable, also known as the criterion, or
outcome, variable, refers to the results or outcomes of the study. The major characteristic of experimental
research that distinguishes it from all other types of research is that researchers manipulate the
independent variable. They decide the nature of the treatment (that is, what is going to happen to the
subjects of the study), to whom it is to be applied, and to what extent. Independent variables frequently
manipulated in educational research include methods of instruction, types of assignment, learning
materials, rewards given to students, and types of questions asked by teachers. Dependent variables that
are frequently studied include achievement, interest in a subject, attention span, motivation, and attitudes
toward school.
After the treatment has been administered for an appropriate length of time, researchers observe or
measure the groups receiving different treatments (by means of a posttest of some sort) to see if they
differ. Another way of saying this is that researchers want to see whether the treatment made a difference.
If the average scores of the groups on the posttest do differ and researchers cannot find any sensible
alternative explanations for this difference, they can conclude that the treatment did have an effect and is
likely the cause of the difference. Experimental research, therefore, enables researchers to go beyond
description and prediction, beyond the identification of relationships, to at least a partial determination of
what causes them. Correlational studies may demonstrate a strong relationship between socioeconomic
level and academic achievement, for instance, but they cannot demonstrate that improving socioeconomic
level will necessarily improve achievement. Only experimental research has this capability. Some actual
examples of the kinds of experimental studies that have been conducted by educational researchers are as
follows:
1. The effect of small classes on instruction.
2. The effect of early reading instruction on growth rates of at-risk kindergarteners.
3. The use of intensive mentoring to help beginning teachers develop balanced instruction.
4. The effect of lotteries on Web survey response rates.
5. The introduction of a course on bullying into the preservice teacher-training curriculum.
Essential Characteristics of Experimental Research
The word experiment has a long and illustrious history in the annals of research. It has often been hailed
as the most powerful method that exists for studying cause and effect. Its origins go back to the very
beginnings of history when, for example, primeval humans first experimented with ways to produce fire.
One can imagine countless trial-and-error attempts on their part before achieving success by sparking
rocks or by spinning wooden spindles in dry leaves. Much of the success of modern science is due to
carefully designed and meticulously implemented experiments. The basic idea underlying all
experimental research is really quite simple: Try something and systematically observe what happens.
Formal experiments consist of two basic conditions. First, at least two (but often more) conditions or
methods are compared to assess the effect(s) of particular conditions or “treatments” (the independent
variable). Second, the independent variable is directly manipulated by the researcher. Change is planned
for and deliberately manipulated in order to study its effect(s) on one or more outcomes (the dependent
variable). Let us discuss some important characteristics of experimental research in a bit more detail.
Comparison of Groups
An experiment usually involves two groups of subjects, an experimental group and a control or a
comparison group, although it is possible to conduct an experiment with only one group (by providing all
treatments to the same subjects) or with three or more groups. The experimental group receives a
treatment of some sort (such as a new textbook or a different method of teaching), while the control group
receives no treatment (or the comparison group receives a different treatment). The control or the
comparison group is crucially important in all experimental research, for it enables the researcher to
determine whether the treatment has had an effect or whether one treatment is more effective than
another. Historically, a pure control group is one that receives no treatment at all. While this is often the
case in medical or psychological research, it is rarely true in educational research. The control group
almost always receives a different treatment of some sort. Some educational researchers, therefore, refer
to comparison groups rather than to control groups. Consider an example. Suppose a researcher wished to
study the effectiveness of a new method of teaching science. He or she would have the students in the
experimental group taught by the new method, but the students in the comparison group would continue
to be taught by their teacher’s usual method. The researcher would not administer the new method to the
experimental group and have a control group do nothing. Any method of instruction would likely be more
effective than no method at all!
Manipulation of the Independent Variable
The second essential characteristic of all experiments is that the researcher actively manipulates the
independent variables. What does this mean? Simply put, it means that the researcher deliberately and
directly determines what forms the independent variable will take and then which group will get which
form. For example, if the independent variable in a study is the amount of enthusiasm an instructor
displays, a researcher might train two teachers to display different amounts of enthusiasm as they teach
their classes. Although many independent variables in education can be manipulated, many others cannot.
Examples of independent variables that can be manipulated include teaching method, type of counseling,
learning activities, assignments given, and materials used; examples of independent variables that cannot
be manipulated include gender, ethnicity, age, and religious preference. Researchers can manipulate the
kinds of learning activities to which students are exposed in a classroom, but they cannot manipulate, say,
religious preference—that is, students cannot be “made into” Protestants, Catholics, Jews, or Muslims, for
example, to serve the purposes of a study. To manipulate a variable, researchers must decide who is to get
something and when, where, and how they will get it. The independent variable in an experimental study
may be established in several ways: (1) one form of the variable versus another; (2) presence
versus absence of a particular form; or (3) varying degrees of the same form. An example of (1) would be
a study comparing the inquiry method with the lecture method of instruction in teaching chemistry. An
example of (2) would be a study comparing the use of PowerPoint slides versus no PowerPoint slides in
teaching statistics. An example of (3) would be a study comparing the effects of different specified
amounts of teacher enthusiasm on student attitudes toward mathematics. In both (1) and (2), the variable
(method) is clearly categorical. In (3), a variable that in actuality is quantitative (degree of enthusiasm) is
treated as categorical (the effects of only specified amounts of enthusiasm will be studied) in order for the
researcher to manipulate (that is, to control for) the amount of enthusiasm.
Randomization
An important aspect of many experiments is the random assignment of subjects to groups. Although there
are certain kinds of experiments in which random assignment is not possible, researchers try to use
randomization whenever feasible. It is a crucial ingredient in the best kinds of experiments. Random
assignment is similar, but not identical, to the concept of random selection. Random assignment means
that every individual who is participating in an experiment has an equal chance of being assigned to any
of the experimental or control conditions being compared. Random selection, on the other hand, means
that every member of a population has an equal chance of being selected to be a member of the sample.
Under random assignment, each member of the sample is given a number (arbitrarily), and a table of
random numbers is then used to select the members of the experimental and control groups. Three things
should be noted about the random assignment of subjects to groups. First, it takes place before the
experiment begins. Second, it is a process of assigning or distributing individuals to groups, not a result of
such distribution. This means that you cannot look at two groups that have already been formed and be
able to tell, just by looking, whether or not they were formed randomly. Third, the use of random
assignment allows the researcher to form groups that, right at the beginning of the study, are equivalent –
that is, they differ only by chance in any variables of interest. In other words, random assignment is
intended to eliminate the threat of extraneous, or additional, variables – not only those of which
researchers are aware but also those of which they are not aware – that might affect the outcome of the
study. This is the beauty and the power of random assignment. It is one of the reasons why experiments
are, in general, more effective than other types of research for assessing cause-and-effect relationships.
This last statement is tempered, of course, by the realization that groups formed through random
assignment may still differ somewhat. Random assignment ensures only that groups are equivalent (or at
least as equivalent as human beings can make them) at the beginning of an experiment. Furthermore,
random assignment is no guarantee of equivalent groups unless both groups are sufficiently large. No one
would expect random assignment to result in equivalence if only five subjects were assigned to each
group, for example. There are no rules for determining how large groups must be, but most researchers
are uncomfortable relying on random assignment with fewer than 40 subjects in each group.
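The mechanics of random assignment are simple enough to sketch in a few lines of Python. In the sketch below (the roster names and group sizes are hypothetical), shuffling the pool and dealing subjects out round-robin gives every subject an equal chance of landing in either condition:

```python
import random

def randomly_assign(subjects, n_groups=2, seed=None):
    """Randomly assign subjects to n_groups of (near-)equal size.

    Every subject has an equal chance of ending up in any group,
    which is the defining property of random assignment.
    """
    rng = random.Random(seed)
    pool = list(subjects)
    rng.shuffle(pool)                       # put subjects in random order
    # Deal subjects out round-robin so group sizes differ by at most one.
    return [pool[i::n_groups] for i in range(n_groups)]

# A hypothetical roster of 80 students: 40 per group, the minimum
# most researchers want before relying on random assignment.
students = [f"student_{i:02d}" for i in range(80)]
experimental, control = randomly_assign(students, n_groups=2, seed=42)
print(len(experimental), len(control))      # 40 40
```

Passing a seed makes the assignment reproducible, which is useful for documenting exactly how the groups were formed.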
Control of Extraneous Variables
Researchers in an experimental study have an opportunity to exercise far more control than in most other
forms of research. They determine the treatment (or treatments), select the sample, assign individuals to
groups, decide which group will get the treatment, try to control other factors besides the treatment that
might influence the outcome of the study, and then (finally) observe or measure the effect of the treatment
on the groups when the treatment is completed. It is very important for researchers conducting an
experimental study to do their best to control for— that is, to eliminate or to minimize the possible effect
of—these threats. If researchers are unsure whether another variable might be the cause of a result
observed in a study, they cannot be sure what the cause really is. For example, if a researcher attempted to
compare the effects of two different methods of instruction on student attitudes toward history but did not
make sure that the groups involved were equivalent in ability, then ability might be a possible alternative
explanation (rather than the difference in methods) for any differences in attitudes of the groups found on
a posttest. In particular, researchers who conduct experimental studies try their best to control any and all
subject characteristics that might affect the outcome of the study. They do this by ensuring that the two
groups are as equivalent as possible on all variables other than the one or ones being studied (that is, the
independent variables). How do researchers minimize or eliminate threats due to subject characteristics?
Many ways exist. Here are some of the most common.
Randomization: As we mentioned before, if subjects can be randomly assigned to the various groups
involved in an experimental study, researchers can assume that the groups are equivalent. This is the best
way to ensure that the effects of one or more possible extraneous variables have been controlled.
Holding certain variables constant: The idea here is to eliminate the possible effects of a variable by
removing it from the study. For example, if a researcher suspects that gender might influence the
outcomes of a study, she could control for it by restricting the subjects of the study to females and by
excluding all males. The variable of gender, in other words, is held constant. However, there is a cost
involved (as there almost always is) for this control, as the generalizability of the results of the study is
correspondingly reduced.
Building the variable into the design: This solution involves building the variable(s) into the study to
assess their effects. It is the exact opposite of the previous idea. Using the preceding example, the
researcher would include both females and males (as distinct groups) in the design of the study and then
analyze the effects of both gender and method on outcomes.
Matching: Often pairs of subjects can be matched on certain variables of interest. If a researcher felt that
age, for example, might affect the outcome of a study, he might endeavor to match students according to
their ages and then assign one member of each pair (randomly if possible) to each of the comparison
groups.
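A minimal sketch of this kind of matching in Python (the records and the age tolerance are hypothetical): subjects are sorted on the matching variable, adjacent subjects within the tolerance are paired, and each pair is then split at random between the comparison groups. Subjects with no close match are left over, a cost of matching discussed later in the chapter.

```python
import random

def match_pairs(subjects, key, tolerance=1):
    """Greedily pair subjects whose matching-variable values differ by
    at most `tolerance`; subjects with no close match go unmatched."""
    ordered = sorted(subjects, key=key)
    pairs, unmatched = [], []
    i = 0
    while i < len(ordered) - 1:
        a, b = ordered[i], ordered[i + 1]
        if abs(key(a) - key(b)) <= tolerance:
            pairs.append((a, b))
            i += 2
        else:
            unmatched.append(a)
            i += 1
    if i == len(ordered) - 1:
        unmatched.append(ordered[-1])
    return pairs, unmatched

# Hypothetical (name, age) records.
students = [("A", 11), ("B", 12), ("C", 12), ("D", 14), ("E", 11), ("F", 16)]
pairs, unmatched = match_pairs(students, key=lambda s: s[1], tolerance=1)

rng = random.Random(0)
experimental, control = [], []
for a, b in pairs:
    first, second = rng.sample([a, b], 2)   # coin flip within each pair
    experimental.append(first)
    control.append(second)
print(len(pairs), len(unmatched))           # 2 2
```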
Using subjects as their own controls: When subjects are used as their own controls, their performance
under both (or all) treatments is compared. Thus, the same students might be taught algebra units first by
an inquiry method and later by a lecture method. Another example is the assessment of an individual’s
behavior during a period of time before and after a treatment is implemented to see whether changes in
behavior occur.
Using analysis of covariance: Analysis of covariance can be used to equate groups statistically on the
basis of a pretest or other variables. The posttest scores of the subjects in each group are then adjusted
accordingly.
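The adjustment at the heart of analysis of covariance can be sketched as follows. This is a deliberate simplification (a full ANCOVA uses the pooled within-group regression slope and an accompanying significance test), and the scores are hypothetical:

```python
import numpy as np

def adjusted_posttest_means(pre, post, group):
    """Adjust each group's mean posttest score for pretest differences.

    Fits a regression of posttest on pretest, then slides each group's
    raw posttest mean along that line to the grand pretest mean, so the
    groups are compared as if they had started in the same place.
    """
    pre, post, group = map(np.asarray, (pre, post, group))
    slope = np.polyfit(pre, post, 1)[0]      # posttest-on-pretest slope
    grand_pre = pre.mean()
    adjusted = {}
    for g in np.unique(group):
        mask = group == g
        adjusted[g] = post[mask].mean() - slope * (pre[mask].mean() - grand_pre)
    return adjusted

# Hypothetical scores: group B started well ahead on the pretest,
# but both groups gained the same 5 points.
adj = adjusted_posttest_means(
    pre=[10, 12, 20, 22], post=[15, 17, 25, 27], group=["A", "A", "B", "B"]
)
print(adj)   # the adjusted means coincide once the head start is removed
```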
Group Designs in Experimental Research
The design of an experiment can take a variety of forms. Some of the designs we present in this section
are better than others because they do a better job of ruling out alternative explanations, that is, of
controlling threats to the validity of the study.
Poor Experimental Designs
In addition to the independent variable, there are a number of other plausible explanations for any
outcomes that occur. As a result, any researcher who uses one of these designs has difficulty assessing the
effectiveness of the independent variable.
The One-Shot Case Study
In the one-shot case study design, a single group is exposed to a treatment or event and a dependent
variable is subsequently observed (measured) in order to assess the effect of the treatment. A diagram of
this design is as follows:
The one-shot case study design
X                  O
Treatment          Observation (dependent variable)
The symbol X represents exposure of the group to the treatment of interest, while O refers to observation
(measurement) of the dependent variable. The placement of the symbols from left to right indicates the
order in time of X and O. As you can see, the treatment, X, comes before observation of the dependent
variable, O. Suppose a researcher wishes to see if a new textbook increases student interest in history. He
uses the textbook (X) for a semester and then measures student interest (O) with an attitude scale. A
diagram of this example is shown in Figure 1. The most obvious weakness of this design is its absence of
any control. The researcher has no way of knowing if the results obtained at O (as measured by the
attitude scale) are due to treatment X (the textbook). The design does not provide for any comparison, so
the researcher cannot compare the treatment results (as measured by the attitude scale) with the same
group before using the new textbook, or with those of another group using a different textbook. Because
the group has not been pretested in any way, the researcher knows nothing about what the group was like
before using the text.
Fig. 1 Example of the one-shot case study design
X                  O
New textbook       Attitude scale to measure interest (dependent variable)
Thus, he does not know whether the treatment had any effect at all. It is quite possible that the students
who use the new textbook will indicate very favorable attitudes toward history. But the question remains,
were these attitudes produced by the new textbook? Unfortunately, the one-shot case study does not help
us answer this question. To remedy this design, a comparison could be made with another group of
students who had the same course content presented in the regular textbook. (We shall show you just such
a design shortly.) Fortunately, the flaws in the one-shot design are so well known that it is seldom used in
educational research.
The One-Group Pretest-Posttest Design
In the one-group pretest-posttest design, a single group is measured or observed not only after being
exposed to a treatment of some sort, but also before. A diagram of this design is as follows:
The one-group pretest-posttest design
O X O
Pretest Treatment Posttest
Consider an example of this design. A principal wants to assess the effects of weekly counseling sessions
on the attitudes of certain “hard-to-reach” students in her school. She asks the counselors in the program
to meet once a week with these students for a period of 10 weeks, during which sessions the students are
encouraged to express their feelings and concerns. She uses a 20-item scale to measure student attitudes
toward school both immediately before and after the 10-week period. Figure 2 presents a diagram of the
design of the study. This design is better than the one-shot case study (the researcher at least knows
whether any change occurred), but it is still weak.
Fig. 2 Example of the one-group pretest-posttest design
O                                  X                        O
Pretest: 20-item attitude scale    Treatment: 10 weeks      Posttest: 20-item attitude scale
completed by students              of counseling            completed by students
(dependent variable)                                        (dependent variable)
The Static-Group Comparison Design
In the static-group comparison design, two already-existing, or intact, groups are compared: one group
receives the experimental treatment and the other does not. A diagram of this design is as follows:
The static-group comparison design
X    O
............................
     O
The dashed line indicates that the two groups being compared are already formed – that is, the subjects
are not randomly assigned to the two groups. X symbolizes the experimental treatment. The blank space
in the design indicates that the “control” group does not receive the experimental treatment; it may
receive a different treatment or no treatment at all. The two Os are placed exactly vertical to each other,
indicating that the observation or measurement of the two groups occurs at the same time. Consider again
the example used to illustrate the one shot case study design. We could apply the static-group comparison
design to this example. The researcher would (1) find two intact groups (two classes), (2) assign the new
textbook (X) to one of the classes but have the other class use the regular textbook, and then (3) measure
the degree of interest of all students in both classes at the same time (for example, at the end of the
semester). Figure 3 presents a diagram of this example. Although this design provides better control over
history, maturation, testing, and regression threats, it is more vulnerable not only to mortality and
location, but also, more importantly, to the possibility of differential subject characteristics.
The Static-Group Pretest-Posttest Design
The static-group pretest-posttest design differs from the static-group comparison design only in that a
pretest is administered to both groups. A diagram of this design is as follows:
The static-group pretest-posttest design
O    X    O
...................................................
O         O
In analyzing the data, each individual’s pretest score is subtracted from his or her posttest score, thus
permitting analysis of “gain” or “change.” While this provides better control of the subject characteristics
threat (since it is the change in each student that is analyzed), the amount of gain often depends on initial
performance; that is, the group scoring higher on the pretest is likely to improve more (or in some cases
less), and thus subject characteristics still remain somewhat of a threat. Further, administering a pretest
raises the possibility of a testing threat. In the event that the pretest is used to match groups, this design
becomes the matching-only pretest-posttest control group design, a much more effective
design.
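Gain-score analysis amounts to a single subtraction per student. A sketch with hypothetical pretest and posttest scores:

```python
def gain_scores(pretest, posttest):
    """Per-student gain: posttest score minus pretest score."""
    return [post - pre for pre, post in zip(pretest, posttest)]

# Hypothetical attitude-scale scores for four students.
pre = [52, 60, 71, 48]
post = [58, 63, 70, 57]
print(gain_scores(pre, post))   # [6, 3, -1, 9]
```

The two groups' mean gains, rather than their raw posttest means, are then compared.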
True Experimental Designs
The essential ingredient of a true experimental design is that subjects are randomly assigned to treatment
groups.
The Randomized Posttest-Only Control Group Design
The randomized posttest-only control group design involves two groups, both of which are formed by random
assignment. One group receives the experimental treatment while the other does not, and then both groups are
posttested on the dependent variable. A diagram of this design is as follows:
The randomized posttest-only control group design
Treatment group R X O
Control group R C O
As before, the symbol X represents exposure to the treatment and O refers to the measurement of the
dependent variable. R represents the random assignment of individuals to groups. C now represents the
control group. In this design, the control of certain threats is excellent. Through the use of random
assignment, the threats of subject characteristics, maturation, and statistical regression are well controlled
for. Because none of the subjects in the study are measured twice, testing is not a possible threat. This is
perhaps the best of all designs to use in an experimental study, provided there are at least 40 subjects in
each group.
As an example of this design, consider a hypothetical study in which a researcher investigates the effects
of a series of sensitivity training workshops on faculty morale in a large high school district. The
researcher randomly selects a sample of 100 teachers from all the teachers in the district. The researcher
then (1) randomly assigns the teachers in the district to two groups; (2) exposes one group, but not the
other, to the training; and then (3) measures the morale of each group using a questionnaire. Figure 4
presents a diagram of this hypothetical experiment. Again we stress that it is important to keep clear the
distinction between random selection and random assignment. Both involve the process of randomization,
but for a different purpose. Random selection, you will recall, is intended to provide a representative
sample. But it may or may not be accompanied by the random assignment of subjects to groups. Random
assignment is intended to equate groups, and often is not accompanied by random selection.
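The distinction is easy to see in code. In this Python sketch (the population size and subject names are hypothetical), `sample` performs random selection from the population, while `shuffle` plus a split performs random assignment within the selected sample:

```python
import random

rng = random.Random(7)

# Random selection: draw a representative sample from the population.
population = [f"teacher_{i:03d}" for i in range(500)]
sample = rng.sample(population, 100)

# Random assignment: split the selected sample into equivalent groups.
rng.shuffle(sample)
treatment, control = sample[:50], sample[50:]
print(len(treatment), len(control))   # 50 50
```

Either step can occur without the other: a study may randomly assign a convenience sample (no random selection), or randomly select subjects who are then placed in groups nonrandomly.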
The Randomized Pretest-Posttest Control Group Design
The randomized pretest-posttest control group design differs from the randomized posttest-only control
group design solely in the use of a pretest: both groups are formed by random assignment, both are
pretested on the dependent variable, one group then receives the treatment, and both are posttested. A
diagram of this design is as follows:
The randomized pretest-posttest control group design
Treatment group R O X O
Control group R O C O
The use of the pretest raises the possibility of a pretest-treatment interaction threat, since it may “alert”
the members of the experimental group, thereby causing them to do better (or more poorly) on the posttest
than the members of the control group. A trade-off is that it provides the researcher with a means of
checking whether the two groups are really similar—that is, whether random assignment actually
succeeded in making the groups equivalent. This is particularly desirable if the number in each group is
small (less than 30). If the pretest shows that the groups are not equivalent, the researcher can seek to
make them so by using one of the matching designs we will discuss shortly. A pretest is also necessary if
the amount of change over time is to be assessed.
Fig. 5 Example of randomized posttest-only control group design
The Randomized Solomon Four-Group Design
The randomized Solomon four-group design is an attempt to eliminate the possible effect of a pretest. It
involves random assignment of subjects to four groups, with two of the groups being pretested and two
not. One of the pretested groups and one of the un-pretested groups is exposed to the experimental
treatment. All four groups are then post-tested. A diagram of this design is as follows:
The randomized Solomon four-group design
Treatment group R O X O
Control group   R O C O
Treatment group R   X O
Control group   R   C O
The randomized Solomon four-group design combines the pretest-posttest control group and posttest-only
control group designs. The first two groups represent the pretest-posttest control group design, while the
last two groups represent the posttest-only control group design. Figure 6 presents an example of the
randomized Solomon four-group design. A weakness, however, is that it requires a large sample because
subjects must be assigned to four groups. Furthermore, conducting a study involving four groups at the
same time requires a considerable amount of energy and effort on the part of the researcher.
Fig. 6 Example of randomized Solomon four-group design
Random Assignment with Matching
In an attempt to increase the likelihood that the groups of subjects in an experiment will be equivalent,
pairs of individuals may be matched on certain variables. The choice of variables on which to match is
based on previous research, theory, and/or the experience of the researcher. The members of each
matched pair are then assigned to the experimental and control groups at random. This adaptation can be
made to both the posttest-only control group design and the pretest-posttest control group design,
although the latter is more common. Diagrams of these designs are provided below:
The randomized posttest-only control group design, using matched subjects
Treatment group Mr X O
Control group Mr C O
The randomized pretest-posttest control group design, using matched subjects
Treatment group Mr O X O
Control group Mr O C O
The symbol Mr refers to the fact that the members of each matched pair are randomly assigned to the
experimental and control groups. Although a pretest of the dependent variable is commonly used to
provide scores on which to match, a measurement of any variable that shows a substantial relationship to
the dependent variable is appropriate. Matching may be done in either or both of two ways: mechanically
or statistically. Both require a score for each subject on each variable on which subjects are to be
matched.
Mechanical matching is a process of pairing two persons whose scores on a particular variable are
similar.
Two girls, for example, whose mathematics aptitude scores and test anxiety scores are similar might be
matched on those variables. After the matching is completed for the entire sample, a check should be
made (through the use of frequency polygons) to ensure that the two groups are indeed equivalent on each
matching variable. Unfortunately, two problems limit the usefulness of mechanical matching. First, it is
very difficult to match on more than two or three variables—people just don’t pair up on more than a few
characteristics, making it necessary to have a very large initial sample to draw from. Second, in order to
match, it is almost inevitable that some subjects must be eliminated from the study because no “matches”
for them can be found. Samples then are no longer random even though they may have been before
matching occurred. As an example of a mechanical matching design with random assignment, suppose a
researcher is interested in the effects of academic coaching on the grade point averages (GPA) of
low-achieving students in science classes. The researcher randomly selects a sample of 60 students from a
population of 125 such students in a local elementary school and matches them by pairs on GPA, finding
that she can match 40 of the 60. She then randomly assigns each subject in the resulting 20 pairs to either
the experimental or the control group.
Statistical matching, on the other hand, does not necessitate a loss of subjects, nor does it limit the
number of matching variables. Each subject is given a “predicted” score on the dependent variable, based
on the correlation between the dependent variable and the variable (or variables) on which the subjects are
being matched. The difference between the predicted and actual scores for each individual is then used to
compare experimental and control groups.
When a pretest is used as the matching variable, the difference between the predicted and actual score is
called a regressed gain score. This score is preferable to the more straightforward gain scores (posttest
minus pretest score for each individual) primarily because it is more reliable.
If mechanical matching is used, one member of each matched pair is randomly assigned to the
experimental group, the other to the control group. If statistical matching is used, the sample is divided
randomly at the outset, and the statistical adjustments are made after all data have been collected.
Although some researchers advocate the use of statistical over mechanical matching, statistical matching
is not infallible. Its major weakness is that it assumes that the relationship between the dependent variable
and each predictor variable can be properly described by a straight line rather than a curved line.
Whichever procedure is used, the researcher must (in this design) rely on random assignment to equate
groups on all other variables related to the dependent variable.
Correlational research is nonexperimental research that is similar to ex post facto research in that they both employ data
derived from preexisting variables. There is no manipulation of the variables in either type of research. They differ in that in ex
post facto research, selected variables are used to make comparisons between two or more existing groups, whereas correlational
research assesses the relationships among two or more variables in a single group. Ex post facto research investigates possible
cause-and-effect relationships; correlational research typically does not. An advantage of correlational research
is that it provides information about the strength of relationships between variables. An ex post facto
researcher might define those who make more than $200,000 per year as high earners and those who
make less than $40,000 per year as low earners and then compare the mean percent of income paid in
taxes for each group. The researcher’s data show that the average percent of income paid in taxes by the
low earners, at 19 percent, is greater than the 10 percent paid by the high earners. The conclusion is that
low-income earners pay a higher percent of their income in taxes than do high-income earners. A
correlational researcher would record the income and the percent of income paid in taxes for all people in
the study. This researcher might report a correlation coefficient of -.6, indicating a strong negative
correlation between the two variables.
Correlational research produces an index that shows both the direction and the strength of relationships
among variables, taking into account the entire range of these variables. This index is called a correlation
coefficient. Recall from Chapter 6 that in interpreting a coefficient of correlation, one looks at both its
sign and its size. The sign (+ or -) of the coefficient indicates the direction of the relationship. If the
coefficient has a positive sign, this means that as one variable increases, the other also increases. For
example, the correlation between height and weight is positive because tall people tend to be heavier and
short people lighter. A negative coefficient indicates that as one variable increases, the other decreases.
The correlation between outdoor air temperature during the winter months and heating bills is negative; as
temperature decreases, heating bills rise. The size of the correlation coefficient indicates the strength of
the relationship between the variables. The coefficient can range in value from +1.00 (indicating a perfect
positive relationship) through 0 (indicating no relationship) to -1.00 (indicating a perfect negative
relationship). A perfect positive relationship means that for every z-score unit increase in one variable
there is an identical z-score unit increase in the other. A perfect negative relationship indicates that for
every unit increase in one variable there is an identical unit decrease in the other. Few variables ever show
perfect correlation, especially in relating human characteristics.
For example, scores on scholastic aptitude tests and grades in high school are
related to college grade point average (GPA). If a student scores high on aptitude tests and has high
grades in high school, he or she is more likely to make high grades in college than is a student who scores
low on the two predictor variables. Researchers can predict with a certain degree of accuracy a student’s
probable freshman GPA based on high school grades and aptitude test scores. This prediction will not
hold for every case because other factors, such as motivation, initiative, or study habits, are not
considered. However, in general, the prediction is good enough to be useful to college admissions officers
3. DESIGN OF CORRELATIONAL RESEARCH
The basic design for correlational research is straightforward. First, the researcher specifies the problem
by asking a question about the relationship between the variables of interest. The variables selected for
investigation are generally based on a theory, previous research, or the researcher’s observations. Because
of the potential for spurious results, we do not recommend the “shotgun” approach in which one
correlates a number of variables just to see what might show up. The population of interest is also
identified at this time. In simple correlational studies, the researcher focuses on gathering data on two (or
more) measures from a single group of subjects. For example, you might correlate vocabulary and reading
comprehension scores for a group of middle school students. Occasionally, correlational studies
investigate relationships between scores on one measure for logically paired groups such as twins,
siblings, or husbands and wives. For instance, a researcher might want to study the correlation between
the SAT scores of identical twins. The following is an example of a typical correlational research
question: What is the relationship between quantitative ability and achievement in science among high
school students? The researcher determines how the constructs, ability and achievement, will be
quantified. He or she may already be aware of well-accepted operational definitions of the constructs,
may seek definitions in sources such as those described in Chapter 4, or may develop his or her own
operational definitions and then assess their reliability and validity. In the example, the researcher may
decide that quantitative ability will be defined as scores on the School and College Ability Test, Series III
(SCAT III), and science achievement will be defined as scores on the science sections of the Sequential
Tests of Educational Progress (STEP III). You learned in Chapters 8 and 9 that it is important to select or
develop measures that are appropriate indicators of the constructs to be investigated, and that it is
especially important that these instruments have satisfactory reliability and are valid for measuring the
constructs under consideration. In correlational research, the size of a coefficient of correlation is
influenced by the adequacy of the measuring instruments for their intended purpose. Instruments that are
too easy or too difficult for the participants in a study would not discriminate among them and would
result in a smaller correlation coefficient than instruments with appropriate difficulty levels. Studies using
instruments with low reliability and questionable validity are unlikely to produce useful results. Following
the selection or development of instruments, the researcher specifies his or her population of interest and
draws a random sample from that population. Finally, the researcher collects the quantitative data on the
two or more variables for each of the students in the sample and then calculates the coefficient(s) of
correlation between the paired scores. Before calculating the coefficient, the researcher should look at a
scatterplot or a graph of the relationship between the variables.
4. CORRELATION COEFFICIENTS
There are many different kinds of correlation coefficients. The researcher chooses the appropriate
statistical procedure primarily on the basis of (1) the scale of measurement of the measures used and (2)
the number of variables.
4.1. PEARSON PRODUCT MOMENT COEFFICIENT OF CORRELATION
The Pearson coefficient is appropriate for use when the variables to be correlated are normally distributed and
measured on an interval or ratio scale. We briefly mention some of the other indexes of correlation
without going into their computation. Interested students should consult statistics books for the
computational procedures.
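To make the computation concrete, here is a minimal Python sketch of the Pearson r calculated as the mean z-score product (using population standard deviations); the scores are invented for illustration:

```python
import math

def pearson_r(x, y):
    """Pearson r as the mean z-score product (population standard deviations)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / n

# Invented scores: y is a perfect linear function of x, so r = +1.0
print(round(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]), 4))  # 1.0
```

Statistical packages compute the same quantity; the hand formula is shown only to connect the coefficient to the z-score definition given later in this chapter.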
4.2. COEFFICIENT OF DETERMINATION
Unsophisticated consumers of research often assume that a correlation indicates percentage of
relationship, for example, that an r of .60 means the two variables are 60 percent related. In fact, r is the
mean z-score product for the two variables, not a percentage. The absolute size of the correlation
coefficient (how far it is from zero) indicates the strength of the relationship. Thus, a correlation of -.4
indicates a stronger relationship than a +.2 because it is further from zero. The sign has nothing to do with
the strength of the relationship. Another way to see how closely two variables are related is to square the
correlation coefficient. When you square the Pearson r, you get an index called the coefficient of
determination, r², which tells you how much of the variance of Y is in common with the variance of X.
A correlation of +.60 or -.60 means that the two variables have (.60² = .36) or 36 percent of their variance in
common with each other. If the two variables were caffeine and reaction time, then the amount of caffeine
one has consumed would be associated with 36 percent of the variance in one’s reaction time. That leaves
64 percent of the variance in reaction time associated with factors other than variation in caffeine intake.
The notion of common variance is illustrated in Figure 13.1, in which the total amount of variation in
each variable is represented by a circle. The overlap of the circles represents the common variance.
An increase in r results in an accelerating increase in r². A correlation of .20 yields a coefficient of
determination of .04. An r of .4 yields an r² of .16. An r of .8 yields an r² of .64, and so on. The
coefficient of determination is a useful index for evaluating the meaning of the size of a correlation. It also
reminds one that positive and negative correlations of the same magnitude, for example, r = .5 and r = -.5,
are equally useful for prediction and other uses because both have the same coefficient of determination,
r² = .25. The coefficient of determination ranges from 0 to +1.00. If it is 1.00 (r = +1.00 or -1.00), you can predict
individuals’ scores on one variable perfectly from their scores on the other variable. The Pearson r and r²
are only appropriate where the relationship between X and Y is linear. Linear means that a straight line is
a good fit for showing the tilt of the cloud of data in a scattergram. In Chapter 6, several scattergrams are
shown. All of them are linear except Figure 6.13. Look at those figures and note the difference between
this figure and the rest. Fortunately, most correlations found in the behavioral sciences are linear.
However, before you proceed to calculate and interpret a Pearson r for your data, have your computer
print out a scatter diagram for your data. If the relationship is not linear, the Pearson r is not appropriate
for assessing the relationship between variables.
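The accelerating growth of r² relative to r described above can be verified with a few lines of Python:

```python
# r^2 (coefficient of determination) grows faster than r itself,
# and correlations of equal magnitude but opposite sign share the same r^2.
for r in (0.2, 0.4, 0.5, -0.5, 0.6, 0.8):
    print(f"r = {r:+.2f}  ->  r^2 = {r * r:.2f}")
```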
4.3. SPEARMAN RHO COEFFICIENT OF CORRELATION
Spearman rho (𝛒), an ordinal coefficient of correlation, is used when the data are ranks. For example,
assume the principal and assistant principal have independently ranked the 15 teachers in their school
from first, most effective, to fifteenth, least effective, and you want to assess how much their ranks agree.
You can calculate the Spearman’s rho by putting the paired ranks into the Pearson r formula or by using a
formula developed specifically for rho that is simpler than the Pearson r formula if you are calculating “by hand.”
Spearman rho is interpreted in the same way as the Pearson r. Like the Pearson product moment
coefficient of correlation, it ranges from -1.00 to +1.00. When each individual has the same rank on both
variables, the rho correlation will be +1.00, and when their ranks on one variable are exactly the opposite
of their ranks on the other variable, rho will be -1.00. If there is no relationship between the rankings, the
rank correlation coefficient will be 0.
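A minimal Python sketch of the rho computation uses the shortcut formula for untied ranks; the teacher rankings here are invented:

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rho via the shortcut formula (assumes no tied ranks):
    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is each pair's rank difference."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical rankings of five teachers by a principal and an assistant principal
principal = [1, 2, 3, 4, 5]
assistant = [2, 1, 4, 3, 5]
print(spearman_rho(principal, principal))  # identical rankings -> 1.0
print(spearman_rho(principal, assistant))  # 0.8
```

When ties are present, rho should instead be computed by putting the paired ranks through the Pearson r formula, as the text notes.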
4.4. THE PHI COEFFICIENT
The phi (𝛗) coefficient is used when both variables are genuine dichotomies scored 1 or 0. For example,
phi would be used to describe the relationship between the gender of high school students and whether
they are counseled to take college preparatory courses or not. Gender is dichotomized as male=0,
female=1. Being counseled to take college preparatory courses is scored 1, and not being so counseled is
scored 0. It is possible to enter the pairs of dichotomous scores (1’s and 0’s) into a program that computes
Pearson r’s and arrive at the phi coefficient. If you find the phi coefficient in school A is -.15, it indicates
that there is a slight tendency to counsel more boys than girls to take college preparatory courses. If in
school B the phi coefficient is -.51, it indicates a strong tendency in the same direction. As with the other
correlations, the phi coefficient indicates both direction and strength of relationships. A variety of
correlation coefficients are available for use with ordinal and nominal data. These include coefficients for
data that are more than just pairs; for example, assessing the agreement of three or more judges ranking
the performance of the same subjects. We highly recommend Siegel and Castellan’s Nonparametric
Statistics (1988). We consider it a remarkably well-organized and easy-to-understand text.
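To illustrate the phi computation described above, here is a small Python sketch; the school data are invented, and phi is obtained exactly as the text suggests, by running the paired 0/1 scores through the Pearson formula:

```python
import math

def phi(x, y):
    """Phi coefficient: the Pearson r applied to paired dichotomous (0/1) scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

# Invented records: gender (male = 0, female = 1) and
# whether the student was counseled to take college prep courses (yes = 1, no = 0)
gender    = [0, 0, 0, 0, 1, 1, 1, 1]
counseled = [1, 1, 1, 0, 1, 0, 0, 0]
print(round(phi(gender, counseled), 2))  # -0.5: more boys than girls counseled
```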
5. CONSIDERATIONS FOR INTERPRETING A CORRELATION COEFFICIENT
The coefficient of correlation may be simple to calculate, but it can be tricky to interpret. It is probably
one of the most misinterpreted and/or overinterpreted statistics available to researchers. Various
considerations need to be taken into account when evaluating the practical utility of a correlation. The
importance of the numerical value of a particular correlation may be evaluated in four ways: (1) by
considering the nature of its population and the shape of its distribution, (2) by its relation to other
correlations of the same or similar variables, (3) by its absolute size and its predictive validity,
or (4) by its statistical significance.
5.1. THE NATURE OF THE POPULATION AND THE SHAPE OF ITS DISTRIBUTION
The value of an observed correlation is influenced by the characteristics of the population in which it is
observed. For example, a mathematics aptitude test that has a high correlation with subsequent math
grades in a regular class where students range widely on both variables would have a low correlation in a
gifted class. This is because the math aptitude scores in the gifted class are range restricted (less spread
out) compared to those in a regular class. Range restrictions of either the predictor or the criterion scores
reduce the strength of the observed correlation. Before proceeding to interpret your correlation results,
produce a scattergram to determine if you have a range restriction problem. Also, if your population differs
from the population in which a correlation was reported, that correlation only provides an estimate of
correlation in your population of interest. The more your population differs from the original population,
the less useful the estimate becomes. In planning a correlational study, if you think variables such as
home language or gender will influence your correlation of interest, you can draw random samples of
equal numbers from each subgroup to assess the influence of these variables.
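The effect of range restriction described above can be demonstrated with a small, deliberately constructed Python example; the scores are artificial, with y tracking x except for a fixed ±2 wobble:

```python
import math

def pearson_r(x, y):
    """Pearson product moment correlation (population formula)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

# Artificial scores: y follows x with a fixed +/-2 wobble
x = list(range(1, 21))
y = [xi + (2 if xi % 2 else -2) for xi in x]

full = pearson_r(x, y)            # full range of x (1..20)
top  = pearson_r(x[14:], y[14:])  # range restricted to the top scorers (15..20)
print(round(full, 2), round(top, 2))  # 0.94 0.51
```

The relationship between x and y is the same in both groups; only the spread of x shrinks, yet the observed correlation drops sharply, just as the aptitude-test example predicts for a gifted class.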
5.2. COMPARISON TO OTHER CORRELATIONS
A useful correlation is one that is higher (in either direction) than other correlations of the same or similar
variables. For example, an r of .75 would be considered low for the relationship between the results of
two equivalent forms of an achievement test because equivalent forms of most achievement tests correlate
with each other by more than .90. A correlation of .80 between a measure of academic aptitude and GPA
of middle school students would be considered high because the correlation for other measures of
academic aptitude and GPA for this population is typically approximately .70. As we have previously
stated, a measure that can be used with high school seniors that correlates .60 with their subsequent
college freshman GPA would be excellent because currently available measures correlate between .40 and
.45 with college GPAs.
5.3. PRACTICAL UTILITY
Always consider the practical significance of the correlation coefficient. Although a correlation
coefficient may be statistically significant, it may have little practical utility. With a sample of 1000, a
very small coefficient such as .08 would be statistically significant at the .01 level. But of what practical
importance would this correlation be? Information on X accounts for less than 1 percent (.08² = .0064,
or 0.64 percent) of the variance in Y (r²). In this case, it would hardly be worth the bother of collecting
scores on a predictor variable, X, to predict another variable, Y. You want to avoid the significance
fallacy—the assumption that a statistically significant correlation also has practical significance.
Statistical significance alone is not sufficient. How worthwhile a correlation may be is partly a function of
its predictive utility in relation to the cost of obtaining predictor data. A predictor with a high correlation
that is difficult and expensive to obtain may be of less practical value than a cheap and easy predictor
with a lower correlation. Also, note that a correlation coefficient only describes the degree of relationship
between given operational definitions of predictor and predicted variables in a particular research
situation for a given sample of subjects. It can easily change in value if the same variables are measured
and correlated using different operational definitions and/or a different sample. Failure to find a
statistically significant relationship between two variables in one study does not necessarily mean there is
no relationship between the variables. It only means that in that particular study, sufficient evidence for a
relationship was not found. Recall from Chapter 6 that other factors, such as reliability of the measures
used and range of possible values on the measures, influence the size of a correlation coefficient.
5.4. STATISTICAL SIGNIFICANCE
In evaluating the size of a correlation, it is important to consider the size of the sample on which the
correlation is based. Without knowing the sample size, you do not know if the correlation could easily
have occurred merely as a result of chance or is likely to be an indication of a genuine relationship. If
there were fewer than 20 cases in the sample (which we would not recommend), then a “high” r of .50
could easily occur by chance. You should be very careful in attaching too much importance to large
correlations when small sample sizes are involved; an r found in a small sample does not necessarily
mean that a correlation exists in the population. To avoid the error of inferring a relationship in the
population that does not really exist, the researcher should state the null hypothesis that the population
correlation equals 0 (H₀: ρxy = 0) and then determine whether the obtained sample correlation departs
sufficiently from 0 to justify the rejection of the null hypothesis. In Chapter 7, we showed you how to use
Table A.3 in the Appendix, which lists critical values of r for different numbers of degrees of freedom
(df). By comparing the obtained r with the critical values of r listed in the table, you can determine the
statistical significance of a product moment correlation. For example, assume a correlational study
involving the paired math and spelling test scores of 92 students yields a correlation of .45. Recall that for
the Pearson r the degrees of freedom are the number of paired scores minus 2 (n-2). In Table A.3, we find
that for 90 degrees of freedom, r=.2050 or greater is statistically significant at the .05 level of
significance, .2673 or greater is statistically significant at the .01 level, and .3375 or greater is statistically
significant at the .001 level (all two tailed). Therefore, the hypothesis that the population correlation is
zero can be rejected at the .01 level and even at the .001 level; therefore, you conclude that there is a
positive relationship between math and spelling scores.
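Critical-value tables such as Table A.3 are derived from the t distribution; as a sketch (not a substitute for the table itself), the obtained r can be converted to a t statistic with n - 2 degrees of freedom and compared with a tabled critical t:

```python
import math

def t_for_r(r, n):
    """Convert an obtained Pearson r into a t statistic with n - 2 df:
    t = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# The chapter's example: r = .45 from 92 paired scores (90 df)
t = t_for_r(0.45, 92)
print(round(t, 2))  # 4.78, well beyond the critical value at the .001 level
```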
5.5. DETERMINING SAMPLE SIZE
The Pearson product moment correlation is a form of effect size. Therefore, Table A.3 in the Appendix
can be used to determine the needed sample size for a predetermined effect size and level of significance
(the tolerable probability of Type I error). For example, a researcher developed a measure of
how much a person is willing to sacrifice to achieve success and found it had very satisfactory reliability
when administered to high school seniors. He thinks it may be a useful predictor of success in college.
Since previous research has shown that the predictor variables high school GPA, ACT test scores, and
CEEB test scores of high school seniors all correlate around .40 with the criterion variable college
freshmen GPAs, the researcher decides that if his scores correlate .40 or higher with college GPAs it is
worth further investigation. If it is less than .40, it is not worth pursuing. He sets his desired level of
significance at the two-tailed .01 level. You see in Table A.3 that if the true population correlation is
.3932 or greater with 40 degrees of freedom, then 40 + 2 = 42 subjects randomly selected from that
population are needed to reject the null hypothesis that the population correlation is zero.
The larger the sample, the more likely the sample statistics are to approximate the population parameters.
Note that this is true only when generalizing results from a random sample back to the population from
which it was drawn. If the researcher drew the sample from high school seniors in Peoria, Illinois, he
could only directly apply results to Peoria, Illinois, seniors. The usefulness of the result for predicting
scores for a different population depends on how similar that population is to the Peoria senior
population. Before disseminating the results of this study, the researcher should correlate the scores on his
sacrifice-for-success test with high school GPA and ACT and CEEB scores. If any or all of these scores
correlate highly with the sacrifice-for-success measure, it is largely repeating information already known.
Therefore, it is not adding enough to the prediction of college GPA to be worthwhile. If the correlations
are low, the sacrifice-for-success scores would be useful for increasing the predictive validity of the
combined weighted scores currently in use.
5.6. CORRELATION AND CAUSATION
In evaluating a correlational study, one of the most frequent errors is to interpret a correlation as
indicating a cause-and-effect relationship. Correlation is a necessary but never a sufficient condition for
causation. For example, if a significant positive correlation is found between the number of hours of
television watched per week and above average body weight among middle school pupils, that does not
prove that excessive television watching causes obesity. Recall from Chapter 12 that when the
independent variable is not under the investigator’s control, alternate explanations must be considered. In
this example, reverse causality is plausible. Perhaps the more overweight a child is, the more he or she is
inclined to choose television watching instead of physical activities, games, and interacting with peers.
The common-cause explanation is also plausible. Perhaps differences in family recreational patterns and
lifestyle account for both differences in weight and time spent watching television. Consider another
example. Assume a researcher finds a relationship between measures of self-esteem and academic
achievement (grades) for a sample of students. Table 13.1 summarizes the possibilities for interpreting
this observed relationship. Any number of factors could act together to lead to both self-esteem and
academic achievement: previous academic experiences, parents’ education, peer relationships, motivation,
and so on.
Let us consider the example of the relationship between the amount of violence children watch on
television and their aggression. Most research has shown a relationship between these two variables,
which many people assume is causal. However, Table 13.2 shows other explanations for this relationship.
We must stress, however, that correlation can bring evidence to bear for cause and effect. The Surgeon
General’s warning about the dangers of cigarette smoking is, in part, based on studies that found positive
correlations between the number of cigarettes smoked per day and incidence of lung cancer and other
maladies. Here, reverse causality (cancer leads to cigarette smoking) is not a credible explanation.
Various common-cause hypotheses (e.g., people who live in areas with high air pollution smoke more and
have higher cancer rates) have been shown not to be the case. Although correlational research does not
permit one to infer causality, it may generate causal hypotheses that can be investigated through
experimental research methods. For example, finding the correlation between smoking and lung cancer
led to animal experiments that allowed scientists to infer a causal link between smoking and lung cancer.
Because the results of correlational studies on humans agree with the results of experimental studies on
animals, the Surgeon General’s warning is considered well founded.
5.7. PARTIAL CORRELATION
The correlation techniques discussed so far are appropriate for examining the relationship between two
variables. In most situations, however, a researcher must deal with more than two variables, and we need
procedures that examine the relationship among several variables. Partial correlation is a technique used
to determine what correlation remains between two variables when the effect of another variable is
eliminated. We know that correlation between two variables may occur because both of them are
correlated with a third variable. Partial correlation controls for this third variable. For example, assume
you are interested in the correlation between vocabulary and problem-solving skills. Both these variables
are related to a third variable, chronological age. For example, 12-year-old children have more developed
vocabularies than 8-year-old children, and they also have more highly developed problem-solving skills.
Scores on vocabulary and problem solving will correlate with each other because both are correlated with
chronological age. Partial correlation would be used with such data to obtain a measure of correlation
with the effect of age removed. The remaining correlation between two variables when their correlation
with a third variable is removed is called a first-order partial correlation. Partial correlation may be used
to remove the effect of more than one variable. However, because of the difficulty of interpretation,
partial correlation involving the elimination of more than one variable is not often used.
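A first-order partial correlation can be computed directly from the three pairwise correlations. The following sketch uses invented values for the vocabulary (X), problem-solving (Y), and chronological-age (Z) correlations:

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation between X and Y with Z removed:
    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Invented correlations: vocabulary-problem solving = .60,
# and each correlates .70 with chronological age
print(round(partial_r(0.60, 0.70, 0.70), 3))  # 0.216
```

With age removed, the apparent .60 correlation shrinks considerably, illustrating how much of the original relationship was carried by the third variable.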