Chapter - 5 - Measurement in Research
Introduction
Measurement is the process of observing and recording the observations that are collected
as part of a research effort. There are two major issues that will be considered here.
First, you have to understand the fundamental ideas or theory involved in measuring. In
this chapter, we will focus on how we think about and assess the quality of measurement. In
the section on the concept of validity, the theory of what constitutes a good measure will
be presented. Reliability of measurement refers to consistency or dependability of
measurement, including consideration of true score theory and a variety of reliability
estimators. In the section on levels of measurement, the meaning of the four major levels
of measurement: nominal, ordinal, interval, and ratio will be discussed.
Learning objectives
After completing this chapter students will be able to:
Prior to starting any research project, it is important to determine how you are going to measure
a particular phenomenon. This process of measurement is important because it allows you to
know whether you are on the right track and whether you are measuring what you intend to
measure. Both reliability and validity are essential for good measurement, because they are
your first line of defense against forming inaccurate conclusions (i.e., incorrectly accepting or
rejecting your research hypotheses).
When people think about validity in research, they tend to think in terms of research
components. You might say that a measure is a valid one, that a valid sample was drawn,
or that the design had strong validity, but all of those statements are technically incorrect.
Measures, samples, and designs don't have validity—only propositions can be said to be
valid. Technically, you should say that a measure leads to valid conclusions or that a
sample enables valid inferences, and so on. It is a proposition, inference, or conclusion
that can have validity.
Figure 1, given below, shows that two realms are involved in research. The first,
on the top, is the land of theory. It is what goes on inside your head. It is where you keep
your theories about how the world operates. The second, on the bottom, is the land of
observations. It is the real world into which you translate your ideas: your programs,
treatments, measures, and observations. When you conduct research, you are continually
flitting back and forth between these two realms, between what you think about the world
and what is going on in it. When you are investigating a cause-effect relationship, you
have a theory (implicit or otherwise) of what the cause is (the cause construct). For
instance, if you are testing a new educational program, you have an idea of what it would
look like ideally. Similarly, on the effect side, you have an idea of what you are ideally
trying to affect and measure (the effect construct). But each of these—the cause and the
effect—have to be translated into real things, into a program or treatment and a measure
or observational method. The term operationalization is used to describe the act of
translating a construct into its manifestation. In effect, you take your idea and describe it
as a series of operations or procedures. Now, instead of it being only an idea in your
mind, it becomes a public entity that others can look at and examine for themselves. It is
one thing, for instance, for you to say that you would like to measure self-esteem (a
construct). But when you show a ten-item paper-and-pencil self-esteem measure that you
developed for that purpose, others can look at it and understand more clearly what you
intend by the term self-esteem.
Imagine that you want to examine whether use of a World Wide Web virtual classroom
improves student understanding of course material. Assume that you took these two constructs,
the cause construct (the Web site) and the effect construct (understanding), and operationalized
them, turned them into realities by constructing the Web site and a measure of knowledge of
the course material. Here are the four validity types and the question each addresses:
Conclusion Validity: In this study, is there a relationship between the two variables? In
the context of the example, the question might be worded: in this study, is there a
relationship between the Web site and knowledge of course material? There are several
conclusions or inferences you might draw to answer such a question. You could, for
example, conclude that there is a relationship. You might conclude that there is a
positive relationship. You might infer that there is no relationship. You can assess the
conclusion validity of each of these conclusions or inferences.
Internal Validity: Assuming that there is a relationship in this study, is the relationship a
causal one? Just because you find that use of the Web site and knowledge are
correlated, you can't necessarily assume that Web site use causes the knowledge. Both
could, for example, be caused by the same factor. For instance, it may be that wealthier
students, who have greater resources, would be more likely to have access to a Web site
and would excel on objective tests. When you want to make a claim that your program
or treatment caused the outcomes in your study, you can consider the internal validity
of your causal claim.
Construct Validity: Assuming that there is a causal relationship in this study, can you
claim that the program reflected your construct of the program well and that your
measure reflected well your idea of the construct of the measure? In simpler terms, did
you implement the program you intended to implement and did you measure the
outcome you wanted to measure? In yet other terms, did you operationalize well the
ideas of the cause and the effect? When your research is over, you would like to be able
to conclude that you did a credible job of operationalizing your constructs—you can
assess the construct validity of this conclusion.
External Validity: Assuming that there is a causal relationship in this study between the
constructs of the cause and the effect, can you generalize this effect to other persons,
places, or times? You are likely to make some claims that your research findings have
implications for other groups and individuals in other settings and at other times. When
you do, you can examine the external validity of these claims.
Notice how the question that each validity type addresses presupposes an affirmative
answer to the previous one. This is what I mean when I say that the validity types build
on one another. Figure 2 shows the idea of cumulativeness as a staircase, along with the
key question for each validity type.
Figure 2. The validity staircase, showing the major question for each type of validity
Activity-1-
Assuming that you would like to conduct research on the causal relationship between the
variables in the research question “How will employee motivation influence employee
productivity in an organization?”, identify and discuss the four types of validity and the type of
question each addresses in the course of answering the research question.
Commentary
Considering the relationship between employee motivation and employee performance, you
need to develop a data collection instrument that allows you to show that there is a relationship
between the cause and the effect (conclusion validity) and that the change in the dependent
variable is due to the change in the independent variable (internal validity). The other validity
questions are whether you are able to generalize to the construct, i.e., whether you have
correctly measured the construct (construct validity), and whether you can generalize to other
people and other organizations (external validity).
Terms such as consistency, predictability, dependability, stability, and repeatability are the
terms that come to mind when we talk about reliability. Broadly defined, reliability of a
measurement refers to the consistency or repeatability of the measurement of some
phenomena. If a measurement instrument is reliable, that means the instrument can measure
the same thing more than once or using more than one method and yield the same result.
When we speak of reliability, we are not speaking of individuals; we are actually talking about
scores.
If you think about how we use the word reliable in everyday language, you might get a hint. For
instance, we often speak about a machine as reliable: "I have a reliable car." Or, news people
talk about a "usually reliable source." In both cases, the word reliable usually means dependable
or trustworthy. In research, the term reliable also means dependable in a general sense, but
that's not a precise enough definition. What does it mean to have a dependable measure or
observation in a research context? The reason dependable is not a good enough description is
that it can be confused too easily with the idea of a valid measure. Certainly, when researchers
speak of a dependable measure, we mean one that is both reliable and valid. So we have to be a
little more precise when we try to define reliability.
The observed score is one of the major components of reliability. The observed score is just
that: the score you would observe in a research setting. The observed score is composed of a
true score and an error score. The true score is a theoretical concept. Why is it theoretical?
Because there is no way to really know what the true score is. The true score reflects the true
value of a variable. The error score is the reason why the observed score differs from the true score. The
error score is further broken down into method (or systematic) error and trait (or random)
error. Method error refers to anything that causes a difference between the observed score and
true score due to the testing situation. For example, any type of disruption (loud music, talking,
traffic) that occurs while students are taking a test may cause the students to become distracted
and may affect their scores on the test. On the other hand, trait error is caused by any factors
related to the characteristic of the person taking the test that may randomly affect
measurement. An example of trait error at work is when individuals are tired, hungry, or
unmotivated. These characteristics can affect their performance on a test, making the scores
seem worse than they would be if the individuals were alert, well-fed, or motivated.
Reliability can be viewed as the ratio of the true score over the true score plus the error score,
or:

    Reliability = True score / (True score + Error score)
Okay, now that you know what reliability is and what its components are, you're probably
wondering how to achieve reliability. Simply put, the degree of reliability can be increased by
decreasing the error score. So, if you want a reliable instrument, you must decrease the error.
As previously stated, you can never know the actual true score of a measurement. Therefore, it
is important to note that reliability cannot be calculated; it can only be estimated. The best way
to estimate reliability is to measure the degree of correlation between the different forms of a
measurement. The higher the correlation, the higher the reliability.
Before going on to the types of reliability, I must briefly review the three major aspects of
reliability: equivalence, stability, and homogeneity. Equivalence refers to the degree of
agreement between two or more measures administered at nearly the same time. Stability
requires a distinction between the repeatability of the measurement and that of the
phenomenon being measured; it is assessed by repeating the same measurement over time. Lastly,
homogeneity deals with assessing how well the different items in a measure seem to reflect the
attribute one is trying to measure. The emphasis here is on internal relationships, or internal
consistency.
Types of Reliability
Now back to the different types of reliability. The first type of reliability is parallel forms
reliability. This is a measure of equivalence, and it involves administering two different forms to
the same group of people and obtaining a correlation between the two forms. The higher the
correlation between the two forms, the more equivalent the forms.
The second type of reliability, test-retest reliability, is a measure of stability which examines
reliability over time. The easiest way to measure stability is to administer the same test at two
different points in time (to the same group of people, of course) and obtain a correlation
between the two tests. The problem with test-retest reliability is the amount of time you wait
between testing. The longer you wait, the lower your estimation of reliability.
Finally, the third type of reliability is inter-rater reliability, a measure of homogeneity. With
inter-rater reliability, two people rate a behavior, object, or phenomenon and determine the
amount of agreement between them. To determine inter-rater reliability, you take the number
of agreements and divide them by the number of total observations.
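The agreement calculation just described can be sketched as follows; the two raters and their on-task/off-task codes are hypothetical.

```python
def inter_rater_agreement(rater_a, rater_b):
    """Percent agreement: number of matching ratings / total observations."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same observations")
    agreements = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return agreements / len(rater_a)

# Hypothetical ratings of ten observed behaviors by two observers
rater_1 = ["on-task", "off-task", "on-task", "on-task", "off-task",
           "on-task", "on-task", "off-task", "on-task", "on-task"]
rater_2 = ["on-task", "off-task", "on-task", "off-task", "off-task",
           "on-task", "on-task", "on-task", "on-task", "on-task"]

print(inter_rater_agreement(rater_1, rater_2))  # 8 agreements / 10 = 0.8
```

A value close to 1.0 indicates that the two raters are applying the rating scheme consistently.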
A measurement can be reliable, but not valid. However, a measurement must first be reliable
before it can be valid. Thus reliability is a necessary, but not sufficient, condition of validity. In
other words, a measurement may consistently assess a phenomenon (or outcome), but unless
that measurement tests what you want it to, it is not valid.
Remember that when designing a research project, it is important that your measurements are
both reliable and valid. If they aren't, then your instruments are basically useless and you
decrease your chances of accurately measuring what you intended to measure.
Activity-2-
Assuming that you would like to conduct research on the causal relationship between the
variables in the research question “How will employee motivation influence employee productivity
in an organization?”, identify and discuss the three aspects of measurement reliability and the
type of question each addresses in the course of answering the research question.
5.3. Measurement Error
True score theory is a good simple model for measurement, but it may not always be an
accurate reflection of reality. In particular, it assumes that any observation is composed of the
true value plus some random error value; but is that reasonable? What if all error is not
random? Isn't it possible that some errors are systematic, that they hold across most or all of the
members of a group? One way to deal with this notion is to revise the simple true score model
by dividing the error component into two subcomponents, random error and systematic error.
Figure 3 shows these two components of measurement error, what the difference between
them is, and how they affect research.
Random error is caused by any factors that randomly affect measurement of the variable across
the sample. For instance, people's moods can inflate or deflate their performance on any
occasion. In a particular testing, some children may be in a good mood and others may be
depressed. If mood affects the children's performance on the measure, it might artificially inflate
the observed scores for some children and artificially deflate them for others. The important
thing about random error is that it does not have any consistent effects across the entire
sample. Instead, it pushes observed scores up or down randomly. This means that if you could
see all the random errors in a distribution they would have to sum to 0. There would be as many
negative errors as positive ones. (Of course, you can't see the random errors because all you see
is the observed score X. God can see the random errors, but she's not telling us what they are!)
The important property of random error is that it adds variability to the data but does not affect
average performance for the group (Figure 3). Because of this, random error is sometimes
considered noise.
Figure 3. Random error adds variability to a distribution but does not affect central
tendency (the average)
Systematic error is caused by any factors that systematically affect measurement of the
variable across the sample. For instance, if there is loud traffic going by just outside of a
classroom where students are taking a test, this noise is liable to affect all of the children's
scores—in this case, systematically lowering them. Unlike random error, systematic
errors tend to be either positive or negative consistently; because of this, systematic error
is sometimes considered to be bias in measurement (Figure 4 ).
Figure 4. Systematic error affects the central tendency of a distribution
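The contrast between the two error types can be illustrated with a short simulation. The score distribution and the size of the systematic bias are arbitrary assumed values, chosen only to make the pattern visible.

```python
import random
import statistics

random.seed(7)

# Hypothetical true scores for 10,000 test-takers
true_scores = [random.gauss(70, 8) for _ in range(10_000)]

# Random error: pushes individual scores up or down; errors average out to ~0
with_random_err = [t + random.gauss(0, 5) for t in true_scores]

# Systematic error: a constant bias, e.g. traffic noise lowering every score
BIAS = -4  # assumed size of the disruption, for illustration only
with_systematic_err = [t + BIAS for t in true_scores]

print(f"true mean:             {statistics.fmean(true_scores):.1f}")
print(f"random-error mean:     {statistics.fmean(with_random_err):.1f}")   # nearly unchanged
print(f"systematic-error mean: {statistics.fmean(with_systematic_err):.1f}")  # shifted down
print(f"true SD:               {statistics.stdev(true_scores):.1f}")
print(f"random-error SD:       {statistics.stdev(with_random_err):.1f}")   # larger: added noise
```

Random error leaves the group mean essentially intact while inflating the spread (noise); systematic error shifts the whole distribution by the bias amount while leaving the spread alone.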
Reducing Measurement Error
So, how can you reduce measurement errors, random or systematic? One thing you can do is to
pilot test your instruments to get feedback from your respondents regarding how easy or hard
the measure was and information about how the testing environment affected their
performance. Second, if you are gathering measures using people to collect the data (as
interviewers or observers), you should make sure you train them thoroughly so that they aren't
inadvertently introducing error. Third, when you collect the data for your study you should
double-check the data thoroughly. All data entry for computer analysis should be double-
punched and verified. This means that you enter the data twice, the second time having your
data-entry machine check that you are typing the exact same data you typed the first time.
Fourth, you can use statistical procedures to adjust for measurement error. These range from
rather simple formulas you can apply directly to your data to complex modeling procedures for
modeling the error and its effects. Finally, one of the best things you can do to deal with
measurement errors, especially systematic errors, is to use multiple measures of the same
construct. Especially if the different measures don't share the same systematic errors, you will
be able to triangulate across the multiple measures and get a more accurate sense of what's
happening.
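The double-entry (double-punch) check described above can be sketched as a simple comparison of the two entry passes; the data values here are made up for illustration.

```python
def verify_double_entry(first_pass, second_pass):
    """Return the positions at which the two data-entry passes disagree."""
    return [i for i, (a, b) in enumerate(zip(first_pass, second_pass)) if a != b]

entry_1 = [23, 45, 31, 52, 38]   # first keying of the raw data
entry_2 = [23, 45, 13, 52, 38]   # second keying; transposition typo at index 2

mismatches = verify_double_entry(entry_1, entry_2)
print(mismatches)  # flagged positions to re-check against the source records
```

Any flagged position is resolved by going back to the original paper record before analysis, so that simple keying errors do not become measurement error in the data set.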
Activity-3-
Considering the question under Activity 2, attempt to explain the difference between
random error and systematic error in measurement by giving examples in the context of
the given research question, and suggest how to reduce measurement errors.
Commentary
Your ability to answer your research question is partly determined by the quality of your
data, which is in turn influenced by your sampling procedure and by potential random
and systematic errors in measurement. You therefore need to exercise the utmost care in
the data collection process, the method of data collection, and the sampling procedure.
On the surface, measurement may appear to be a very simple process. It is simple as long
as we are measuring objective properties, which are physically verifiable characteristics
such as age, income, number of bottles purchased, store last visited, and so on. However,
researchers often desire to measure subjective properties, which cannot be directly
observed because they are mental constructs such as a person’s attitude or intentions. In
this case, the researcher must ask a respondent to translate his/her mental constructs onto
a continuum of intensity, which is no easy task. To do this, the researcher must develop question
formats that are very clear and that are used identically by the respondents. This process
is known as scale development.
5.4.2. Description
Description refers to the use of a unique descriptor, or label, to stand for each designation
in the scale. For instance, “yes” and “no”, “agree” and “disagree”, and the number of
years of a respondent’s age are descriptors of a simple scale. All scales include
description in the form of characteristic labels that identify what is being measured.
5.4.3. Order
Order refers to the relative sizes of the descriptors. Here, the key word is “relative” and
includes such descriptors as “greater than”, “less than”, or “equal to”. A respondent’s
least preferred brand is “less than” his/her most preferred brand, and respondents who
check the same income category indicate the same amount (“equal to”). Not all scales possess
order characteristics. For instance, is a “buyer” greater than or less than a “non-buyer”?
We have no way of making a relative size distinction.
5.4.4. Distance
A scale has the characteristic of distance when the absolute differences between the
descriptors are known and may be expressed in units. The respondent who purchases
three bottles of diet cola buys two more than the one who purchases only one bottle; a
three-car family owns one more automobile than a two-car family. Note that when the
characteristic of distance exists, we are also given order. We know not only that the
three-car family has “more” cars than the two-car family, but also the distance between
the two (one car).
5.4.5. Origin
A scale is said to have the characteristic of origin if there is a unique beginning or true
zero point for the scale. Thus, 0 is the origin for an age scale just as it is for the number of
miles travelled to the store or the number of bottles of soda consumed. Not all scales have
a true zero point for the property they are measuring. In fact, many scales used by
researchers have arbitrary neutral points, but they do not possess origins. For instance,
when a respondent says “no opinion” to the question “Do you agree or disagree with the
statement ‘The Lexus is the best car on the road today’?”, we cannot say that the person
has a true zero level of agreement.
Perhaps you noticed that the scaling characteristics are cumulative, and that description is
the most basic characteristic, present in every scale. If a scale has order, it also possesses
description. In other words, if a scale has a higher-level property, it also has all
lower-level properties. But the opposite is not true.
Activity-4-
Assume that you are given an assignment by your immediate supervisor to develop a
questionnaire in order to measure the level of customer satisfaction with a given company’s
product and to develop a customer profile. Attempt to develop a measurement scale and
describe the properties of the four types of measurement scales with reference to your assignment.
5.5. Levels of measurement scales
You may ask, “Why is it important to know the characteristics of scales?” The answer is
that the characteristics possessed by a scale determine that scale’s level of measurement.
In turn, which descriptive statistics are most appropriate for your data will depend on the
measurement scale used in collecting information on each particular item. We have four
levels of measurement: nominal, ordinal, interval, and ratio scales.
The table below shows how each scale type differs with respect to the scaling
characteristics we have just discussed:

Scale       Description   Order   Distance   Origin
--------    -----------   -----   --------   ------
Nominal     Yes           No      No         No
Ordinal     Yes           Yes     No         No
Interval    Yes           Yes     Yes        No
Ratio       Yes           Yes     Yes        Yes
Nominal scales are defined as those that use only labels; that is, they possess only the
characteristic of description. Examples include designations of race, religion, type of
dwelling, gender, brand last purchased, or buyer/non-buyer; answers that involve yes-no or
agree-disagree; or any other instance in which the descriptors cannot be differentiated
except qualitatively. If you describe respondents in a survey according to their
occupation (banker, doctor, computer programmer), you have used a nominal scale. Note
that these examples of a nominal scale only label the respondent. There is no ordering
among the categories (i.e., male is not “greater” or “less” than female), and averaging is
not appropriate for this type of data. The measures used to describe this type of data are
the percentages that fall into each category or the mode (the most commonly selected
category).
Nominal scales are for classification – they are not “measures” in the true sense of the
term as they do not represent “quantities”, “magnitudes”, “frequencies”, or the like.
You use nominal scales when the categories are exhaustive (include all alternatives, even
though one choice may be “other”) and mutually exclusive (none fall into more than one
category).
You can summarize these data as the percentage of respondents who fall into each
category or as the mode, which is the term used to describe the most common category
selected. The mode can be used to express the “middle” of the distribution; however, the
frequency distribution (percentages in each category) will generally suffice. There is no
measure of variability for this type of data.
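The nominal summaries described above (frequency distribution and mode) can be computed as follows; the occupation data are hypothetical.

```python
from collections import Counter

# Hypothetical nominal data: occupations of eight survey respondents
occupations = ["banker", "doctor", "banker", "programmer",
               "banker", "doctor", "programmer", "banker"]

counts = Counter(occupations)
mode = counts.most_common(1)[0][0]          # most commonly selected category
percentages = {k: 100 * v / len(occupations) for k, v in counts.items()}

print(mode)         # the mode of the distribution
print(percentages)  # frequency distribution in percent
```

Note that no mean or standard deviation appears here: with nominal data, counting category membership is the only legitimate operation.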
Ordinal scales reflect ordered categories (e.g., small, medium, and large). We can say one
category reflects more of the attribute we are measuring than does another, or that the top
category reflects more than those below it, but we cannot say how much more.
Satisfaction, for example, can be rated from less to more highly satisfied, but we are not
sure if the difference between being “very satisfied” and “satisfied” is really equivalent to
the difference between being “dissatisfied” and being “very dissatisfied”. This limits the
statistics we can use with this type of data. Averages, for example, are not really
appropriate here. As a result, the statistical tests for these types of data tend to use other
approaches, such as looking at rankings across the categories in each group.
Most items on surveys have ordinal scale alternatives as selections that the respondents
can choose from. We generally use 5-point scales that include negative categories (“very
dissatisfied”, “dissatisfied”), a “neutral” midpoint, and positive categories (“satisfied”,
“very satisfied”).
These data can also be summarized as the percentage of respondents who fall into each
category. The median can be used to indicate the centre or mid-point of the distribution,
and the interquartile range can be used as an indication of variability in the data.
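These ordinal summaries (median and interquartile range) can be sketched as follows; the satisfaction ratings, coded 1 (very dissatisfied) through 5 (very satisfied), are hypothetical.

```python
import statistics

# Hypothetical ordinal data: satisfaction coded 1..5 for twelve respondents
ratings = [4, 5, 3, 4, 2, 5, 4, 3, 4, 1, 5, 4]

median = statistics.median(ratings)
q1, q2, q3 = statistics.quantiles(ratings, n=4)  # quartile cut points
iqr = q3 - q1

print(median)  # centre of the distribution
print(iqr)     # interquartile range as a measure of spread
```

The median and IQR only rely on the ordering of the codes, not on the (unknown) distances between them, which is what makes them appropriate for ordinal data.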
Some users of statistics feel comfortable applying statistics for interval scales to these
data if items are summed to produce a total score (e.g. a satisfaction or loyalty index) or
there are a wide range of ordered categories, but purists are uncomfortable with this
approach. There are however, a variety of procedures (referred to as non-parametric
statistics) that can be used to test for the significance of differences between groups or
subgroups or to assess the significance of changes that occur over time.
Interval scales are those in which the distance between each descriptor is known. The
distance is normally defined as one scale unit. For example, a coffee brand rated “3rd” in
taste is one unit away from one rated “4th”. Sometimes the researcher must impose a
belief that equal intervals exist between the descriptors. That is, if you were asked to
evaluate a store’s salespeople by selecting a single designation from a list of “extremely
friendly”, “very friendly”, “somewhat friendly”, “somewhat unfriendly”, “very
unfriendly”, or “extremely unfriendly”, the researcher would probably assume that each
designation was one unit away from the preceding one. In these cases, we say that the
scale is “assumed interval”. As shown in Table 2, such descriptors are evenly spaced
on a questionnaire; as such, the labels connote a continuum and the check lines are equal
distances apart. By wording or spacing the response options on a scale so they appear to
have equal intervals between them, the researcher achieves a higher level of
measurement than ordinal or nominal.
For this type of data, we can compute mean (average) scores or medians and percentiles.
Typically, the median is preferred if the data are skewed (biased towards lower or higher
scores) or the range can go to very high values (as in housing costs or income level) as
the median is less affected by skew and outliers than is the mean. Measures of variance,
the standard deviation, or the median absolute deviation can be computed for sample
means to express the range within which the population mean is likely to lie.
The statistics for testing for group differences or changes over time that are available for
this type of data (e.g. the t-test, the analysis of variance or ANOVA, etc) tend to be more
powerful.
Ratio scales are ones in which a true origin exists, such as an actual number of purchases
in a certain time period, dollars spent, miles travelled, number of children, or years of
college education. This characteristic allows us to construct ratios when comparing
results of the measurement. One person may spend twice as much as another, or travel
one-third as far. Such ratios are inappropriate for interval scales, so we are not allowed to
say that one store is one-half as friendly as another.
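The interval/ratio distinction can be sketched numerically; the spending figures (in Birr) and the friendliness ratings below are hypothetical.

```python
import statistics

# Hypothetical ratio-scaled data: Birr spent per store visit (true zero exists)
spend = [40, 80, 20, 60, 100]
print(statistics.fmean(spend))   # means are meaningful
print(max(spend) / min(spend))   # ratios are meaningful: 5x as much spent

# Hypothetical interval-scaled data: friendliness rating, 1..7, arbitrary zero.
# A store rated 6 is not "twice as friendly" as one rated 3; only the
# difference (6 - 3 = 3 scale units) is interpretable.
friendliness = [6, 3]
print(friendliness[0] - friendliness[1])
```

Because the spending scale has a true zero, the ratio 100/20 says something real about behavior; the friendliness codes have only an arbitrary origin, so only differences between them carry meaning.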
1. Please rank each brand in terms of your preference. Place a “1” by your first choice, a
“2” by your second choice, and so on.
___________ Pepsi
___________ Seven-Up
Loyal Vs Hadiya
Bambis Vs Tana
C. Interval-scaled questions
Lacoste 1 2 3 4 5
Parker 1 2 3 4 5
Sony 1 2 3 4 5
2. Indicate your degree of agreement with the following statements by circling the appropriate
number
Disagree                              Agree
a. I love to cook 1 2 3 4 5
D. Ratio-scaled questions
________ years.
2. Approximately how many times in the last week have you purchased anything over Birr 5 in
value at Tana Super market?
0 1 2 3 4 5 More (specify)
3. How much do you think a typical purchaser of a Birr 100,000 term life insurance policy pays
per year for that policy? Birr ________
4. What is the probability that you will use a lawyer’s service when you are ready to make a
will?_______ percent
Description: Statement with which respondent shows the amount of agreement / disagreement to a
specific measurement question. It is the characteristics of an ordinal scale.
Description: Scale is inscribed between two bipolar words and respondent selects the point that most
represents the direction and intensity of his / her feelings
5. Rank order
Description: Respondent is asked to rate or rank each option with reference to a specific research
variable. This allows the researcher to obtain information on relative preferences, importance, etc. Long
lists should be avoided (respondents generally find it difficult to rank more than 5 items).
Example: Please indicate, in rank order, your preferred chewing gum brands, putting 1 next to your
favorite through to 5 for your least favorite.
Poppotine
Strawberry
Special mint
Wow
Banana
The level of measurement determines what information you will have about the object of
study; it determines what you can say and what you can not say about the object. For
example, nominal scales convey the lowest information level, and therefore they are
sometimes considered the crudest scales. Nominal scales allow us to do nothing more
than identify our object of study on some property. Ratio scales, however, contain the
greatest amount of information; they allow us to say many things about our object. Yet, it
is not always possible to have a true zero point.
The level of measurement dictates what type of statistical analysis you may or may not
perform. Higher-level scales permit much more sophisticated analysis. In other words, the
amount of information contained in the scale dictates the limits of statistical analysis.
Activity-5-
Explain the implication of level of measurement scale in answering your research question by providing
a specific example from the examples you are given above for illustration.