LECTURE 3 - Test Development - 044659
Construction II
LECTURE 3
TEST DEVELOPMENT
The process of developing a test occurs in 5 stages:
Test Conceptualization
Test Construction
Test Tryout
Item Analysis
Test Revision
TEST CONCEPTUALIZATION
Regardless of the stimulus for developing the new test, a number of questions
immediately confront the prospective test developer.
• What is the test designed to measure?
• What is the objective of the test?
• Is there a need for this test?
• Who will use this test?
• Who will take this test?
• What content will the test cover?
• How will the test be administered?
• What types of responses will be required of testtakers?
• Who benefits from an administration of this test?
• Is there any potential for harm as the result of an administration of this test?
3. NORM-REFERENCED OR CRITERION-REFERENCED
Different approaches to test development and individual item analyses are necessary, depending
upon whether the finished test is designed to be norm-referenced or criterion-referenced.
Norm-referenced tests compare the testtaker's performance to the performance of peers in a
norming group, usually of similar age or other demographic characteristics. Criterion-referenced
tests compare the testtaker's performance to an objective standard.
Norm-referenced tests are standardized tests characterized by scoring that compares the
performance of the test-taker to a norming group (a group with similar characteristics such as age
or grade level). Examples of norm-referenced tests are the SAT and ACT and most IQ tests.
Ideally, each item on a criterion-oriented test addresses the issue of whether the testtaker has met
certain criteria. Criterion-referenced testing and assessment is commonly employed in licensing
contexts, be it a license to practice medicine or to drive a car.
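The contrast between the two interpretations can be sketched in a few lines of Python; the cutoff value and the norm group below are hypothetical.

```python
# Hedged sketch: criterion- vs. norm-referenced interpretation of one score.
# The cutoff (75) and the norming group are hypothetical illustration values.

def criterion_referenced_result(score, cutoff=75):
    """Compare the score to an objective standard (pass/fail)."""
    return "pass" if score >= cutoff else "fail"

def norm_referenced_percentile(score, norm_group):
    """Percent of the norming group scoring at or below this score."""
    return 100.0 * sum(s <= score for s in norm_group) / len(norm_group)

norm_group = [60, 70, 72, 80, 90]
print(criterion_referenced_result(74))             # fail
print(norm_referenced_percentile(74, norm_group))  # 60.0
```

Note that the same raw score of 74 fails the objective criterion yet still outperforms most of this particular norming group.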
4. PILOT WORK
In the context of test development, terms such as pilot work, pilot study, and pilot
research refer, in general, to the preliminary research surrounding the creation of a
prototype of the test.
A pilot study is a research study conducted before the intended study. Pilot studies
are usually executed as planned for the intended study, but on a smaller scale.
In pilot work, the test developer typically attempts to determine how best to measure
a targeted construct.
Pilot study may include:
• open-ended interviews with research subjects
• interviews with parents, teachers, friends, and others who know the subject
II. TEST CONSTRUCTION
1. Scaling
Scaling may be defined as the process of setting rules for assigning numbers in
measurement. In psychometrics, scales may also be conceived of as instruments used
to measure a trait, a state, or an ability.
Generally speaking, a testtaker is presumed to have more or less of the characteristic
measured by a (valid) test as a function of the test score.
The higher or lower the score, the more or less of the characteristic the testtaker
presumably possesses.
But how are numbers assigned to responses so that a test score can be calculated?
This is done through scaling the test items.
A. Types of Data & Measurement Scales: Nominal, Ordinal, Interval
and Ratio
Nominal
Nominal scales are used for labeling variables, without any quantitative value.
“Nominal” scales could simply be called “labels.” Typical examples are variables
such as gender, hair color, or favorite sport.
Notice that nominal categories are mutually exclusive (no overlap) and that none of
them has any numerical significance. A good way to remember all of this is that
“nominal” sounds a lot like “name,” and nominal scales are kind of like “names” or
labels.
Note: a sub-type of nominal scale with only two categories (e.g., male/female) is called
“dichotomous.” Other sub-types of nominal data are “nominal with order” (like “cold,
warm, hot, very hot”) and “nominal without order” (like “male/female”).
Ordinal Scale
With ordinal scales, the order of the values is what’s important and significant, but the differences
between them are not really known. Take a look at the example below. In each case, we know that
a #4 is better than a #3 or a #2, but we don’t know, and cannot quantify, how much better it is.
For example, is the difference between “OK” and “Unhappy” the same as the difference between
“Very Happy” and “Happy?” We can’t say.
Ordinal scales are typically measures of non-numeric concepts like satisfaction, happiness,
discomfort, etc.
“Ordinal” is easy to remember because it sounds like “order,” and that’s the key to remember with
ordinal scales: it is the order that matters, but that’s all you really get from them.
Interval
Interval scales are numeric scales in which we know both the order and the exact differences
between the values.
The classic example of an interval scale is Celsius temperature because the difference between each
value is the same. For example, the difference between 60 and 50 degrees is a measurable 10
degrees, as is the difference between 80 and 70 degrees.
Interval scales are nice because the realm of statistical analysis on these data sets opens up.
Like the others, you can remember the key points of an “interval scale” pretty easily. “Interval”
itself means “space in between,” which is the important thing to remember: interval scales not only
tell us about order, but also about the value between each item.
Here’s the problem with interval scales: they don’t have a “true zero.” For example, there is no
such thing as “no temperature,” at least not with Celsius.
Without a true zero, it is impossible to compute meaningful ratios.
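The missing true zero can be made concrete with a short sketch comparing Celsius (interval) to Kelvin (ratio, with a true zero); the temperatures chosen are arbitrary.

```python
# Illustration: why ratios fail on an interval scale like Celsius.

def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (ratio scale, true zero)."""
    return c + 273.15

naive_ratio = 20 / 10  # 2.0, but "twice as hot" is meaningless in Celsius
true_ratio = celsius_to_kelvin(20) / celsius_to_kelvin(10)
print(round(true_ratio, 3))  # 1.035, nowhere near a factor of two
```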
Ratio
Ratio scales have all the properties of interval scales plus a true zero point, so both
differences and ratios are meaningful. Examples include height, weight, and temperature
measured in Kelvin.
In summary, nominal variables are used to “name,” or label, a series of values. Ordinal
scales provide good information about the order of choices, such as in a customer
satisfaction survey.
Interval scales give us the order of values plus the ability to quantify the difference between
each one.
Finally, ratio scales give us the ultimate: order, interval values, and the ability to calculate
ratios, since a “true zero” can be defined.
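One practical consequence of this hierarchy can be sketched in Python: ordinal codes support the median, but applying the mean silently assumes interval-level (equal) spacing. The satisfaction codes below are hypothetical.

```python
from statistics import mean, median

# Hypothetical ordinal satisfaction ratings: 1 = Unhappy ... 4 = Very Happy.
# The numbers carry order only, not equal spacing between categories.
satisfaction = [1, 2, 2, 3, 4]

print(median(satisfaction))  # 2, respects order without assuming spacing
print(mean(satisfaction))    # 2.4, only meaningful on interval/ratio data
```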
B. Scaling Methods
Rating scale. A major type of summative rating scale is the Likert scale. Each item
presents the testtaker with five alternative responses (sometimes seven), usually on an
agree–disagree or approve–disapprove continuum.
Rating scales can be unidimensional or multidimensional, depending on whether the underlying
construct is unidimensional or multidimensional (e.g., academic aptitude, intelligence). A
unidimensional scale measures a construct along a single dimension, ranging from high to low.
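The summative Likert scoring described above can be sketched as follows; the item names and the set of reverse-keyed (negatively worded) items are hypothetical.

```python
# Hedged sketch: summing 5-point Likert responses. Reverse-keyed items are
# flipped (6 - r on a 1..5 scale) so a high total means "more of the trait".
responses = {"q1": 4, "q2": 2, "q3": 5}  # 1 = strongly disagree ... 5 = strongly agree
reverse_keyed = {"q2"}                   # hypothetical negatively worded item

total = sum((6 - r) if q in reverse_keyed else r for q, r in responses.items())
print(total)  # 4 + (6 - 2) + 5 = 13
```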
Method of paired comparisons- Testtakers are presented with pairs of stimuli (two
photographs, two objects, two statements), which they are asked to compare.
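A minimal sketch of turning paired-comparison judgments into a rank order by tallying wins; the stimuli and judgments below are hypothetical.

```python
from collections import Counter

# Each judgment records (winner, loser) for one presented pair of stimuli.
judgments = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B")]

stimuli = {s for pair in judgments for s in pair}
wins = Counter(winner for winner, _ in judgments)  # missing keys count as 0

# Rank stimuli by how often each was chosen over its paired alternative.
ranking = sorted(stimuli, key=lambda s: -wins[s])
print(ranking)  # ['A', 'B', 'C']
```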
2. WRITING ITEMS
a) Item Pool- An item pool is the reservoir or well from which items will or will not be drawn
for the final version of the test.
When devising a standardized test using a multiple-choice format, it is usually advisable that
the first draft contain approximately twice the number of items that the final version of the test
will contain.
b) Item format – Variables such as the form, plan, structure, arrangement, and layout of
individual test items are collectively referred to as item format.
Two types of item format are the selected-response format and the constructed-response format.
Items presented in a selected-response format require testtakers to select a response from a set
of alternative responses.
Items presented in a constructed response format require testtakers to supply or to create the
correct answer, not merely to select it.
3. SCORING ITEMS
a) Cumulative model - the higher the score on the test, the higher the testtaker is on the ability,
trait, or other characteristic that the test purports to measure.
b) Class or category scoring - testtaker responses earn credit toward placement in a particular
class or category with other testtakers whose pattern of responses is presumably similar in some
way.
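The two scoring models can be sketched side by side; the response patterns, prototypes, and class names below are hypothetical.

```python
# Hedged sketch of cumulative vs. class/category scoring.
responses = [1, 0, 1, 1, 1]  # 1 = keyed response endorsed, 0 = not endorsed

# a) Cumulative model: the score is simply the sum of keyed responses.
cumulative_score = sum(responses)  # 4

# b) Class/category scoring: match the whole response pattern to the
#    prototype pattern it most resembles (prototypes are hypothetical).
prototypes = {"class_X": [1, 0, 1, 1, 1], "class_Y": [0, 1, 0, 0, 1]}

def agreement(pattern, prototype):
    """Number of items on which two response patterns agree."""
    return sum(a == b for a, b in zip(pattern, prototype))

best_class = max(prototypes, key=lambda c: agreement(responses, prototypes[c]))
print(cumulative_score, best_class)  # 4 class_X
```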