Unit 1 Test Development
Free Response
– Essay, Short Answer
– Interview Questions
– Fill in the Blank
– Projective Techniques
Multiple Choice
Multiple choice is the most common format in
educational testing (and also appears in some
personality and employment testing)
– consists of a stem and a number of response
options--there should be only one right answer
– the wrong answers are called distractors because
they may appear correct--they should be realistic
enough to appeal to the uninformed test taker
– easy to score, but the downside is that test takers
can get some items correct by guessing
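A quick arithmetic sketch of that guessing advantage (hypothetical 100-item test; assumes one correct answer per item and independent guesses):

    def expected_guessing_score(n_items: int, n_options: int) -> float:
        # expected number of items answered correctly by pure random guessing
        return n_items / n_options

    for k in (2, 3, 4, 5):
        print(f"{k} options: ~{expected_guessing_score(100, k):.0f} of 100 correct by chance")

With only two options, chance alone yields about half the items correct; adding options shrinks that advantage.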
Multiple Choice
Pros
• more answer options (4-5) reduce the chance of
guessing an item correctly
• many items can aid in student comparison and
reduce ambiguity, increase reliability
Cons
• measures narrow facets of performance
• reading time increased with more answers
• transparent clues (e.g., verb tense, or whether the
stem ends in “a” or “an”) may help test takers guess
• difficult to write four or five reasonable choices
• takes more time to write questions
True/False
True/False is also used in educational
testing and some personality testing
– in educational testing the test taker can
again gain some advantage by guessing
True/False (cont.)
Ideally a true/false question should be
constructed so that an incorrect response
indicates something about the student's
misunderstanding of the learning objective.
Likert Scales
Likert scales are usually reliable and
highly popular (e.g., personality and
attitude tests)
– item is presented with an array of response
options (e.g., 1 to 5 or 1 to 7 scale), usually
on an agree/disagree or
approve/disapprove continuum
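As a purely illustrative sketch (the item wording and anchor labels below are hypothetical), a 5-point agree/disagree Likert item can be represented as:

    # Hypothetical 5-point Likert item on an agree/disagree continuum
    likert_item = {
        "text": "I enjoy meeting new people.",   # hypothetical item wording
        "anchors": {
            1: "Strongly disagree",
            2: "Disagree",
            3: "Neither agree nor disagree",
            4: "Agree",
            5: "Strongly agree",
        },
    }

    response = 4   # the test taker endorses "Agree"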
Test Types
Structured Response
– Advantages
Great Breadth
Quick Scoring
– Disadvantages
Limited Depth
Difficult to assess higher levels of skills
Guessing/Memorization vs. Knowledge
Subjective Items
subjective items are less easily scored
but provide the test taker with fewer
cues and open wider areas for
response--often used in education
– essay questions - responses can vary in
breadth and depth, and the scorer must
determine to what extent the response is
correct (often by comparing it with a
predetermined correct response)
Essay Questions
Provide a freedom of response that
facilitates assessing higher cognitive
behaviors (e.g., analysis and evaluation)
– Disadvantages
Difficult to Grade
Judgement error (e.g., low interrater reliability)
Requires an objective scoring key prepared in advance
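One way to examine the judgment-error concern is to check agreement between two raters who score the same essays. A rough sketch using hypothetical scores and a simple Pearson correlation (requires Python 3.10+):

    # Rough interrater-reliability check on essay scores (hypothetical data)
    from statistics import correlation  # Pearson correlation, Python 3.10+

    rater_a = [4, 3, 5, 2, 4, 3, 5, 1]   # rater A's scores on eight essays (0-5 rubric)
    rater_b = [4, 2, 5, 3, 4, 3, 4, 1]   # rater B's scores on the same essays

    print(f"interrater correlation: {correlation(rater_a, rater_b):.2f}")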
Writing Good Items
An art that requires originality and creativity,
combined with knowledge of the test domain and
good item-writing practices
Not all items will perform as expected--may be
too easy or difficult, may be misinterpreted, etc.
Rule of thumb to write at least twice as many items
as you expect to use
Broad vs. Narrow items
Writing Good Items (cont.)
Suggestions:
– identify item topics by consulting test plan
(increases content validity)
– ensure that each item presents a central
idea or problem
– write items drawn only from testing
universe
– write each item in clear and direct manner
Writing Good Items (cont.)
Suggestions:
– use vocabulary and language appropriate for
the target audience (e.g., age, culture)
– avoid sexist or racist language
(e.g., mailman, fireman)
– make all items independent (e.g., ask only
one question per item)
– ask an expert to review items to reduce
ambiguity and inaccuracy
Writing Administration
Instructions
Specify the testing environment to
decrease variation or error in test scores
should address:
– group or individual administration
– requirements for location (e.g., quiet)
– required equipment
– time limits or approximate completion time
– script for administrator and answers to
questions test takers may ask
Specifying Administration and
Scoring Methods
Determine such things as how test
will be administered (e.g., orally,
written, computer--individually or in
groups)
Determine the method of scoring, and also
whether the test is scored by hand by the test
administrator, accompanied by scoring software,
or sent to the test publisher for scoring
Scoring Methods
Cumulative model: most common
– assumes that the more a test taker responds in a
particular fashion the more he/she has of the
attribute being measured (e.g., more “correct”
answers, or endorses higher numbers on a Likert
scale)
– correct responses or responses on Likert scale are
summed
– yields interval data that can be interpreted with
reference to norms
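A minimal sketch of cumulative scoring, using a hypothetical answer key, responses, and Likert endorsements:

    # Cumulative-model scoring sketch (hypothetical answer key and responses)
    answer_key = {"q1": "b", "q2": "d", "q3": "a"}
    responses  = {"q1": "b", "q2": "c", "q3": "a"}

    # Ability-style scoring: count correct answers
    ability_score = sum(responses[q] == key for q, key in answer_key.items())

    # Likert-style scoring: sum the endorsed scale values
    likert_responses = [4, 5, 3, 4, 2]
    likert_score = sum(likert_responses)

    print(ability_score, likert_score)   # -> 2 15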
Scoring Methods (cont.)
Categorical model: place test takers in a group
– Reasons a test taker may try to fake placement in a category:
Cry for help
Want to plead insanity in court
Want to avoid the military draft
Want to show psychological damage
– Duplicate items (check for consistent responding):
“I love my mother.”
“I hate my mother.”
– Infrequency scales (flag improbable responses):
“I’ve never had hair on my head.”
“I have not seen a car in 10 years.”
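These checks can be automated; the sketch below is illustrative only (the item pairing, cut-offs, and responses are hypothetical) and flags a protocol when opposite “duplicate” items are endorsed in the same direction or when several infrequency items are endorsed:

    # Illustrative validity checks (hypothetical items, responses, and cut-offs)
    def inconsistent_opposites(r1: int, r2: int, scale_max: int = 5) -> bool:
        """Flag opposite-worded items answered in the same direction on a 1-5 scale."""
        return abs(r1 - (scale_max + 1 - r2)) >= 3

    love_mother, hate_mother = 5, 5              # strongly agreeing with both is inconsistent
    infrequency_endorsed = [True, False, True]   # e.g., "never had hair", "no car in 10 years"

    flag = inconsistent_opposites(love_mother, hate_mother) or sum(infrequency_endorsed) >= 2
    print("flag protocol for review:", flag)     # -> True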
Random Responding
– May occur for several reasons:
People are not motivated to participate
Reading or language difficulties
Do not understand instructions / item content
Too confused or disturbed to respond
appropriately
Piloting and Revising Tests
can’t assume the test will perform as
expected
pilot test scientifically investigates the
test’s reliability and validity
administer test to sample from target
audience
analyze data and revise test to fix any
problems uncovered--many aspects to
consider
Setting Up the Pilot Test
test situation should match actual
circumstances in which test will be used
(e.g., in sample characteristics, setting)
developers must follow the American
Psychological Association’s codes of
ethics (e.g., strict rules of confidentiality
and publish only aggregate results)
Conducting the Pilot Test
depth and breadth depend on the size
and complexity of the target audience
adhere strictly to test procedures
outlined in test administration
instructions
generally require large sample
may ask participants about the testing
experience
Analyzing the Results
can gather both quantitative and
qualitative information
use quantitative information for such
things as item characteristics, internal
consistency, convergent and
discriminant validity, and in some
instances predictive validity
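For the internal-consistency part of that analysis, one common index is Cronbach’s alpha; a minimal sketch with hypothetical pilot data (rows are test takers, columns are items):

    # Cronbach's alpha from hypothetical pilot data (rows = test takers, columns = items)
    from statistics import pvariance

    data = [
        [1, 1, 1, 0],
        [1, 0, 1, 1],
        [0, 0, 1, 0],
        [1, 1, 1, 1],
        [0, 1, 0, 0],
    ]

    k = len(data[0])                                                   # number of items
    item_vars = [pvariance([row[i] for row in data]) for i in range(k)]
    total_var = pvariance([sum(row) for row in data])                  # variance of total scores

    alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
    print(f"Cronbach's alpha = {alpha:.2f}")   # higher values indicate greater internal consistency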
Revising the Test
Choosing the final items requires
weighing each item’s content validity,
item difficulty and discrimination, inter-
item correlation, and bias
when new items need to be added or
items need to be revised, the items
must again be pilot tested to ensure
that the changes produced the desired
results
Validation and Cross-Validation
Validation is the process of obtaining
evidence that the test effectively measures
what it is supposed to measure (i.e.,
reliability and validity)
the first part--establishing content validity--is
carried out as the test is developed; evidence that
the test measures the intended constructs (construct
validity) and predicts an outside criterion is
gathered in subsequent data collection
Validation and Cross-Validation
when the final revision of a test yields
scores with sufficient evidence of reliability
and validity, test developers then conduct
cross-validation--a final round of test
administration to another sample
because of chance factors the reliability
and validity coefficients will likely be
smaller in the new sample--referred to as
shrinkage
Item Analysis
Purpose of Item Analysis
• The process of examining test takers’ responses
to each item. Item analysis is useful in helping test
designers determine which items to keep, modify, or
discard on a given test.
• It gives an indication of the validity of individual items.
• It looks into the difficulty and discriminating
ability of each item, as well as the effectiveness of
each alternative.
• Helpful for distractor analysis, evaluating the
performance of each item, and discriminating between
inferior and superior items.
Power Tests
Speed Tests
• Item Difficulty -- is the exam question (aka “item”)
too easy or too hard? When an item is one that
every student either gets wrong or correct, it
decreases an exam’s reliability.
• If everyone gets a particular answer correct,
there’s no good way to tell who really
understands the material deeply.
• Conversely, if everyone gets a particular answer
incorrect, then there’s no way to differentiate
those who’ve learned the material deeply.
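Item difficulty is commonly summarized as the proportion of test takers who answered the item correctly (the item p-value); a sketch with hypothetical 0/1 scored data:

    # Item difficulty (p-value) per item from hypothetical scored responses (1 = correct)
    scored = [
        [1, 1, 0],
        [1, 0, 0],
        [1, 1, 1],
        [1, 0, 0],
    ]

    n_takers = len(scored)
    difficulty = [sum(row[i] for row in scored) / n_takers for i in range(len(scored[0]))]
    print(difficulty)   # -> [1.0, 0.5, 0.25]; values near 1.0 or 0.0 add little to reliability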
•Item Discrimination -- does the exam
question discriminate between students who
understand the material and those who do
not?
•Desirable discrimination can be shown by
comparing the correct answers to the total
test scores of students--i.e., do students who
scored high overall have a higher rate of
correct answers on the item than those who
scored low overall? If you separate top
scorers from bottom scorers, which group is
getting which answer correct?
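One simple way to make that comparison is a discrimination index D: the item’s proportion correct in the top-scoring group minus the proportion correct in the bottom-scoring group. The sketch below uses hypothetical data and a simple top-half/bottom-half split (the top and bottom roughly 27% are often used instead):

    # Discrimination index D = p(correct | upper group) - p(correct | lower group),
    # using hypothetical scored responses and a simple top-half / bottom-half split
    scored = [  # rows = test takers, columns = items (1 = correct)
        [1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 0, 0],
        [0, 1, 0], [0, 0, 1], [0, 0, 0], [1, 0, 0],
    ]

    ranked = sorted(scored, key=sum, reverse=True)   # best total scores first
    g = len(ranked) // 2
    upper, lower = ranked[:g], ranked[g:]

    for i in range(len(scored[0])):
        p_upper = sum(row[i] for row in upper) / g
        p_lower = sum(row[i] for row in lower) / g
        print(f"item {i + 1}: D = {p_upper - p_lower:+.2f}")

A positive D means high scorers answer the item correctly more often than low scorers, which is the desired pattern.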
•Item Distractors -- for multiple-choice exams,
distractors play a significant role. Do the
distractors effectively draw test takers who do not
know the material away from the correct answer?
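Distractor analysis can be sketched by tallying how often each option is chosen, split by overall performance (the data below are hypothetical, and the keyed correct answer is assumed to be option “b”). A useful distractor draws low scorers, not high scorers:

    # Distractor analysis sketch: option choices split by overall test score (hypothetical data)
    from collections import Counter

    # (option chosen on this item, total test score); the keyed correct answer is "b"
    responses = [("b", 38), ("c", 21), ("b", 35), ("a", 19), ("b", 30),
                 ("d", 17), ("c", 24), ("b", 33), ("a", 15), ("b", 29)]

    median = sorted(score for _, score in responses)[len(responses) // 2]
    high = Counter(opt for opt, score in responses if score >= median)
    low  = Counter(opt for opt, score in responses if score < median)

    print("high scorers:", dict(high))   # mostly the correct option "b"
    print("low scorers: ", dict(low))    # spread across the distractors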