Reporting - Test Development
Test Development
● Meaning: an umbrella term for all that goes into the process of creating a test.
● Purpose: to evaluate the testtaker with definite and concise findings based on the test being
conducted.
● The five stages of test development:
1. Test conceptualization
2. Test construction
3. Test tryout
4. Item analysis
5. Test revision
Test construction
● Scaling
- process of setting rules for assigning numbers in measurement
● Types of scale
- age based
- grade based
- stanine scale
- scale based on dimensions
- scale based on comparison, sequence
● Scaling Methods
1. Rating scale
- a grouping of words, statements, or symbols on which the testtaker indicates the strength of a particular trait, attitude, or emotion
2. Summative scale
- Likert scale - usually to scale attitudes
For example: It was easy to navigate the website to find what I was looking for.
(1 = Strongly agree, 2 = Agree, 3 = Disagree, 4 = Strongly disagree)
3. Method of Paired Comparisons
For Example: Select the behavior that you think would be more justified:
a. cheating on taxes if one has a chance
b. accepting a bribe in the course of one’s duties
4. Comparative Scaling
- entails judgments of a stimulus in comparison with every other stimulus on the scale
5. Categorical Scaling
- Testtakers sort stimuli into categories; for example, cards describing behaviors might be sorted into three piles:
- those behaviors that are never justified
- those that are sometimes justified, and
- those that are always justified
6. Guttman Scale
For example: Do you agree or disagree with each of the following:
a. I do not support any regulations on gun sales to civilian population.
b. I support stricter background checks during the process of gun sales.
c. I support the prohibition of sales of gun bump stocks.
d. I support prohibiting gun sales to mentally ill people.
e. I support prohibition of gun sales to civilians altogether.
- Scalogram Analysis
- graphic mapping of a testtaker's responses (see the sketch after this list)
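A Guttman scale is cumulative: agreeing with a more extreme statement implies agreeing with every milder one, and scalogram analysis checks how closely real response patterns fit that ideal. Below is a minimal Python sketch; the item ordering, the 1/0 coding, the error-counting convention, and the sample responses are all illustrative assumptions, not a procedure from the source.

```python
# Minimal sketch of scalogram analysis for a Guttman scale.
# Items are assumed ordered from least to most extreme; 1 = agree, 0 = disagree.

def guttman_errors(pattern):
    """Count deviations from a perfect cumulative pattern.

    In a perfect Guttman pattern, agreement with a more extreme item
    implies agreement with every milder item, so responses should look
    like 1, 1, ..., 1, 0, ..., 0.
    """
    n_agree = sum(pattern)
    ideal = [1] * n_agree + [0] * (len(pattern) - n_agree)
    return sum(1 for got, want in zip(pattern, ideal) if got != want)

def coefficient_of_reproducibility(patterns):
    """1 - (total errors / total responses); >= .90 is the usual cutoff."""
    total_errors = sum(guttman_errors(p) for p in patterns)
    total_responses = sum(len(p) for p in patterns)
    return 1 - total_errors / total_responses

# Hypothetical responses to five items like those above (mildest first).
responses = [
    [1, 1, 1, 0, 0],  # perfect cumulative pattern
    [1, 1, 0, 0, 0],  # perfect cumulative pattern
    [1, 0, 1, 0, 0],  # deviates from the cumulative ideal (2 errors)
]
print(coefficient_of_reproducibility(responses))  # ≈ 0.87
```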
● Writing Items
- three questions related to the test blueprint
1. What range of content should the items cover?
2. Which of the many different types of item formats should be employed?
3. How many items should be written in total and for each content area covered?
● Item Pool
- reservoir or well from which items will or will not be drawn for the final version of the test
● Item Format
1. Selected-response format
- require test takers to select a response from a set of alternative responses
- Three types: multiple-choice, matching, and true–false
1.1 Multiple choice Format
- three elements:
- Stem - the stimulus (the question or statement)
- Correct alternative - the option that correctly answers the item
- Distractors (foils) - the incorrect alternatives or options
1.2 Matching Item
- The test taker is presented with two columns: premises on the left and responses
on the right
1.3 Binary Item
- True-false item - most familiar binary-choice item
2. Constructed-response format
- Has three types:
2.1 Completion Item
For Example:
The standard deviation is generally considered the most useful measure of __________.
2.2 Short-answer item
2.3 Essay Item
● Item Bank
- Collection of test questions
● Scoring Items
○ Class Scoring - testtaker responses earn credit toward placement in a particular class or
category with other test takers whose pattern of responses is presumably similar in some way
○ Ipsative Scoring - comparing a testtaker’s score on one scale within a test to another scale
within that same test
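To make the ipsative idea concrete, here is a minimal Python sketch (the scale names and scores are invented): each scale score is compared with the testtaker's own mean across scales, rather than with other testtakers.

```python
# Minimal sketch of an ipsative interpretation: scores are compared
# within one person, not across people. Scale names/scores are hypothetical.

def ipsative_profile(scale_scores):
    """Return each scale's deviation from the person's own mean score."""
    mean = sum(scale_scores.values()) / len(scale_scores)
    return {scale: score - mean for scale, score in scale_scores.items()}

scores = {"verbal": 14, "numerical": 10, "spatial": 12}
print(ipsative_profile(scores))
# {'verbal': 2.0, 'numerical': -2.0, 'spatial': 0.0}
# Read as: relatively stronger on verbal than numerical *within this person*.
```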
Test tryout
● Purpose of Test Tryout
- To administer the test to a sample of testtakers representative of the population for whom the
test is intended, under conditions that mirror the actual test administration.
● Sample Size for Tryout: an informal rule of thumb is no fewer than 5 and preferably as many as 10
subjects per item on the test (so a 30-item tryout would call for roughly 150-300 testtakers).
- Phantom Factors: spurious factors that emerge from the analysis merely because the tryout sample
was too small.
Item analysis
● Item analysis and the tools test developers use
★ Item Analysis
- the process of examining testtakers' responses to each item on the test.
● The tools test developers are:
★ An index of the item’s difficulty
- the proportion of testtakers who answered the item correctly (often estimated from the
upper- and lower-scoring groups); a computational sketch follows this list.
- For maximum discrimination among the abilities of the test takers, the optimal average
item difficulty is approximately .5, with individual items on the test ranging in difficulty
from about .3 to .8.
★ An index of the item’s reliability
- an indication of the internal consistency of a test; the higher this index, the greater the
test’s internal consistency. This index is equal to the product of the item-score standard
deviation (s) and the correlation (r) between the item score and the total test score.
- Factor analysis and inter-item consistency
- A statistical tool useful in determining whether items on a test appear to be measuring
the same thing(s) is factor analysis.
★ An index of the item’s validity
- a statistic designed to provide an indication of the degree to which a test is measuring
what it purports to measure. The higher the item-validity index, the greater the test’s
criterion-related validity. The item-validity index can be calculated once the following two
statistics are known:
- the item-score standard deviation of item 1 (denoted s1), which can be calculated from
the item’s difficulty (p1) with the formula s1 = √(p1(1 − p1)); and
- the correlation (r) between the item score and the criterion score.
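A minimal Python sketch of the three indexes above, assuming dichotomous items scored 1 (correct) or 0 (incorrect); the pearson_r helper and all data are illustrative, and the difficulty index is computed over all testtakers for simplicity.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def item_stats(item_scores, total_scores, criterion_scores):
    p = sum(item_scores) / len(item_scores)   # item-difficulty index
    s = sqrt(p * (1 - p))                     # item-score standard deviation
    reliability_index = s * pearson_r(item_scores, total_scores)
    validity_index = s * pearson_r(item_scores, criterion_scores)
    return p, s, reliability_index, validity_index

# Hypothetical scores for one item across six testtakers.
item      = [1, 1, 0, 1, 0, 1]
totals    = [48, 45, 30, 41, 28, 50]   # total test scores
criterion = [52, 47, 33, 44, 30, 55]   # external criterion measure
print(item_stats(item, totals, criterion))
```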
Some of the issues surrounding the development of a new edition of an existing test:
1. Stimulus materials look dated and current testtakers cannot relate to them.
2. The verbal content of the test is not readily understood by current testtakers.
3. Certain words or expressions in the test items or directions may be perceived as inappropriate or
even offensive to a particular group.
4. The test norms are no longer adequate as a result of age-related shifts in the abilities measured
over time.
Cross-validation is the revalidation of a test on a sample of testtakers other than those on whom test
performance was originally found to be a valid predictor of some criterion.
Validity shrinkage refers to the decrease in a test's validity coefficient that typically occurs when the
test is administered to a different sample from the one used for the initial test validation.
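The sketch below illustrates validity shrinkage by simulation (an illustration, not a procedure from the source): regression weights derived on one sample capitalize on chance, so the validity coefficient typically drops when the same weights are applied to a fresh sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_sample(n, n_items=10):
    """Generate item scores and a criterion that depends weakly on two items."""
    items = rng.normal(size=(n, n_items))
    criterion = 0.4 * items[:, 0] + 0.3 * items[:, 1] + rng.normal(size=n)
    return items, criterion

# Derive scoring weights on a small validation sample...
X_a, y_a = simulate_sample(n=40)
weights, *_ = np.linalg.lstsq(X_a, y_a, rcond=None)

# ...then apply the same weights to a new sample (cross-validation).
X_b, y_b = simulate_sample(n=40)
r_original = np.corrcoef(X_a @ weights, y_a)[0, 1]
r_crossval = np.corrcoef(X_b @ weights, y_b)[0, 1]
print(f"validity on derivation sample: {r_original:.2f}")
print(f"validity on new sample:        {r_crossval:.2f}")  # usually smaller
```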
Test validation is evaluating the effectiveness of a test in measuring what it is supposed to measure.
Co-norming is the norming of more than one test using the same sample of testtakers; the term is used in
conjunction with the creation of new norms or the revision of existing norms.
Quality Assurance
Anchor protocol is a test protocol scored by a highly authoritative scorer that serves as a model for
scoring and as a mechanism for resolving scoring discrepancies.
Scoring drift is a discrepancy between the scoring on an anchor protocol and the scoring of another
protocol; it reflects changes over time in the way scores are assigned or interpreted.
Item bank is a valuable resource for efficient and effective test development. It's a collection of test items
stored in a database, categorized and tagged for easy retrieval.
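A minimal sketch of an item bank as tagged records; the field names and items are hypothetical, and a production bank would live in a real database.

```python
# Items stored as tagged records so they can be retrieved by any field.
item_bank = [
    {"id": 1, "stem": "The standard deviation is a measure of ____.",
     "format": "completion", "content_area": "descriptive statistics",
     "difficulty": 0.55},
    {"id": 2, "stem": "True or false: a distribution can be bimodal.",
     "format": "binary", "content_area": "descriptive statistics",
     "difficulty": 0.80},
]

def retrieve(bank, **tags):
    """Return items whose fields match every requested tag."""
    return [item for item in bank
            if all(item.get(k) == v for k, v in tags.items())]

print(retrieve(item_bank, format="completion"))
```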