
REPORT OUTLINE

Test Development
● Meaning: it is an umbrella term for all that goes into the process of creating a test.
● Purpose: to help evaluate the test taker with definite and concise findings based on the test being
conducted.
● Introduce the 5 stages
1. Test conceptualization
2. Test construction
3. Test tryout
4. Item analysis
5. Test revision

Test Conceptualization (Moriente)


● This is the stage at which the idea for the test is first conceived
● An emerging social phenomenon or pattern of behavior might serve as the stimulus for the
development of a new test
● The development of a new test may be in response to a need to assess mastery in an emerging
occupation or profession.
● Initial questions:
○ What is the test designed to measure?
○ What is the objective of the test?
○ Is there a need for this test?
○ Who will use this test?
○ Who will take this test?
○ How will the test be administered?
● Reference Testing
1. Norm-referenced test: a test taker's performance is evaluated by comparing it with that of the normative sample
2. Criterion-referenced test: a test taker's score is evaluated against a set standard (criterion)
● Pilot Work: The preliminary research surrounding the creation of the prototype of the test
○ Test items may be piloted to evaluate whether they should be included in the final form of the
instrument
○ Test developer typically attempts to determine how best to measure a targeted construct.

Test construction
● Scaling
- process of setting rules for assigning numbers in measurement
● Types of scales
- age-based scales
- grade-based scales
- stanine scales
- scales based on dimensions (unidimensional vs. multidimensional)
- scales based on comparison or ordering of stimuli (comparative vs. categorical)
● Scaling Methods
1. Rating scale
- a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the test taker

2. Summative scale
- the final score is obtained by summing the ratings across all items; the Likert scale, usually used to scale attitudes, is the best-known example (scored as in the sketch after this list)
For example: It was easy to navigate the website to find what I was looking for.
(1 = Strongly agree, 2 = Agree, 3 = Disagree, 4 = Strongly disagree)
3. Method of Paired Comparisons
- test takers are presented with pairs of stimuli and asked to select one of them according to some rule
For Example: Select the behavior that you think would be more justified:
a. cheating on taxes if one has a chance
b. accepting a bribe in the course of one’s duties
4. Comparative Scaling
- entails judgments of a stimulus in comparison with every other stimulus on the scale
5. Categorical Scaling
- stimuli are placed into one of two or more alternative categories that differ quantitatively along some continuum
- For example, testtakers could be asked to sort cards describing behaviors into three piles:
- those behaviors that are never justified
- those that are sometimes justified, and
- those that are always justified
6. Guttman Scale
- items range sequentially from weaker to stronger expressions of the attitude or belief being measured, so agreement with a stronger statement implies agreement with the milder statements
For example: Do you agree or disagree with each of the following:
a. I do not support any regulations on gun sales to civilian population.
b. I support stricter background checks during the process of gun sales.
c. I support the prohibition of sales of gun bump stocks.
d. I support prohibiting gun sales to mentally ill people.
e. I support prohibition of gun sales to civilians altogether.
- Scalogram Analysis
- graphic mapping of a test taker's responses
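
The following is a minimal Python sketch (not part of the original outline; the response data are hypothetical) showing the summative scoring behind a Likert-type scale and the cumulative response pattern that a Guttman scalogram analysis looks for.

# Minimal sketch with hypothetical data: summative (Likert) scoring and a Guttman pattern check.

likert_responses = [1, 2, 1, 4, 2]  # 1 = Strongly agree ... 4 = Strongly disagree, one value per item

# Summative scoring: the scale score is simply the sum of the ratings across items.
print("Summative (Likert) score:", sum(likert_responses))

# Guttman check: with items ordered from weakest to strongest statement, an ideal
# cumulative pattern is agreement (1) on the weaker items followed only by disagreement (0).
guttman_responses = [1, 1, 1, 0, 0]

def is_cumulative(pattern):
    """Return True if no agreement follows a disagreement (an ideal Guttman pattern)."""
    seen_disagree = False
    for answer in pattern:
        if answer == 0:
            seen_disagree = True
        elif seen_disagree:
            return False
    return True

print("Fits a Guttman (scalogram) pattern:", is_cumulative(guttman_responses))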

● Writing Items
- three questions related to the test blueprint
1. What range of content should the items cover?
2. Which of the many different types of item formats should be employed?
3. How many items should be written in total and for each content area covered?
● Item Pool
- reservoir or well from which items will or will not be drawn for the final version of the test
● Item Format
1. Selected-response format
- require test takers to select a response from a set of alternative responses
- Three types: multiple-choice, matching, and true–false
1.1 Multiple-choice Format
- has three elements (represented in the sketch after this list):
- Stem: the question or stimulus presented to the test taker
- Correct alternative (option): the correct answer
- Distractors (foils): the incorrect alternatives or options
1.2 Matching Item
- The test taker is presented with two columns: premises on the left and responses
on the right
1.3 Binary Item
- True-false item - most familiar binary-choice item
2. Constructed-response format
- requires test takers to supply or create the correct answer rather than select it; has three types:
2.1 Completion Item
For Example:
The standard deviation is generally considered the most useful measure of __________.
2.2 Short-answer item
2.3 Essay Item
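
A minimal sketch (hypothetical, not from the source) of how a selected-response item's elements, stem, correct alternative, and distractors, might be represented as a data structure; the example stem reuses the completion-item wording above.

from dataclasses import dataclass, field

@dataclass
class MultipleChoiceItem:
    stem: str                 # the question or stimulus presented to the test taker
    correct: str              # the correct alternative (the keyed option)
    distractors: list = field(default_factory=list)  # the incorrect alternatives (foils)

    def options(self):
        """All alternatives shown to the test taker (ordering/randomization not handled here)."""
        return [self.correct] + self.distractors

item = MultipleChoiceItem(
    stem="The standard deviation is generally considered the most useful measure of ____.",
    correct="variability",
    distractors=["central tendency", "skewness", "correlation"],
)
print(item.options())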

● Item Bank
- Collection of test questions

● Computerized Adaptive Testing (CAT) - an interactive, computer-administered test-taking process
● Floor Effect - the diminished utility of an assessment tool for distinguishing test takers at the low
end of the ability, trait, or other attribute being measured
● Ceiling Effect - refers to the diminished utility of an assessment tool for distinguishing test takers
at the high end of the ability, trait, or other attribute being measured
● Item Branching - the ability of the computer to tailor the content and order of presentation of test
items on the basis of responses to previous items (a simple branching rule is sketched below)
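
A minimal sketch (hypothetical item pool and branching rule, not from the source) of item branching: the difficulty of the next item depends on whether the previous response was correct. Operational CAT systems typically select items using IRT rather than this simple up/down rule.

# Hypothetical item pool keyed by difficulty level (1 = easiest, 5 = hardest).
item_pool = {level: f"item_at_difficulty_{level}" for level in range(1, 6)}

def next_difficulty(current, was_correct):
    """Branch upward after a correct response, downward after an incorrect one."""
    if was_correct:
        return min(current + 1, 5)  # stay within the hardest level available
    return max(current - 1, 1)      # stay within the easiest level available

responses = [True, True, False, True]  # simulated correctness of successive answers
level = 3                              # start at a middle difficulty
for correct in responses:
    print("Administered:", item_pool[level], "| correct:", correct)
    level = next_difficulty(level, correct)
print("Next item to administer:", item_pool[level])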

● Scoring Items
○ Class Scoring - testtaker responses earn credit toward placement in a particular class or
category with other test takers whose pattern of responses is presumably similar in some way
○ Ipsative Scoring - comparing a testtaker’s score on one scale within a test to another scale
within that same test
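
A minimal sketch (hypothetical scales and scores, not from the source) of the idea behind ipsative scoring: scores are interpreted relative to the same test taker's other scale scores rather than relative to other people.

scale_scores = {"achievement": 32, "affiliation": 18, "autonomy": 27}  # one test taker's scale scores

# Rank the scales within the person; the interpretation is intraindividual,
# e.g., "achievement is stronger than affiliation for this test taker."
ranked = sorted(scale_scores.items(), key=lambda pair: pair[1], reverse=True)
for scale, score in ranked:
    print(f"{scale}: {score}")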
Test tryout
● Purpose of Test Tryout
- Administering the test to a tryout sample of test takers to see how the items perform before the test is finalized.

● Selection of Tryout Participants
- Participants should be similar to the people for whom the test is intended.

● Sample Size for Tryout: an informal rule of thumb is no fewer than 5 subjects, and preferably as many as 10, for each item on the test.
- Phantom Factors: factors that appear to emerge from the data merely as artifacts of too small a sample.

● Test Administration Conditions
- The conditions during the test tryout should be similar to the actual test conditions.

Identifying Good Items
● Characteristics of Good Items
- Reliability: Gives consistent results
- Validity: It measures what it is supposed to measure
- Discriminative: Can tell the difference between high and low scorers.

● Statistical and Qualitative Analysis
- Quantitative: involves statistics such as how many test takers answer each item correctly or incorrectly.
- Qualitative: experts review the questions more thoughtfully, for example for clarity, content, and fairness.

● Conclusion of Test Tryout and Item Analysis
- These practices help ensure that the final test accurately measures what it is supposed to measure and
effectively differentiates between high and low scorers.

Item analysis
● Discuss item analysis and the tools test developers use
★ Item Analysis
- refers to the process of examining test takers' responses to each item in the test.
● The tools test developers use include:
★ An index of the item's difficulty
- the proportion of test takers (often computed from the combined upper- and lower-scoring groups) who answered the item correctly (see the sketch after this list).
- For maximum discrimination among the abilities of the test takers, the optimal average item difficulty is approximately .5, with individual items on the test ranging in difficulty from about .3 to .8.
★ An index of the item's reliability
- an indication of the internal consistency of a test; the higher this index, the greater the test's internal consistency. This index is equal to the product of the item-score standard deviation (s) and the correlation (r) between the item score and the total test score.
- Factor analysis and inter-item consistency
- A statistical tool useful in determining whether items on a test appear to be measuring
the same thing(s) is factor analysis.
★ An index of the item's validity
- a statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure. The higher the item-validity index, the greater the test's criterion-related validity. The item-validity index can be calculated once the following two statistics are known:
- the item-score standard deviation, and
- the correlation between the item score and the criterion score.
- The item-score standard deviation of item 1 (denoted by the symbol s1) can be calculated from the item's difficulty (p1) using the formula s1 = √(p1(1 − p1)).

★ An index of item discrimination
- Measures of item discrimination indicate how adequately an item separates or discriminates between high scorers and low scorers on an entire test.
- Analysis of item alternatives: the quality of each alternative within a multiple-choice item can be readily assessed with reference to the comparative performance of upper and lower scorers.
★ Item-Characteristic Curves
- a graphic representation of item difficulty and discrimination; the curve plots the probability of a correct response as a function of the test taker's level on the trait being measured.
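
A minimal sketch (hypothetical response data; upper and lower thirds are assumed for the discrimination index, not specified by the source) showing how the difficulty, reliability, and discrimination statistics above could be computed for one dichotomously scored item.

import math

# 1 = correct, 0 = incorrect for ten hypothetical test takers on one item,
# alongside their total test scores.
item_scores  = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
total_scores = [48, 45, 44, 30, 41, 28, 39, 25, 27, 42]

n = len(item_scores)

# Item-difficulty index: proportion of test takers answering the item correctly.
p = sum(item_scores) / n

# Item-score standard deviation: s = sqrt(p * (1 - p)) for a 0/1-scored item.
s = math.sqrt(p * (1 - p))

# Item-total correlation (Pearson r between item score and total test score).
mean_item, mean_total = p, sum(total_scores) / n
cov = sum((i - mean_item) * (t - mean_total) for i, t in zip(item_scores, total_scores)) / n
sd_total = math.sqrt(sum((t - mean_total) ** 2 for t in total_scores) / n)
r_item_total = cov / (s * sd_total)

# Item-reliability index: item-score standard deviation times the item-total correlation.
item_reliability = s * r_item_total

# Item-discrimination index d: proportion correct in the upper-scoring group minus
# the proportion correct in the lower-scoring group (upper/lower thirds here).
ranked = sorted(zip(total_scores, item_scores), reverse=True)
k = n // 3
upper = [item for _, item in ranked[:k]]
lower = [item for _, item in ranked[-k:]]
d = sum(upper) / k - sum(lower) / k

print(f"difficulty p = {p:.2f}, item SD s = {s:.2f}")
print(f"item-total r = {r_item_total:.2f}, item-reliability index = {item_reliability:.2f}")
print(f"discrimination d = {d:.2f}")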
Other Considerations in Item Analysis
★ Guessing
- In achievement testing, the problem of how to handle testtaker guessing is one that has
eluded any universally acceptable solution.
- The following are three criteria that any correction for guessing must meet (one traditional correction formula is sketched after this list):
1. A correction for guessing should acknowledge that guessing on achievement tests is not random but is based on subject knowledge and the ability to rule out distractors, with individual knowledge varying across items.
2. A correction for guessing must deal with omitted items: should they be scored as incorrect, excluded from the analysis, or scored as if the testtaker had guessed randomly?
3. Some testtakers may be luckier in guessing correct choices, and any correction for guessing may underestimate or overestimate the effects of guessing for lucky and unlucky testtakers.
★ Item fairness
- Just as we may speak of biased tests, we may speak of biased test items.
★ Speed tests
- Item analyses of tests taken under speed conditions yield misleading or uninterpretable
results.
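
For illustration only (the outline above notes that no correction for guessing is universally accepted), a sketch of the traditional correction formula, R − W/(k − 1), where R is the number right, W the number wrong, and k the number of options per item; omitted items are simply left out.

def corrected_score(num_right, num_wrong, options_per_item):
    """Traditional correction for guessing: R - W / (k - 1); omitted items are not counted."""
    return num_right - num_wrong / (options_per_item - 1)

# Hypothetical example: 40 right, 12 wrong, 8 omitted on a 4-option multiple-choice test.
print(corrected_score(num_right=40, num_wrong=12, options_per_item=4))  # prints 36.0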
Qualitative item analysis
- In contrast to statistically based procedures, qualitative methods involve exploration of the issues
through verbal means such as interviews and group discussions conducted with testtakers and other
relevant parties.
★ “Think aloud” test administration
- On a one-to-one basis with an examiner, examinees are asked to take a test, thinking
aloud as they respond to each item. If the test is designed to measure achievement,
such verbalizations may be useful in assessing not only if certain students (such as low
or high scorers on previous examinations) are misinterpreting a particular item but also
why and how they are misinterpreting the item.
Test Revision
- Action taken to modify a test’s content or format for the purpose of improving the test’s effectiveness as a tool
of measurement.
1. Characterize each item according to its strengths and weaknesses
2. Test developers may find that they must balance various strengths and weaknesses across items
3. Administer the revised test under standardized conditions to a second appropriate sample of examinees.

Some of the issues surrounding the development of a new edition of an existing test
1. Stimulus materials look dated and current testtakers cannot relate to them.
2. The verbal content of the test is not readily understood by current testtakers.
3. Certain words or expressions in the test items or directions may be perceived as inappropriate or even offensive to a particular group.
4. The test norms are no longer adequate as a result of age-related shifts in the abilities measured over time.

Cross-validation is the revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion.
Validity shrinkage refers to the reduction in the validity coefficients of a test when it is administered to a
different sample from the one used for initial test validation.
Test validation is evaluating the effectiveness of a test in measuring what it is supposed to measure.
Co-norming: co-validation (test validation conducted on two or more tests using the same sample of testtakers) is referred to as co-norming when it is used in conjunction with the creation of norms or the revision of existing norms.

Quality Assurance
Anchor protocol is a test protocol scored by a highly authoritative scorer that serves as a model for scoring and as a mechanism for resolving scoring discrepancies.
Scoring drift involves changes over time in the way scores are assigned or interpreted.

The use of IRT and revising tests

IRT information curves can help test developers evaluate how well an individual item (or the entire test) is working to measure different levels of the underlying construct. Three uses of IRT in test revision include:
(1) evaluating existing tests for the purpose of mapping test revisions
(2) determining measurement equivalence across testtaker populations
(3) developing item banks.
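
A minimal sketch (hypothetical item parameters, not from the source) of the computation behind an IRT information curve, using the two-parameter logistic model, where an item's information at ability level theta is a^2 * P(theta) * (1 - P(theta)).

import math

def probability_correct(theta, a, b):
    """Two-parameter logistic (2PL) model: probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Information the item provides about test takers located at ability theta."""
    p = probability_correct(theta, a, b)
    return a ** 2 * p * (1 - p)

# Hypothetical item: discrimination a = 1.2, difficulty b = 0.5; information peaks near theta = b.
for theta in (-2, -1, 0, 1, 2):
    print(f"theta = {theta:+d}: information = {item_information(theta, 1.2, 0.5):.3f}")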

Determining measurement equivalence across testtaker populations

Differential Item Functioning (DIF): in IRT, refers to the phenomenon in which items on a test have different properties for different groups of test takers, even when those groups have the same underlying ability level.

Item bank is a valuable resource for efficient and effective test development. It's a collection of test items
stored in a database, categorized and tagged for easy retrieval.
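
A minimal sketch (hypothetical items and tags, not from the source) of an item bank as a tagged collection from which items can be retrieved by content area or format.

item_bank = [
    {"id": 1, "stem": "Define the standard deviation.", "tags": {"statistics", "short-answer"}},
    {"id": 2, "stem": "The mean is a measure of ____.", "tags": {"statistics", "completion"}},
    {"id": 3, "stem": "Describe a norm-referenced test.", "tags": {"test-development", "essay"}},
]

def retrieve(bank, tag):
    """Return every item in the bank carrying the requested tag."""
    return [item for item in bank if tag in item["tags"]]

for item in retrieve(item_bank, "statistics"):
    print(item["id"], item["stem"])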
