Psy211 Readings
Theories of test reliability were developed to estimate the effects of inconsistency on the
accuracy of psychological measurement.
T = True score
E = Error of measurement
X = Obtained (observed) score
Errors of measurement represent discrepancies between the scores obtained on tests and the corresponding true scores. Thus,
E = X - T
Other things being equal, the longer the test, the more reliable it will be.
Lengthening a test, however, will increase only its consistency in terms of content
sampling, not its stability over time. The effect that lengthening or shortening a test will
have on its reliability coefficient can be estimated by means of the Spearman-Brown formula.
The Spearman-Brown formula is used to correct split-half reliability estimates.
- It provides a good estimate of what the reliability coefficient would be if the two
halves were increased to the original length of the instrument.
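To make the correction concrete, here is a minimal Python sketch of the Spearman-Brown formula; the function name and the split-half value of .70 are illustrative assumptions, not figures from the readings:

def spearman_brown(r, k):
    """Predicted reliability when test length is multiplied by a factor k."""
    return (k * r) / (1 + (k - 1) * r)

# Correcting a split-half estimate: each half is half the full test,
# so the full-length reliability is obtained with k = 2.
r_half = 0.70  # hypothetical correlation between the two halves
print(round(spearman_brown(r_half, k=2), 2))  # 0.82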
The formula for calculating the standard error of measurement (SEM) is:
SEM = s√(1 - r)
where s represents the standard deviation and r is the reliability coefficient.
Example: The case of Anne (Whiston, 2000)
Anne took the Graduate Record Examinations Aptitude Test (GRE), an instrument used in
selecting and admitting students to graduate programs.
The GRE gives three scores: Verbal (GRE-V), Quantitative (GRE-Q), and Analytical (GRE-A).
Scores range from 200 to 800.
Anne's score on the GRE-V is 430.
The reliability coefficient for the GRE-V is .90 and its standard deviation is 100
(Educational Testing Service, 1997). Therefore, the standard error of measurement would be
SEM = 100√(1 - .90) = 100√.10 = 100(.32) = 32.
We would then add and subtract the standard error of measurement to and from Anne's
score to get the range.
A counselor could then tell Anne that 68% of the time her GRE-V score would be
expected to fall between 398 (430 - 32) and 462 (430 + 32).
For a wider band, we add and subtract two standard errors (2 x 32 = 64); in this case, we
would say that 95% of the time Anne's score would fall between 366 (430 - 64) and 494
(430 + 64).
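Whiston's worked example can be reproduced with a short Python sketch; the rounding of the SEM to 32 follows the text, and the variable names are illustrative:

import math

def sem(s, r):
    """Standard error of measurement: SEM = s * sqrt(1 - r)."""
    return s * math.sqrt(1 - r)

score = 430
error = round(sem(s=100, r=0.90))            # 100 * sqrt(.10) ~ 32
print(score - error, score + error)          # 68% range: 398 462
print(score - 2 * error, score + 2 * error)  # 95% range: 366 494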
Given this information, how would you help Anne if you were the counselor?
If Anne is applying to a graduate program that admits only students with GRE-V scores
of 600 or higher, what are her chances of being admitted?
TEST VALIDITY
The degree to which a test measures what it purports (what it is supposed) to measure
when compared with accepted criteria (Anastasi and Urbina, 1997).
TYPES OF VALIDITY

CRITERION-RELATED (Predictive)
- The criterion measure is to be obtained in the future; the goal is to have test scores
that accurately predict criterion performance.
- Procedure: correlate test scores with a criterion measure obtained after a period of
time (e.g., the predictive validities of admission tests).
- Typical instruments: scholastic aptitude tests, general aptitude batteries, prognostic
tests, readiness tests, intelligence tests.
ITEM ANALYSIS
A general term for procedures designed to assess the utility or validity of a set of
test items.
• Validity concerns the entire instrument, while item analysis examines the qualities
of each item.
• Done during test construction and revision; it provides information that can be used
to revise or edit problematic items or to eliminate faulty items.
Item Difficulty Index
An index of the easiness or difficulty of an item.
• It reflects the proportion of people getting the item correct, calculated by dividing
the number of individuals who answered the item correctly by the total number of
people.
• The item difficulty index can range from .00 (meaning no one got the item correct) to
1.00 (meaning everyone got the item correct).
• Item difficulty actually indicates how easy the item is, because it gives the
proportion of individuals who got the item correct.
Example: In a test where 15 of the students in a class of 25 got the first item on
the test correct:
p = 15/25 = .60
• The desired item difficulty depends on the purpose of the assessment, the group
taking the instrument, and the format of the item.
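The difficulty calculation is a single proportion, as this minimal Python sketch shows (the function name is an illustrative assumption):

def item_difficulty(num_correct, num_examinees):
    """Proportion of examinees who answered the item correctly."""
    return num_correct / num_examinees

print(item_difficulty(15, 25))  # 0.6, matching the classroom example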
Item Discrimination Index
A measure of how effectively an item discriminates between examinees who score
high on the test as a whole (or on some other criterion variable) and those who
score low (Aiken, 2000).
Calculated by subtracting the proportion of examinees in the lower group who got the
item correct (or who endorsed the item in the expected manner) from the
corresponding proportion in the upper group.
Item discrimination indices can range from +1.00 (all of the upper group and none of
the lower group got it right) to -1.00 (none of the upper group and all of the lower
group got it right).
The determination of the upper and lower groups depends on the distribution of
scores. With a normal distribution, use the top 27% of scorers as the upper group and
the bottom 27% as the lower group (Kelley, 1939). For small groups, Anastasi and
Urbina (1997) suggest using anywhere from the upper and lower 25% to 33%.
In general, negative item discrimination indices, and particularly small positive
indices, are indicators that the item needs to be eliminated or revised.
The resulting value, conventionally labeled D, therefore ranges from -1.00 to +1.00,
with values closer to 1 indicating strong discrimination between high- and
low-performing individuals and values closer to 0 indicating poor discrimination. A
negative value of D means that low-performing individuals performed better on the
item than high-performing individuals, which may indicate a problem with the item.
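A short Python sketch of the D computation follows; the 27% cutoff mirrors Kelley (1939), while the example scores and the use of round() for the group size are illustrative assumptions:

def discrimination_index(examinees, fraction=0.27):
    """D = p(upper) - p(lower), comparing the top and bottom scorers.

    `examinees` is a list of (total_test_score, item_score) pairs,
    where item_score is 1 if the item was answered correctly, else 0.
    """
    ranked = sorted(examinees, key=lambda e: e[0], reverse=True)
    n = max(1, round(len(ranked) * fraction))  # size of each extreme group
    p_upper = sum(item for _, item in ranked[:n]) / n
    p_lower = sum(item for _, item in ranked[-n:]) / n
    return p_upper - p_lower

# Hypothetical data: ten examinees' total scores and item responses.
examinees = [(95, 1), (90, 1), (88, 0), (70, 1), (65, 0),
             (60, 1), (55, 0), (40, 0), (35, 1), (30, 0)]
print(round(discrimination_index(examinees), 2))  # 0.33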
ITEM RESPONSE THEORY (IRT)
• A theory of testing in which item scores are expressed in terms of estimated scores
on a latent-ability continuum.
• It rests on the assumption that an examinee's performance on a test item can be
predicted by a set of factors called traits, latent traits, or abilities.
• Using IRT, we get an indication of an individual's performance based not on the
total score, but on the precise items the person answers correctly.
• It suggests that the relationship between examinees' item performance and the
underlying trait being measured can be described by an item characteristic curve.
Item characteristic curve. A graph, used in item analysis, in which the proportion of
examinees passing a specified item is plotted against total test scores.
• An item response curve is constructed by plotting the proportion of respondents
who gave the keyed response against estimates of their true standing on a
unidimensional latent trait or characteristic. An item response curve can be
constructed either from the responses of a large group of examinees to an item
or, if certain parameters are estimated, from a theoretical model.
Rasch Model – a one-parameter (item difficulty) model for scaling test items for
purposes of item analysis and test standardization.
- The model is based on the assumption that indexes of guessing and item
discrimination are negligible parameters. As with other latent trait models, the
Rasch model relates examinees’ performances on test items (percentage
passing) to their estimated standings on a hypothetical latent-ability trait or
continuum.
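The Rasch model's one-parameter form can be expressed compactly in its standard logistic formulation; in the Python sketch below, the function name and the ability values traced are illustrative:

import math

def rasch_probability(theta, b):
    """One-parameter (Rasch) model: probability of a correct response,
    given ability theta and item difficulty b on the same logit scale."""
    return 1 / (1 + math.exp(-(theta - b)))

# Tracing an item characteristic curve for an item of difficulty b = 0:
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(rasch_probability(theta, b=0), 2))
# -2 0.12 | -1 0.27 | 0 0.5 | 1 0.73 | 2 0.88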
Item Response Theory (IRT) is a statistical modeling framework used to analyze and
interpret responses to test items. IRT assumes that the probability of a person
correctly answering an item is a function of both the person's ability and the
characteristics of the item. In other words, IRT models the relationship between an
individual's ability and the probability of a correct response to an item.
IRT models are used in educational and psychological testing to evaluate the quality
of test items, to estimate individuals' abilities, and to create scoring systems for
tests. Unlike classical test theory, which assumes that the difficulty of a test item is
fixed and independent of the characteristics of the test-takers, IRT models allow for
the estimation of item difficulty and discrimination parameters that are specific to the
item.
IRT models can be used with both dichotomous (e.g., yes/no, right/wrong) and
polytomous (more than two response categories) test items. Some commonly used
IRT models include the one-parameter
logistic model (also known as the Rasch model), the two-parameter logistic model,
and the three-parameter logistic model. These models differ in the number of
parameters used to describe the relationship between an individual's ability and the
probability of a correct response to an item.
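The three models differ only in which item parameters are left free, as this sketch illustrates (the specific parameter values are hypothetical):

import math

def three_pl(theta, a=1.0, b=0.0, c=0.0):
    """Three-parameter logistic model: a = discrimination, b = difficulty,
    c = pseudo-guessing (lower asymptote). Fixing c = 0 gives the 2PL;
    also fixing a = 1 reduces it to the 1PL/Rasch model."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# A four-option multiple-choice item might have c near 0.25:
print(round(three_pl(theta=0.0, a=1.5, b=0.5, c=0.25), 2))  # 0.49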
IRT models have several advantages over classical test theory, including the ability
to estimate individuals' abilities more accurately, the ability to estimate item
parameters more precisely, and the ability to create item banks that can be used to
construct customized tests for different populations.