Psy211 Readings

The document outlines the characteristics of a good test, emphasizing design properties such as a clear purpose, standard content, and administration procedures, alongside psychometric properties like reliability and validity. It details methods for obtaining reliability, including test-retest and internal consistency, and discusses validity types such as content, criterion-related, and construct validity. Additionally, it covers item analysis, including item difficulty and discrimination indices, and introduces Item Response Theory (IRT) for evaluating test items based on latent traits.


CHARACTERISTICS OF A GOOD TEST

DESIGN PROPERTIES OF A GOOD TEST (Friedenberg, 1995)


 A clearly defined purpose.
o What is the test supposed to measure?
- knowledge, skills, behaviors, attitudes, and other characteristics
o Who will take the test?
- the format of the test may be varied to suit the test taker (oral, written, pictures, words, manipulations)
o How will the test scores be used?
- determines the appropriateness of different types of test items and test scores
 A specific and standard content.
o Specific - content is specific to the domain to be measured.
o Standard - all test takers are tested on the same attributes or knowledge.
 A set of standard administration procedures.
o Standard conditions are necessary to minimize the effects of irrelevant variables.
 A standard scoring procedure.

PSYCHOMETRIC PROPERTIES OF A GOOD TEST


 Reliability
Refers to the consistency of scores obtained by the same person when retested
with the same test or with an equivalent form of the test on different occasions.
 Validity
Refers to the degree to which a test measures what it is supposed to measure.
 Good Item Statistics
Item Analysis - the process of statistically examining the qualities of each item of the test. It includes the item difficulty index and the discrimination index.
TEST RELIABILITY

 Refers to the accuracy or consistency of measurement: the degree to which test scores are consistent, dependable, repeatable, and free from error or bias.
 Broadly, test reliability indicates the extent to which individual differences in test scores are attributable to "true differences in the characteristics under consideration and the extent to which they are attributable to chance errors." Despite optimum testing conditions, however, no test is a perfectly reliable instrument.

Reliability Coefficient – a numerical index (between .00 and 1.00) of the reliability of an assessment instrument. It is based on the correlation between two independently derived sets of scores.
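
To make this concrete, here is a minimal sketch in Python (with invented score values) that estimates a reliability coefficient as the Pearson correlation between two independently derived sets of scores, as in a test-retest design:

import numpy as np

# Scores of the same eight examinees on two occasions (invented data).
time1 = np.array([12, 18, 25, 30, 22, 15, 28, 20])
time2 = np.array([14, 17, 27, 29, 24, 13, 30, 19])

# The reliability (stability) coefficient is the Pearson correlation
# between the two sets of scores; values near 1.00 indicate consistency.
r = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability coefficient: {r:.2f}")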
General Model of Reliability

Theories of test reliability were developed to estimate the effects of inconsistency on the
accuracy of psychological measurement.

This conceptual breakdown is typically represented by the simple equation

Observed test score = True score + Error of measurement

X = T + E

where X = score on the test
T = true score
E = error of measurement

Errors of measurement represent discrepancies between the scores obtained on tests and the corresponding true scores. Thus,

E = X - T

The goal of reliability theory is to estimate errors in measurement and to suggest ways of improving tests so that errors are minimized.
METHODS OF OBTAINING RELIABILITY

Test-Retest
Procedure: The same test is given twice with a time interval between testings. The error variance corresponds to random fluctuation of performance from one test session to another as a result of uncontrolled testing conditions.
Source of Error: Time sampling
Coefficient: Coefficient of stability
Problems: Memory effect; practice effect; change over time. The practice effect may produce improvement in retest scores, so the correlation between the two tests will be spuriously high. The time interval must be recorded.

Alternate Form or Parallel Form
Procedure: Equivalent tests are given with a time interval between testings; one form of the test is used on the first testing and another comparable form on the second. In developing alternate forms, there is a need to ensure that they are truly parallel.
Source of Error: Item sampling
Coefficient: Coefficient of equivalence and coefficient of stability; consistency of response to different item samples.
Problems: Hard to develop two equivalent tests; may reflect change in behavior over time; the practice effect may tend to reduce the correlation between the two test forms; the degree to which the nature of the test will change with repetition.
Internal Consistency or Split-Half
Procedure: One test is given at one time only. Two scores are obtained by dividing the test into comparable halves (split-half method), and the corrected correlation between the two halves of the test is used. Temporal stability is not a problem because only one test session is involved.
Coefficient: Coefficient of internal consistency; coefficient of equivalence
Problems: Uses shortened forms (split-half); only good if the trait is unitary or homogeneous; gives a high estimate on a speeded test; the correlation gives the reliability of only one half; hard to compute by hand.

Kuder-Richardson Reliability (Inter-item Consistency)
Procedure: Utilizes a single administration of a single form. KR-20 is used for heterogeneous instruments; KR-21 for homogeneous instruments.
Coefficient: Consistency of responses to all items
Problems: Two sources of error: (a) content sampling and (b) heterogeneity of the behavior domain sampled.

Coefficient Alpha or Cronbach's Alpha
Procedure: Appropriate for instruments where the scoring is not dichotomous, such as scales. Takes into consideration the variance of each item.
Coefficient: Consistency of responses to items

Inter-rater or Inter-scorer Reliability
Procedure: Different scorers or observers rate the items or responses independently. Used for free responses.
Source of Error: Observer differences
Coefficient: Consistency of ratings
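
As an illustration of the internal-consistency idea, the following minimal Python sketch computes coefficient alpha from an examinee-by-item score matrix (the data are invented; with dichotomous 0/1 items the same computation reduces to KR-20):

import numpy as np

# Rows are examinees, columns are items (1 = correct, 0 = incorrect;
# Likert-type ratings would work the same way). Invented data.
scores = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
])

k = scores.shape[1]                          # number of items
item_vars = scores.var(axis=0, ddof=1)       # variance of each item
total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores

# alpha = k/(k-1) * (1 - sum of item variances / total-score variance)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Coefficient alpha: {alpha:.2f}")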

 Other things being equal, the longer the test, the more reliable it will be. Lengthening a test, however, will only increase its consistency in terms of content sampling, not its stability over time. The effect that lengthening or shortening a test will have on its reliability coefficient can be estimated by means of the Spearman-Brown formula.

The Spearman-Brown formula is used to correct split-half reliability estimates. It provides a good estimate of what the reliability coefficient would be if the two halves were increased to the original length of the instrument.
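
A minimal sketch of the Spearman-Brown formula, r' = k*r / (1 + (k - 1)*r), where r is the obtained coefficient and k is the factor by which the test is lengthened (k = 2 when correcting a split-half estimate):

def spearman_brown(r: float, k: float = 2.0) -> float:
    # Estimated reliability of a test lengthened by a factor of k,
    # given the reliability r of the original (or half-length) test.
    return (k * r) / (1 + (k - 1) * r)

# Example: a split-half correlation of .70 corrected to full length.
print(f"{spearman_brown(0.70, k=2):.2f}")  # 0.82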

Standard Error of Measurement (Whiston, 2000)

 An estimate of the standard deviation of the normal distribution of scores that would presumably be obtained if a person took the test an infinite number of times.
 It provides a band or range within which a psychologist or counselor can expect a client's "true score" to fall if he or she were to take the instrument over and over again.
 The mean of this hypothetical score distribution is the person's true score on the test. If a client took a test 100 times, we would expect that one of those test scores would be his or her true score.
 Depending on the confidence level that is needed, the standard error of measurement can be used to predict where a score might fall 68%, 95%, or 99.5% of the time.

The formula for calculating the standard error of measurement (SEM) is:

SEM = s√(1 - r)

where s represents the standard deviation and r is the reliability coefficient.
Example: Case of Anne (Whiston, 2000)

Anne took the Graduate Record Examinations Aptitude Test (GRE), an instrument used in selecting and admitting students into graduate programs.

The GRE gives three scores: Verbal (GRE-V), Quantitative (GRE-Q), and Analytical (GRE-A). Scores range from 200 to 800.

Anne's score on the GRE-V is 430. Assume that the mean is 500 and the standard deviation is 100.

The reliability coefficient for the GRE-V is .90 (Educational Testing Service, 1997). Therefore, the standard error of measurement would be

SEM = 100√(1 - .90) = 100√.10 = 100(.32) = 32

We would then add and subtract the standard error of measurement from Anne's score to get the range.

A counselor could then tell Anne that 68% of the time she could expect her GRE-V score to fall between 398 (430 - 32) and 462 (430 + 32).

If we wanted to expand this interpretation, we could use two standard errors of measurement (2 x 32 = 64). In this case, we would say that 95% of the time Anne's score would fall between 366 (430 - 64) and 494 (430 + 64).

If we wanted to further increase the probability of including her true score, we would use three standard errors of measurement (3 x 32 = 96) and conclude that 99.5% of the time her score would fall between 334 (430 - 96) and 526 (430 + 96).
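
The whole computation can be expressed in a few lines; the sketch below reproduces Anne's bands (note that it keeps the unrounded SEM of about 31.6, so the printed bounds differ slightly from the hand-rounded figures above):

import math

def sem(sd: float, reliability: float) -> float:
    # Standard error of measurement: SEM = s * sqrt(1 - r).
    return sd * math.sqrt(1 - reliability)

score, sd, r = 430, 100, 0.90
e = sem(sd, r)  # about 31.6; rounded to 32 in the text

# The 68%, 95%, and 99.5% bands use 1, 2, and 3 SEMs respectively.
for n, level in [(1, "68%"), (2, "95%"), (3, "99.5%")]:
    print(f"{level}: {score - n * e:.0f} to {score + n * e:.0f}")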
Question:

Given this information, how would you help Anne if you were the counselor?

If Anne is applying to a graduate program that only admits students with GRE-V scores
of 600 or higher, what are her chances of being admitted?
TEST VALIDITY
The degree to which a test measures what it purports (what it is supposed) to measure when compared with accepted criteria (Anastasi and Urbina, 1997).

TYPES OF VALIDITY

CONTENT VALIDITY
Purpose/Description: To compare whether the test items match the set of goals and objectives; whether the test items are representative of the defined universe or content domain that they are supposed to measure. The concern is with the test items (content), objectives, and format.
Procedure: Compare the test blueprint with the school, course, or program objectives and goals. Have a panel of experts in the content area (e.g., teachers, professors) examine whether the items represent the defined universe or content domain, and utilize systematic observation of behavior (observe the skills and competencies needed to perform a given task).
Types of Tests: Survey achievement tests; criterion-referenced tests; essential skills tests; minimum-level skills tests; state assessment tests; professional licensing exams; aptitude tests.

CRITERION-RELATED VALIDITY (Concurrent)
Purpose/Description: To predict performance on another measure or to predict an individual's behavior in specified situations. The criterion measure is obtained at the same time as the test scores.
Procedure: Use a rating, an observation, or another test as the criterion. Correlate test scores with a criterion measure obtained at the same time. Example: a test correlated with supervisory ratings of workers' performance, conducted at the same time.
Types of Tests: Aptitude tests; ability tests; personality tests; employment tests; achievement tests; certification tests.

CRITERION-RELATED VALIDITY (Predictive)
Purpose/Description: The criterion measure is to be obtained in the future; the goal is to have test scores accurately predict the identified criterion performance.
Procedure: Correlate test scores with a criterion measure obtained after a period of time. Example: predictive validities of admission tests.
Types of Tests: Scholastic aptitude tests; general aptitude batteries; prognostic tests; readiness tests; intelligence tests.

CONSTRUCT VALIDITY
Purpose/Description: To determine whether a construct exists and to understand the traits or concepts that make up the set of scores or items; the extent to which a test measures a theoretical construct or trait, such as intelligence, mechanical comprehension, or anxiety. A construct is not directly observable but is usually derived from theory, research, or observation. Construct validation involves the gradual accumulation of evidence.
Procedure: Conduct multivariate statistical analyses such as factor analysis, discriminant analysis, and multivariate analysis of variance. Requires evidence that supports the interpretation of test scores in line with the theoretical implications associated with the construct label. The authors should precisely define each construct and distinguish it from other constructs.
Types of Tests: Intelligence tests; aptitude tests; personality tests.
Validity Coefficient – the correlation between the scores on an instrument and the criterion measure.
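
Computationally, a validity coefficient is obtained the same way as a reliability coefficient. A minimal sketch with invented numbers (hypothetical admission-test scores paired with later first-year GPA as the criterion):

import numpy as np

test_scores = np.array([430, 520, 610, 480, 700, 550, 640, 590])
criterion = np.array([2.4, 2.9, 3.4, 2.7, 3.8, 3.0, 3.3, 3.1])  # GPA

# Correlation between test scores and the criterion measure.
validity = np.corrcoef(test_scores, criterion)[0, 1]
print(f"Validity coefficient: {validity:.2f}")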

ITEM ANALYSIS

 A general term for procedures designed to assess the utility or validity of a set of
test items.

• Validity concerns the entire instrument, while item analysis examines the qualities of each item.
• Done during test construction and revision; provides information that can be used to revise or edit problematic items or eliminate faulty items.
Item Difficulty Index
 An index of the easiness or difficulty of an item.
• It reflects the proportion of people getting the item correct, calculated by dividing the number of individuals who answered the item correctly by the total number of examinees.

p = number of examinees who answered the item correctly / total number of examinees

• The item difficulty index can range from .00 (meaning no one got the item correct) to 1.00 (meaning everyone got the item correct).

• Item difficulty actually indicates how easy the item is, because it provides the proportion of individuals who got the item correct.

Example: In a test where 15 of the students in a class of 25 got the first item correct:

p = 15/25 = .60
• the desired item difficulty depends on the purpose of the assessment, the group
taking the instrument, and the format of the item.
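
A minimal sketch of the computation, using the numbers from the example above:

def item_difficulty(n_correct: int, n_total: int) -> float:
    # p = number answering correctly / total number of examinees
    return n_correct / n_total

p = item_difficulty(15, 25)
print(f"p = {p:.2f}")  # .60 -- a moderately easy item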
Item Discrimination Index
 A measure of how effectively an item discriminates between examinees who score high on the test as a whole (or on some other criterion variable) and those who score low (Aiken, 2000).

I. Extreme Group Method

 Examinees are divided into two groups based on high and low scores.

 The index is calculated by subtracting the proportion of examinees in the lower group who got the item correct (or who endorsed the item in the expected manner) from the corresponding proportion in the upper group.

 Item discrimination indices can range from +1.00 (all of the upper group got the item right and none of the lower group did) to -1.00 (none of the upper group got the item right and all of the lower group did).

 The determination of the upper and lower groups will depend on the distribution of scores. If the distribution is normal, use the upper 27% for the upper group and the lower 27% for the lower group (Kelley, 1939). For small groups, Anastasi and Urbina (1997) suggest using the upper and lower 25% to 33%.

 In general, negative item discrimination indices, and particularly small positive indices, are indicators that the item needs to be eliminated or revised.

 The resulting value of D ranges from -1 to +1, with values closer to +1 indicating strong discrimination between high- and low-performing individuals and values closer to 0 indicating poor discrimination. A negative value of D indicates that low-performing individuals performed better on the item than high-performing individuals, which may indicate a problem with the item.
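
A minimal sketch of the extreme-group index, D = p_upper - p_lower, with invented counts:

def discrimination_index(upper_correct: int, upper_n: int,
                         lower_correct: int, lower_n: int) -> float:
    # Difference between the proportions of the upper and lower
    # groups answering the item correctly.
    return upper_correct / upper_n - lower_correct / lower_n

# Example: 20 of 27 upper-group examinees but only 8 of 27
# lower-group examinees get the item right.
d = discrimination_index(20, 27, 8, 27)
print(f"D = {d:.2f}")  # about .44 -- the item discriminates well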

ITEM RESPONSE THEORY (IRT) OR LATENT TRAIT THEORY

• A theory of testing in which item scores are expressed in terms of estimated scores on a latent-ability continuum.
• It rests on the assumption that the performance of an examinee on a test item can be predicted by a set of factors called traits, latent traits, or abilities.
• Using IRT, we get an indication of an individual's performance based not on the total score but on the precise items the person answers correctly.
• It suggests that the relationship between examinees' item performance and the underlying trait being measured can be described by an item characteristic curve.
Item characteristic curve – a graph, used in item analysis, in which the proportion of examinees passing a specified item is plotted against total test scores.
• An item response curve is constructed by plotting the proportion of respondents who gave the keyed response against estimates of their true standing on a unidimensional latent trait or characteristic. An item response curve can be constructed either from the responses of a large group of examinees to an item or, if certain parameters are estimated, from a theoretical model.
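
An empirical item characteristic curve can be traced by simple tabulation. The sketch below simulates responses under a one-parameter (Rasch-like) model and then reports, for one item, the proportion passing within bands of total score (all data are simulated for illustration):

import numpy as np

rng = np.random.default_rng(0)
n_examinees, n_items = 200, 20
ability = rng.normal(size=n_examinees)        # latent trait values
difficulty = np.linspace(-1.5, 1.5, n_items)  # item difficulties

# Simulate 0/1 responses: P(correct) = logistic(ability - difficulty).
prob = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
responses = (rng.random((n_examinees, n_items)) < prob).astype(int)

item = 10                       # the item whose curve we trace
totals = responses.sum(axis=1)  # total test scores
for lo in range(0, n_items, 5): # group examinees into total-score bands
    mask = (totals >= lo) & (totals < lo + 5)
    if mask.any():
        print(f"total {lo}-{lo + 4}: proportion passing = "
              f"{responses[mask, item].mean():.2f}")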
Rasch Model – a one-parameter (item difficulty) model for scaling test items for purposes of item analysis and test standardization.

- The model is based on the assumption that the indices of guessing and item discrimination are negligible parameters. As with other latent trait models, the Rasch model relates examinees' performance on test items (percentage passing) to their estimated standings on a hypothetical latent-ability trait or continuum.

Item Response Theory (IRT) is a statistical modeling framework used to analyze and
interpret responses to test items. IRT assumes that the probability of a person
correctly answering an item is a function of both the person's ability and the
characteristics of the item. In other words, IRT models the relationship between an
individual's ability and the probability of a correct response to an item.

IRT models are used in educational and psychological testing to evaluate the quality
of test items, to estimate individuals' abilities, and to create scoring systems for
tests. Unlike classical test theory, which assumes that the difficulty of a test item is
fixed and independent of the characteristics of the test-takers, IRT models allow for
the estimation of item difficulty and discrimination parameters that are specific to the
item.

IRT models can be used with both dichotomous (e.g., right/wrong) and polytomous (e.g., rating-scale) test items. Some commonly used IRT models include the one-parameter logistic model (also known as the Rasch model), the two-parameter logistic model, and the three-parameter logistic model. These models differ in the number of parameters used to describe the relationship between an individual's ability and the probability of a correct response to an item.
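
A minimal sketch of the three-parameter logistic (3PL) item response function, P(theta) = c + (1 - c) / (1 + e^(-a(theta - b))); setting c = 0 gives the 2PL, and additionally fixing a = 1 gives the one-parameter (Rasch) model:

import math

def irf_3pl(theta: float, a: float, b: float, c: float) -> float:
    # Probability of a correct response at ability theta, given
    # discrimination a, difficulty b, and pseudo-guessing c.
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Example: an item of average difficulty (b = 0), moderate
# discrimination (a = 1.2), and some guessing (c = .20).
for theta in (-2, -1, 0, 1, 2):
    print(f"theta = {theta:+d}: P = {irf_3pl(theta, 1.2, 0.0, 0.20):.2f}")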

IRT models have several advantages over classical test theory, including the ability to estimate individuals' abilities more accurately, to estimate item parameters more precisely, and to create item banks that can be used to construct customized tests for different populations.
