Objectives
E.g.
If the results are to be used as a measure of students’ reading skills
our interpretations should be based on evidence that the scores actually reflect
reading skills
and are not affected by irrelevant factors, such as vocabulary or linguistic complexity
We get similar scores when different teachers independently rate student
performances on the same assessment task
a high degree of reliability from one rater to another
3/20/2025
6 Assessment procedure should
Be economical in terms of time and money
Be easily administered
Be easily scored
Produce results that can be accurately interpreted
7 Nature of validity
Validity
The appropriateness of the interpretation and use of the results
A matter of degree
it does not exist on an all-or-none basis (high validity, low validity)
Specific to some particular use or interpretation for a specific population of test
takers
No assessment is valid for all purposes
When indicating computational skill
the mathematics test may have a high degree of validity for 3rd and 4th
graders but a low degree of validity for 2nd and 5th graders
A reading test
may have high validity for skimming and scanning and low validity for
inferencing
Necessary to consider the specific interpretation or use to be made of the results
8 Major considerations in assessment validation
Content
The assessment content and specifications from which it was derived
Construct
The nature of the characteristics being measured
Assessment-criterion relationships
The relation of the assessment results to other measures
Consequences
The consequences of the uses and interpretations of the results
9 Content
How an individual performs on a domain of tasks that the assessment is supposed
to represent
E.g. knowledge of 200 words
we select 20 words and generalize performance on them to knowledge of all 200
11 Content
Assessment development to enhance validity
Table of specifications
Subject-matter content (topics to be learned)
Instructional objectives (types of performance)
12 Content
Assessment development to enhance validity
The percentages in the table indicate
the relative degree of emphasis that each content area and each instructional
objective is to be given in the test
13 Content
Table of specifications
The specifications should be in harmony with what was taught
The weights assigned in the table reflect the emphasis that was given during
instruction
The more closely the questions match the specified sample
the more valid the measure of student learning
It can be used in selecting tests that publishers prepare
How well do they match with our table of specifications?
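The weighting logic of a table of specifications can be sketched in code. The content areas, objectives, and percentages below are invented for illustration, not taken from any actual test plan; the point is only that cell weights sum to 100% and translate into item counts.

```python
# Sketch of a table of specifications: illustrative weights only.
# Rows are content areas, columns are instructional objectives;
# each cell holds the percentage of test emphasis.

spec = {
    "Vocabulary": {"Knows terms": 10, "Comprehends": 10, "Applies": 5},
    "Grammar":    {"Knows terms": 10, "Comprehends": 15, "Applies": 10},
    "Reading":    {"Knows terms": 5,  "Comprehends": 20, "Applies": 15},
}

total_items = 40  # planned test length (hypothetical)

# The percentages across all cells should sum to 100.
assert sum(sum(row.values()) for row in spec.values()) == 100

# Translate each cell's weight into a number of items.
for area, objectives in spec.items():
    for objective, pct in objectives.items():
        n_items = round(total_items * pct / 100)
        print(f"{area:12s} {objective:12s} {pct:3d}% -> {n_items} items")
```

A published test can be checked against the same grid: the closer its item counts come to these cell targets, the better it matches the planned emphasis.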
14 Construct
Is the test actually measuring the construct it claims it is measuring?
A construct is an individual characteristic or an abstract theoretical concept
assumed to exist to explain some aspect of behavior
Reading comprehension, inferencing, speaking proficiency, intelligence,
creativity, anxiety, mathematical reasoning, etc.
These are called constructs because they are theoretical constructions that are used
to explain performance on an assessment
15 Construct
Construct validation
the process of determining if the performance on an assessment can be
interpreted in terms of a construct(s)
Two questions are important in construct validations
Does the assessment adequately represent the intended construct? (construct
underrepresentation)
Problem-solving task turning into a memorization task
Is performance influenced by factors that are irrelevant to the construct?
(construct-irrelevant variance)
A mathematics test influenced by reading demands
4
3/20/2025
17 Assessment-criterion considerations
When test scores are to be used
to predict future performance
to estimate current performance on some valued measure other than the test
itself (called a criterion)
Concerned with evaluating the relationship between the test and the criterion
18 Assessment-criterion considerations
For example, can ALES scores indicate success at exams in master’s programs?
The degree of relationship can be described by statistically correlating the two sets of
scores
The resulting correlation coefficient provides a numerical summary of the degree
of relationship between the two sets of scores
Scatter plots and expectancy tables can also be used.
19 Example on Excel
20 Interpretation
21 Consideration of consequences
Assessments are intended to contribute to improved learning, but do they?
What impact do assessments have on teaching?
What are the possibly negative, unintended consequences of a particular use of
assessment results?
High importance attached to test results can lead teachers to focus narrowly on
what is on the test while ignoring important parts of the curriculum not covered by
the test
E.g. Changing the construct of teaching from problem-solving to memorization
ability because of a high-stakes test
An example: college professors prepare for the YDS for several years and end up
passing the exam but not speaking English
24 Reliability
The consistency of measurement
how consistent test scores or results are from one assessment to another
The more consistent the assessment results are from one measurement to another
the fewer errors there will be
Consequently, the greater the reliability
25 Reliability
An estimate of reliability refers to a particular type of consistency
Different periods of time
Different samples of tasks
Different raters
Low reliability means low validity
But high reliability does not mean high validity
26 Determining reliability with correlation methods
Consistency
over a period of time
over different forms of assessment
within the assessment itself
different raters
27 Test-retest method
The same assessment
administered twice to the same group of students
with a given time interval between the two (a measure of stability)
The interval should be neither too long nor too short for the purpose
The longer the interval between the first and second assessments
the more the scores are influenced by changes in the student characteristic being
measured, and the smaller the reliability coefficient will be
28 Test-retest method
Stability is important when results are used for several years
like English test scores, but not as important for a unit test
The test-retest method is not very relevant for teacher-constructed classroom tests
Not desirable to readminister the same assessment
In choosing standardized tests, stability is an important criterion
29 Equivalent (parallel)-forms method
Uses two different but equivalent forms of an assessment
Two different tests are prepared based on the same set of specifications
Administered to the same group of students in a short period of time
The resulting assessment scores are correlated
It does not tell anything about long-term stability
30 Split-half method
The assessment is administered to a group of students in the usual manner and
then is divided in half for scoring purposes
E.g. to score the even-numbered and the odd-numbered tasks separately
This produces two scores for each student
When correlated, provides a measure of internal consistency
To estimate the reliability of the full-length assessment, the Spearman-Brown
formula is applied
31 Interrater consistency
When student work is judgmentally scored
whether the same scores are assigned by another judge
Consistency can be evaluated with correlation
the scores assigned by one judge with those assigned by another judge
To achieve acceptable levels of interrater consistency
Agreed-on scoring rubrics
Training of raters to use those rubrics with examples of student work
32 Writing rubric
40 Examples
41 Reliability methods
To estimate the amount of variation to be expected in the scores
Standard error of measurement
The standard error of measurement is the standard deviation of the errors of
measurement
When the standard error of measurement is small, the confidence band is narrow
(indicating high reliability)
Greater confidence that the obtained score is near the true score
A teacher who is aware of the standard error of measurement realizes that it is
impossible to be dogmatic in interpreting minor differences in assessment scores
Tests with lower validity and reliability estimates should not be preferred merely
to save money