Principles of Language Assessment
Assessment
1. Practicality
2. Reliability
3. Validity
4. Authenticity
5. Washback
PRACTICALITY
Student-Related Reliability
Content-Related Evidence
Multiple-choice tasks tend to be decontextualized.
Does the Test Offer Beneficial Washback to the Learner?
MAXIMIZING BOTH PRACTICALITY AND WASHBACK
Validity
● Validity is the degree to which a test or assessment actually measures what it is supposed to measure.
● In test validation, we are not examining the validity of the test content or even of the test scores themselves, but rather the validity of the way we interpret or use the information gathered through the testing procedure.
● In examining validity, we look beyond the reliability of the test scores themselves, and
consider the relationships between test performance and other types of performance in
other contexts.
● Validity is a unitary concept.
● For example, if another researcher examines the same research study and comes to the same conclusions, then the study is internally valid. With external validity, in contrast, the results and conclusions can be generalized to other situations or to other subjects.
Face Validity
Face validity indicates whether a measuring device or research instrument appears, on its face, to assess what it is intended to measure; it refers to the form and appearance of the instrument. Face validity has three meanings:
1. Validity by assumption
2. Validity by definition
3. Validity by appearance
It applies to the measurement of individual abilities such as honesty, intelligence, aptitude, and skill.
Construct Validity
Construct validity concerns the ability of a measuring instrument to capture the meaning of a concept.
Construct validity is seen as a unifying concept, and construct
validation as a process that combines all the evidentiary bases for
validity.
For example, a speaking test is meant to measure productive oral mastery, which is the construct of speaking. This construct of speaking includes fluency, pronunciation, content, organization, grammar, and diction. When a speaking test measures all of these, we can say that the test is valid by construct.
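As an illustrative sketch of how such a construct can be scored analytically, the composite below combines ratings of the six components named above (the 1-5 scale and the equal weighting are assumptions for illustration, not a prescribed rubric):

# Hypothetical analytic ratings for one test taker, one per construct component.
components = {
    "fluency": 4, "pronunciation": 3, "content": 5,
    "organization": 4, "grammar": 3, "diction": 4,
}
# Equal-weight composite; a real rubric might weight components differently.
speaking_score = sum(components.values()) / len(components)
print(f"composite speaking score: {speaking_score:.2f} / 5")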
Criterion Validity
Criterion validity takes two forms: (1) concurrent validity and (2) predictive validity. The example that follows illustrates predictive validity.
To examine the predictive utility of test scores, we need to collect data demonstrating a relationship between scores on the test and later course performance.
For example, suppose we run a program to train teachers at the S-2 (master's) level, and we construct a test whose purpose is to predict whether participants will be successful in their study at the S-2 level. The test is administered at the beginning of the S-2 program. By the end of the program we score the participants' success, and these scores are compared with the scores on the test administered at the beginning. If the comparison shows a correlation between the two sets of scores, that is, participants who got good scores on the test at the beginning of the program also got good grades at the end, then we can conclude that the test at the beginning of the program has predictive validity.
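A minimal sketch of how this comparison could be computed, assuming the entry-test scores and end-of-program grades are available for the same participants (the variable names and data here are illustrative, not taken from the slides):

import numpy as np

# Illustrative data: one value per participant, in the same order in both arrays.
entry_test = np.array([62, 75, 81, 58, 90, 70, 66, 85])    # test at the start of the program
final_grades = np.array([65, 72, 84, 55, 92, 74, 60, 88])  # success scores at the end

# Pearson correlation between the two sets of scores; a strong positive
# value supports the claim that the entry test has predictive validity.
r = np.corrcoef(entry_test, final_grades)[0, 1]
print(f"predictive-validity coefficient r = {r:.2f}")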
Evidence supporting construct validity
1. Experimental evidence: individuals are randomly assigned to two or more groups, and each group is given a different treatment. At the end of the treatment, observations are made to investigate the differences between the groups.
2. Correlational evidence: correlations between scores on the test and other measures of the same construct.
Test Bias
A biased test is one that measures a student's skills and knowledge in a way that is inappropriate, or that penalizes a group of students because of racial, ethnic, socioeconomic, or gender differences.
This can happen when test items presuppose particular cultural contexts, racial stereotypes, or gender biases.
Cultural Background
Tests based on a majority culture measure cultural experiences and backgrounds that come from that culture.
Minority groups taking the test are measured unfairly because of their lack of familiarity with constructs from the majority culture.
Knowledge Background
One study examined the performance of individuals with different content specializations on reading tests.
The results showed that students' performance was heavily influenced by their prior background knowledge in addition to their language skills.
Cognitive Characteristics
There is as yet no evidence relating performance on language tests to other characteristics such as inhibition, extroversion, aggression, attitude, and motivation, which have been mentioned with regard to second language learning.
This is not to say that these factors do not affect performance on language tests.
The consequential or ethical basis of validity
This refers to the impact of a test on its test-takers.
When a teacher determines that the final exam will be conducted over the internet, the consequence is that the test-takers must be prepared to take an internet-based test.
Otherwise, the students' test results will not be valid, since a test-taker's performance may be disrupted by an inability to use the internet.
Messick (1980, 1988b) has identified four areas to be considered in the ethical use and
interpretation of test results.
The first consideration is that of construct validity, or the evidence that supports the
particular interpretation we wish to make.
A second area of consideration is that of value systems that inform the particular test
use.
A third consideration is that of the practical usefulness of the test.
The fourth area of concern in determining appropriate test use is that of the
consequences to the educational system or society of using test results for a particular
purpose.
Reliability
Introduction
A fundamental concern in the development and use of language tests is to
identify potential sources of error in a given measure of communicative
language ability and to minimize the effect of these factors on that measure.
We must be concerned about errors of measurement, or unreliability,
because we know that test performance is affected by factors other than the
abilities we want to measure.
For example, we can all think of factors such as poor health, fatigue, lack of interest or motivation, and test-wiseness that can affect individuals' test performance but are not generally associated with language ability, and are thus not characteristics we want to measure with language tests.
Factors that Affect Language Test Scores
Measurement specialists have long recognized that the
examination of reliability depends upon our ability to distinguish
the effects (on test scores) of the abilities we want to measure
from the effects of other factors.
Given the means of estimating reliability through computing the correlation between parallel tests,
we can derive a means for estimating the measurement error, as well. If an individual's observed
score on a test is composed of a true score and an error score, the greater the proportion of true
score, the less the proportion of error score, and thus the more reliable the observed score.
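In standard classical true score notation (standard formulas, not quoted from the slides), this relationship is:

x = x_t + x_e
\sigma_x^2 = \sigma_t^2 + \sigma_e^2
r_{xx'} = \sigma_t^2 / \sigma_x^2 = 1 - \sigma_e^2 / \sigma_x^2

so the reliability coefficient r_{xx'} is the proportion of observed-score variance that is true-score variance.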
In any given test situation, there will probably be more than one source of measurement error. If, for
example, we give several groups of individuals a test of listening comprehension in which they
listen to short dialogues or passages read aloud and then select the correct answer from among four
written choices, we assume that test takers' scores on the test will vary according to their different
levels of listening comprehension ability.
Internal consistency
Internal consistency is concerned with how consistent test takers’ performances on the different
parts of the test are with each other. Inconsistencies in performance on different parts of tests can be caused by a number of factors, including the test method facets discussed earlier.
One approach to examining the internal consistency of a test is the split-half method, in which we
divide the test into two halves and then determine the extent to which scores on these two halves
are consistent with each other.
The Spearman-Brown split-half estimate
Once the test has been split into halves, it is rescored, yielding two scores, one for each half, for each test taker. In one approach to estimating reliability, we then compute the correlation between the two sets of scores. This gives us an estimate of how consistent the two halves are with each other; however, we are interested in the reliability of the whole test, so the half-test correlation must be adjusted upward.
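The standard Spearman-Brown correction makes this adjustment: r_full = 2 r_half / (1 + r_half). A minimal sketch of the whole procedure, assuming the two halves have already been scored (the score arrays are illustrative):

import numpy as np

# Illustrative scores of eight test takers on the two halves of one test.
half_a = np.array([10, 14, 9, 16, 12, 18, 11, 15])
half_b = np.array([11, 13, 8, 17, 12, 16, 10, 14])

r_half = np.corrcoef(half_a, half_b)[0, 1]  # consistency of the two halves
r_full = 2 * r_half / (1 + r_half)          # Spearman-Brown estimate for the whole test
print(f"half-test r = {r_half:.2f}, estimated full-test reliability = {r_full:.2f}")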
Rater consistency
The three approaches to estimating reliability that have been developed within the CTS measurement model are concerned with different sources of error. The particular approach or approaches that we use will depend on what we believe the sources of error are in our measures, given the particular type of test, administrative procedures, types of test takers, and the use of the test.
Problems with the classical true score model
Generalizability theory
A broad model for investigating the relative effects of different sources of variance in test scores
has been developed by Cronbach and his colleagues (Cronbach et al. 1963; Gleser et al. 1965;
Cronbach et al. 1972). This model, which they call generalizability theory (G-theory), is grounded in the framework of factorial design and the analysis of variance. It constitutes a
theory and set of procedures for specifying and estimating the relative effects of different factors
on observed test scores, and thus provides a means for relating the uses or interpretations to be
made of test scores to the way test users specify and interpret different factors as either abilities
or sources of error.
Universes of generalization and universes of measures
When we want to develop or select a test, we generally know the use or uses for which it
is intended, and may also have an idea of what abilities we want to measure. In other
words, we have in mind a universe of generalization, a domain of uses or abilities (or
both) to which we want test scores to generalize.
Populations of persons
In addition to defining the universe of possible measures, we must define the group, or population of persons about whom we are going to make decisions or inferences. The way in which we define this population will be determined by the degree of generalizability we need for
the given testing situation. If we intend to use the test results to make decisions about only one
specific group, then that group defines our population of persons.
Universe score
If we could obtain measures for an individual under all the different conditions specified
in the universe of possible measures, his average score on these measures might be
considered the best indicator of his ability. A universe score xp is thus defined as the
mean of a person's scores on all measures from the universe of possible measures (this
universe of possible measures being defined by the facets and conditions of concern for a
given test use).
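In the usual G-theory notation, this definition is simply an expected value: if x_{pi} is person p's score on measure i, then

x_p = E(x_{pi})

taken over all measures i in the universe of possible measures.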
Standard error of measurement: interpreting individual test scores within classical true score and generalizability theory
The approaches to estimating reliability that have been developed within both CTS theory and
G-theory are based on group performance, and provide information for test developers and test
users about how consistent the scores of groups of individuals are on a given test. However,
reliability and generalizability coefficients provide no direct information about the accuracy of
individual test scores.
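The statistic that supplies this individual-level information is the standard error of measurement, which in its usual CTS form (a standard formula, not quoted from the slides) is

SEM = s_x \sqrt{1 - r_{xx'}}

where s_x is the standard deviation of the observed scores and r_{xx'} is the reliability estimate. Roughly two thirds of the time, an individual's observed score will fall within one SEM of his or her true score.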
Item response theory
A major limitation to CTS theory is that it does not provide a very satisfactory basis
for predicting how a given individual will perform on a given item. There are two
reasons for this. First, CTS theory makes no assumptions about how an individual’s
level of ability affects the way he performs on a test.
Item response theory is based on stronger, or more restrictive, assumptions than CTS theory, and
is thus able to make stronger predictions about individuals’ performance on individual items, their
levels of ability, and about the characteristics of individual items. In order to incorporate
information about test takers’ levels of ability, IRT must make an assumption about the number of
abilities being measured.
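As a concrete illustration, the widely used three-parameter logistic model expresses the probability that a test taker of ability \theta answers item i correctly as

P_i(\theta) = c_i + (1 - c_i) / (1 + e^{-a_i(\theta - b_i)})

where a_i is the item's discrimination, b_i its difficulty, and c_i a lower asymptote allowing for guessing; fixing a_i = 1 and c_i = 0 yields the one-parameter (Rasch) model.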
Additional information
Question (Rahma Kamanda Sari): What are some issues that could affect the validity of an assessment?