Principles of Language Assessment
A. Practicality
An effective test is practical. This means that it
Is not excessively expensive,
Stays within appropriate time constraints,
Is relatively easy to administer, and
Has a scoring/evaluation procedure that is specific and time-efficient.
A test that is prohibitively expensive is impractical. A test of language proficiency that takes a
student five hours to complete is impractical; it consumes more time (and money) than necessary to
accomplish its objective. A test that requires individual one-on-one proctoring is impractical for a
group of several hundred test-takers and only a handful of examiners. A test that takes a few
minutes for a student to take and several hours for an examiner to evaluate is impractical for most
classroom situations.
B. Reliability
A reliable test is consistent and dependable. If you give the same test to the same student or
matched students on two different occasions, the test should yield similar results. The issue of
reliability of a test may best be addressed by considering a number of factors that may contribute to
the unreliability of a test. Consider the following possibilities (adapted from Mousavi, 2002, p. 804):
fluctuations in the student, in scoring, in test administration, and in the test itself.
Student-Related Reliability
The most common learner-related issue in reliability is caused by temporary illness, fatigue, a “bad
day,” anxiety, and other physical or psychological factors, which may make an “observed” score
deviate from one’s “true” score. Also included in this category are such factors as a test-taker’s
“test-wiseness” or strategies for efficient test taking (Mousavi, 2002, p. 804).
Rater Reliability
Human error, subjectivity, and bias may enter into the scoring process. Inter-rater unreliability occurs
when two or more scorers yield inconsistent scores on the same test, possibly for lack of attention to
scoring criteria, inexperience, inattention, or even preconceived biases. In the story above about the
placement test, the initial scoring plan for the dictations was found to be unreliable; that is, the two
scorers were not applying the same standards.
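A simple way to check scorer consistency is to correlate two raters' marks for the same set of papers. The sketch below is illustrative only and uses invented scores; a correlation well below 1.0 would suggest, as in the dictation example, that the two scorers are not applying the same standards.

```python
# Illustrative sketch: checking inter-rater consistency by correlating
# two scorers' marks for the same papers. All scores are invented.
from math import sqrt

rater_a = [18, 14, 20, 11, 16, 9, 17]   # hypothetical marks from scorer A
rater_b = [17, 15, 19, 13, 14, 10, 18]  # hypothetical marks from scorer B

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

print(f"Inter-rater correlation: {pearson(rater_a, rater_b):.2f}")
```

A value close to 1.0 indicates that the scorers rank the papers similarly; a much lower value signals that the scoring criteria need to be clarified or the raters retrained.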
Test Reliability
Sometimes the nature of the test itself can cause measurement errors. If a test is too long, test-takers
may become fatigued by the time they reach the later items and hastily respond incorrectly. Timed
tests may discriminate against students who do not perform well on a test with a time limit. We all
know people (and you may be included in this category) who “know” the course material perfectly
but who are adversely affected by the presence of a clock ticking away. Poorly written test items
(that are ambiguous or that have more than one correct answer) may be a further source of test
unreliability.
C. Validity
By far the most complex criterion of an effective test, and arguably the most important principle, is
validity, “the extent to which inferences made from assessment results are appropriate, meaningful,
and useful in terms of the purpose of the assessment” (Gronlund, 1998, p. 226). A valid test of reading
ability actually measures reading ability: not 20/20 vision, nor previous knowledge in a subject, nor
some other variable of questionable relevance. To measure writing ability, one might ask students
to write as many words as they can in 15 minutes, then simply count the words for the final score.
Such a test would be easy to administer (practical), and the scoring quite dependable (reliable). But
it would not constitute a valid test of writing ability without some consideration of
comprehensibility, rhetorical discourse elements, and the organization of ideas, among other
factors.
Content-Related Evidence
If a test actually samples the subject matter about which conclusions are to be drawn, and if it
requires the test-takers to perform the behavior that is being measured, it can claim content-related
evidence of validity, often popularly referred to as content validity (e.g., Mousavi, 2002; Hughes,
2003). You can usually identify content-related evidence observationally if you can clearly define
the achievement that you are measuring.
Criterion-Related Evidence
A second form of evidence of the validity of a test may be found in what is called criterion-related
evidence, also referred to as criterion-related validity, or the extent to which the “criterion” of the
test has actually been reached. You will recall that in Chapter I it was noted that most classroom-
based assessment with teacher-designed tests fits the concept of criterion-referenced assessment. In
such tests, specified classroom objectives are measured, and implied predetermined levels of
performance are expected to be reached (for example, 80 percent is often considered a minimal passing grade).
Construct-Related Evidence
A third kind of evidence that can support validity, but one that does not play as large a role for
classroom teachers, is construct-related evidence, commonly referred to as construct validity. A
construct is any theory, hypothesis, or model that attempts to explain observed phenomena in our
universe of perceptions. Constructs may or may not be directly or empirically measured; their
verification often requires inferential data.
Consequential Validity
As well as the above three widely accepted forms of evidence that may be introduced to support the
validity of an assessment, two other categories may be of some interest and utility in your own
quest for validating classroom tests. Messick (1989), Gronlund (1998), McNamara (2000), and
Brindley (2001), among others, underscore the potential importance of the consequences of using an
assessment. Consequential validity encompasses all the consequences of a test, including such
considerations as its accuracy in measuring intended criteria, its impact on the preparation of test-
takers, its effect on the learner, and the (intended and unintended) social consequences of a test’s
interpretation and use.
Face Validity
An important facet of consequential validity is the extent to which “students view the assessment as
fair, relevant, and useful for improving learning” (Gronlund, 1998, p. 210), or what is popularly
known as face validity. “Face validity refers to the degree to which a test looks right, and appears to
measure the knowledge or abilities it claims to measure, based on the subjective judgment of the
examinees who take it, the administrative personnel who decide on its use, and other
psychometrically unsophisticated observers” (Mousavi, 2002, p. 244).
D. Authenticity
A fourth major principle of language testing is authenticity, a concept that is a little slippery to
define, especially within the art and science of evaluating and designing tests. Bachman and Palmer
(1996, p. 23) define authenticity as “the degree of correspondence of the characteristics of a given
language test task to the features of a target language task,” and then suggest an agenda for
identifying those target language tasks and for transforming them into valid test items.
E. Washback
A facet of consequential validity, discussed above, is “the effect of testing on teaching and learning”
(Hughes, 2003, p. 1), otherwise known among language-testing specialists as washback. In large-
scale assessment, washback generally refers to the effects that tests have on instruction in terms of how
students prepare for the test. “Cram” courses and “teaching to the test” are examples of such
washback. Another form of washback that occurs more in classroom assessment is the information
that “washes back” to students in the form of useful diagnoses of strengths and weaknesses.
Washback also includes the effects of an assessment on teaching and learning prior to the
assessment itself, that is, on preparation for the assessment.
Reference :
Brown, H. Douglas. 2004. Language Assessment: Principles and Classroom Practices. New York:
Longman.
KINDS OF TESTS
A. Based on Purposes
There are many kinds of tests; each test has a specific purpose and a particular criterion to be
measured. This paper will explain five kinds of tests based on specific purposes. Those tests
are the proficiency test, the diagnostic test, the placement test, the achievement test, and the language aptitude test.
1. Proficiency Test
The purpose of a proficiency test is to test global competence in a language. It tests overall ability
regardless of any training test-takers have previously had in the language. Proficiency tests have traditionally
consisted of standardized multiple-choice items on grammar, vocabulary, reading comprehension,
and listening comprehension. One example of a standardized proficiency test is the TOEFL.
2. Diagnostic Test
The purpose is to diagnose specific aspects of a language. These tests offer a checklist of features for
the teacher to use in discovering difficulties. Diagnostic tests should elicit information on what
students need to work on in the future; therefore the test will typically offer more detailed,
subcategorized information on the learner. For example, a writing diagnostic test would first elicit a
writing sample from the students. Then, the teacher would examine the organization, content, spelling,
grammar, and vocabulary of their writing. Based on that analysis, the teacher would know the needs
of students that should receive special focus.
3. Placement Test
The purpose of placement test is to place a student into a particular level or section of a language
curriculum or school. It usually includes a sampling of the material to be covered in the various
courses in a curriculum. A student’s performance on the test should indicate the point at which the
student will find material neither too easy nor too difficult. Placement tests come in many varieties:
assessing comprehension and production, responding through written and oral performance,
multiple choice, and gap-filling formats. One example of a placement test is the English as a
Second Language Placement Test (ESLPT) at San Francisco State University.
4. Achievement Test
The purpose of achievement tests is to determine whether course objectives have been met, with
skills acquired, by the end of a period of instruction. Achievement tests should be limited to
particular material addressed in a curriculum within a particular time frame. Achievement tests
are summative because they are administered at the end of a unit or term of study. They analyze
the extent to which students have acquired language that has already been taught.
5. Language Aptitude Test
The purpose of a language aptitude test is to predict a person’s success prior to exposure to the foreign
language. According to John Carroll and Stanley Sapon (the authors of the MLAT), language aptitude
tests do not indicate whether or not an individual can learn a foreign language; rather, they indicate how
well an individual can learn a foreign language in a given amount of time and under given
conditions. In other words, this test is done to determine how quickly and easily a learner learns a
language in a language course or language training program. Two standardized aptitude tests have been
used in the United States:
1. The Modern Language Aptitude Test (MLAT)
2. The Pimsleur Language Aptitude Battery (PLAB)
B. Based on Response
There are two kinds of tests based on response. They are subjective test and objective test.
1. Subjective Test
A subjective test is a test in which the learner’s ability or performance is judged by the examiner’s
opinion and judgment. Examples of subjective test formats are essays and short-answer items.
2. Objective Test
An objective test is a test in which the learner’s ability or performance is measured using a specific set of
answers; that is, each item has only two possible outcomes, right or wrong. In other words, the score is
based on the number of right answers. Types of objective tests include multiple-choice tests, true/false tests,
matching, and problem-based questions.
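To make the contrast concrete, here is a minimal sketch of objective scoring, with an invented answer key and invented responses: each item is marked right or wrong against the key, so no examiner judgment enters the score.

```python
# Illustrative sketch: scoring an objective (multiple-choice) test against
# a fixed answer key. The key and the responses are invented.
answer_key = {1: "B", 2: "D", 3: "A", 4: "C", 5: "B"}
student_responses = {1: "B", 2: "C", 3: "A", 4: "C", 5: "D"}

correct = sum(
    1 for item, key in answer_key.items()
    if student_responses.get(item) == key
)
print(f"{correct}/{len(answer_key)} correct ({correct / len(answer_key):.0%})")
```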
Advantages and Disadvantages of Commonly Used Types of Objective Test
True or False
Advantages: Many items can be administered in a relatively short time. Moderately easy to write and easily scored.
Disadvantages: Limited primarily to testing knowledge of information. Easy to guess correctly on many items, even if material has not been mastered.

Multiple Choice
Advantages: Can be used to assess a broad range of content in a brief period. Skillfully written items can measure higher order cognitive skills. Can be scored quickly.
Disadvantages: Difficult and time consuming to write good items. Possible to assess higher order cognitive skills, but most items assess only knowledge. Some correct answers can be guessed.
Tests can also be classified by whether they measure competence (knowledge of the language system) or performance (use of the basic skills), and by whether the elicitation is direct or indirect:

                      Direct    Indirect
Competence/system       I          II
Performance             III        IV
1. Direct Competence Tests
The direct competence test is a test that focuses on measuring the students’ knowledge of a
language component, such as grammar or vocabulary, where the elicitation uses one of the basic skills:
speaking, listening, reading, or writing. For example, a teacher wants to know about the students’
grammar knowledge, so the teacher asks the students to write a letter to elicit their knowledge of
grammar.
2. Indirect Competence Test
The indirect competence test is a test that focuses on measuring the students’ knowledge of a
language component, such as grammar or vocabulary, where the elicitation does not use one of the
basic skills: speaking, listening, reading, or writing. The elicitation in this test uses other formats, such
as multiple choice. For example, the teacher wants to know about the students’ grammar knowledge, so the
teacher gives the students a multiple-choice test to measure their knowledge of grammar.
3. Direct Performance Test
A direct performance test is a test that focuses on measuring the students’ skill in reading, writing,
speaking, and listening, where the elicitation is through direct communication. For example, if the
teacher wants to know the students’ skill in writing, the teacher asks the students to write a letter or
a short story.
4. Indirect Performance Test
An indirect performance test is a test that focuses on measuring the students’ skill in reading, writing,
speaking, and listening, where the elicitation does not use the basic skill directly. For example, the teacher
wants to measure the students’ skill in listening. The teacher gives some pictures and asks the
students to arrange the pictures in the correct order based on the story that they listen to.
1. Norm-Referenced Test
Norm-referenced tests are designed to highlight achievement differences between and among
students to produce a dependable rank order of students across a continuum of achievement from
high achievers to low achievers (Stiggins, 1994). School systems might want to classify students in
this way so that they can be properly placed in remedial or gifted programs. The content of norm-
referenced tests is selected according to how well it ranks students from high achievers to low. In
other words, the content selected in norm-referenced tests is chosen by how well it discriminates
among students. A student’s performance on a norm-referenced test is interpreted in relation to
the performance of a large group of similar students who took the test when it was first normed.
For example, if a student receives a percentile rank score on the total test of 34, this means that he or
she performed as well as or better than 34% of the students in the norm group. This type of
information can be useful for deciding whether or not a student needs remedial assistance or is a
candidate for a gifted program. However, the score gives little information about what the student
actually knows or can do.
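To illustrate the percentile-rank interpretation, the sketch below uses an invented norm group and a simplified definition: the percentile rank is the percentage of norm-group scores at or below the student’s raw score.

```python
# Illustrative sketch: percentile rank of a raw score within a norm group.
# The norm-group scores are invented for illustration.
norm_group_scores = [12, 15, 18, 21, 23, 25, 27, 30, 33, 36,
                     38, 40, 42, 45, 47, 50, 52, 55, 58, 60]

def percentile_rank(raw_score, norm_scores):
    """Percentage of the norm group scoring at or below the raw score."""
    at_or_below = sum(1 for s in norm_scores if s <= raw_score)
    return 100 * at_or_below / len(norm_scores)

print(f"Percentile rank: {percentile_rank(27, norm_group_scores):.0f}")
```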
2. Criterion-Referenced Test
Criterion-referenced tests determine what test takers can do and what they know, not how they
compare to others (Anastasi, 1988). Criterion-referenced tests report how well students are doing
relative to a pre-determined performance level on a specified set of educational goals or outcomes
included in the school, district, or state curriculum. Educators may choose to use a criterion-
referenced test when they wish to see how well students have learned the knowledge and skills
which they are expected to have mastered. This information may be used as one piece of
information to determine how well the student is learning the desired curriculum and how well the
school is teaching that curriculum. The content of a criterion-referenced test is determined by how
well it matches the learning outcomes deemed most important. In other words, the content selected
for a criterion-referenced test is chosen on the basis of its significance in the curriculum. Criterion-
referenced tests give detailed information about how well a student has performed on each of the
educational goals or outcomes included on that test.
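As a contrast with the norm-referenced example above, the following sketch (objectives, cut score, and item counts all invented) reports a student’s performance on each educational goal against a predetermined criterion, here 80 percent, rather than against other students.

```python
# Illustrative sketch: criterion-referenced reporting. Each objective is
# judged against a predetermined performance level (80% here), not against
# the scores of other students. All data are invented.
CRITERION = 0.80

# objective -> (items answered correctly, items testing that objective)
results_by_objective = {
    "past tense forms":      (9, 10),
    "question formation":    (6, 10),
    "prepositions of place": (8, 10),
}

for objective, (correct, total) in results_by_objective.items():
    proportion = correct / total
    status = "mastered" if proportion >= CRITERION else "needs more work"
    print(f"{objective}: {correct}/{total} ({proportion:.0%}) - {status}")
```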