
PRINCIPLES OF LANGUAGE ASSESSMENT

Five major principles of language assessment are practicality, reliability, validity, authenticity, and
washback.

How do you know whether a test is effective, appropriate, useful, or, in down-to-earth terms, a "good" test? For the most part, that question can be answered by responding to such questions as:

• Can it be given within appropriate administrative constraints?

• Is it dependable?

• Does it accurately measure what you want it to measure?

• Does the language in the test represent real-world language use?

• Does the test provide information that is useful for the learner?
PRACTICALITY
Practicality refers to the logistical, down-to-earth, administrative issues involved in making, giving, and scoring an assessment instrument. These include "costs, the amount of time it takes to construct and to administer, ease of scoring, and ease of interpreting/reporting the results."
• A PRACTICAL TEST . . .
• stays within budgetary limits
• can be completed by the test-taker within appropriate time constraints
• has clear directions for administration
• appropriately utilizes available human resources
• does not exceed available material resources
• considers the time and effort involved to both design and score
RELIABILITY

• A reliable test is consistent and dependable. If you give the same test to the same student or matched students on two different occasions, the test should yield similar results.
• A RELIABLE TEST . . .
• has consistent conditions across two or more administrations
• gives clear directions for scoring/evaluation
• has uniform rubrics for scoring/evaluation
• lends itself to consistent application of rubrics by the scorer
• contains items/tasks that are unambiguous to the test-taker
• The reliability of tests can be better understood by considering the factors that contribute to it. We examine four possible sources of fluctuation:
• (1) the student,
• (2) the scoring,
• (3) the test administration, and
• (4) the test itself.
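The test-retest consistency described above can be quantified as a correlation between two administrations. Below is a minimal, illustrative sketch in Python; the scores, the helper function, and the choice of a simple Pearson coefficient are assumptions for demonstration, not part of the original discussion.

# Illustrative sketch: test-retest reliability estimated as the Pearson
# correlation between two administrations of the same test.
# All scores below are invented for demonstration.

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x ** 0.5 * var_y ** 0.5)

# Hypothetical scores for the same five students on two occasions.
first_administration = [72, 85, 90, 64, 78]
second_administration = [70, 88, 87, 66, 80]

r = pearson_r(first_administration, second_administration)
print(f"Test-retest reliability coefficient: {r:.2f}")  # values near 1.0 indicate consistency

A coefficient well below 1.0 would suggest that one or more of the four sources of fluctuation listed above is at work.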
Student-Related Reliability

• The most common learner-related issue in reliability is caused by temporary illness, fatigue, a "bad day," anxiety, and other physical or psychological factors, which may make an observed score deviate from one's "true" score. Also included in this category are such factors as a test-taker's test-wiseness, or strategies for efficient test-taking.
Rater Reliability
• Human error, subjectivity, and bias may enter into the scoring
process. Inter-rater reliability occurs when two or more scorers yield
consistent scores of the same test. Failure to achieve inter-rater
reliability could stem from lack of adherence to scoring criteria,
inexperience, inattention, or even preconceived biases.

• Intra-rater reliability is an internal factor, a common occurrence for classroom teachers. Such reliability can be violated in cases of unclear scoring criteria, fatigue, bias toward particular "good" and "bad" students, or simple carelessness.
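Inter-rater agreement is often reported with a chance-corrected statistic such as Cohen's kappa. The following is a minimal sketch using invented ratings; the function and data are illustrative assumptions, not a method prescribed by the text.

# Illustrative sketch: Cohen's kappa for two raters assigning categorical
# band levels to the same set of essays. All ratings are invented.

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two lists of categorical ratings."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Probability that the two raters would agree by chance alone.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

rater_a = ["B", "C", "B", "A", "B", "C", "A", "B"]
rater_b = ["B", "C", "B", "B", "B", "C", "A", "A"]

print(f"Cohen's kappa: {cohens_kappa(rater_a, rater_b):.2f}")  # prints 0.60

By convention, kappa values above roughly 0.6 are often read as substantial agreement; low values point back to the causes listed above, such as unclear criteria or inconsistent application of the rubric.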
Test Administration Reliability
• Unreliability may also result from the conditions in which the test is administered.
• For example, in one administration an audio player was used to deliver items for listening comprehension, but because of street noise outside the building, students sitting next to open windows could not hear the stimuli accurately.

• Other sources of unreliability are found in photocopying variations, the amount of light in different parts of the room, variations in temperature, and even the condition of desks and chairs.
Test Reliability
• Sometimes the nature of the test itself can cause measurement
errors. Tests with multiple-choice items must be carefully designed to
include a number of characteristics that guard against unreliability.
For example, the items need to be evenly difficult, distractors need to
be well designed, and items need to be well distributed to make the
test reliable.
• Other test-related threats to reliability include the allotted time, uneven item difficulty, unclear test items, and subjective test items.
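The "evenly difficult" requirement can be checked empirically with the item facility (IF) index, the proportion of test-takers who answer an item correctly. A minimal sketch follows; the response matrix and the 0.3 to 0.7 screening band are illustrative assumptions (the band is a common rule of thumb, not a fixed standard).

# Illustrative sketch: item facility (IF) = proportion of test-takers who
# answered an item correctly. Items near 0.0 or 1.0 discriminate poorly.
# The response matrix is invented (1 = correct, 0 = incorrect).

responses = [  # one row per student, one column per item
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
]

n_students = len(responses)
for item_number, column in enumerate(zip(*responses), start=1):
    facility = sum(column) / n_students
    flag = "" if 0.3 <= facility <= 0.7 else "  <- review this item"
    print(f"Item {item_number}: IF = {facility:.2f}{flag}")

With these invented data, items 1, 3, and 4 would be flagged as too easy or too hard for the group, while item 2 falls in the acceptable band.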
VALIDITY
• The most complex criterion of an effective test—and arguably the most important principle—is validity, "the extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment."

• Validity is "an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment."
• A VALID TEST . . .
• measures exactly what it proposes to measure
• does not measure irrelevant or "contaminating" variables
• relies as much as possible on empirical evidence (performance)
• involves performance that samples the test's criterion (objective)
• offers useful, meaningful information about a test-taker's ability
• is supported by a theoretical rationale or argument


Content-Related Evidence
• If a test actually samples the subject matter about which conclusions
are to be drawn, and if it requires the test-taker to perform the
behavior measured, it can claim content-related evidence of validity,
often popularly referred to as content- related validity.

• Content-related evidence can usually be identified observationally if you can clearly define the achievement you are measuring.
• Another way of understanding content validity is to consider the difference
between direct and indirect testing. Direct testing involves the test-taker in
actually performing the target task. In an indirect test, learners do not
perform the task itself but rather a task that is related in some way. For
example, if you intend to test learners’ oral production of syllable stress and
your test task is to have learners mark (with written accent marks) stressed
syllables in a list of written words, you could, with a stretch of logic, argue
you are indirectly testing their oral production. A direct test of syllable
production would require that students actually produce target words orally.
Criterion-Related Evidence
• Criterion-related validity is the extent to which the "criterion" of the test has actually been reached.
• Criterion-related validity evaluates how accurately a test measures the outcome it was designed to measure; an outcome might be a behavior or a level of performance. Concurrent validity measures the test and the criterion variable in the present.
• Classroom tests measure specified classroom objectives, and implied predetermined levels of performance are expected to be reached (80% is often considered a minimal passing grade).
• Criterion-related evidence is best demonstrated through a
comparison of results of an assessment with results of some other
measure of the same criterion.

• For example, in a course unit whose objective is for students to orally produce voiced and voiceless stops in all possible phonetic environments, the results of one teacher's unit test might be compared with an independent assessment—possibly a commercially produced test in a textbook—of the same phonemic proficiency.
• Criterion-related evidence usually falls into one of two categories: (1)
concurrent and (2) predictive validity.
• A test has concurrent validity if its results are supported by other
concurrent performance beyond the assessment itself. For example,
the validity of a high score on the final exam of a foreign-language
course will be substantiated by actual proficiency in the language.
• The predictive validity of an assessment becomes important in the
case of placement tests, admissions assessment batteries, and
achievement tests designed to determine students’ readiness to
"move on" to another unit. The assessment criterion in such cases is
not to measure concurrent ability but to assess (and predict) a test-
taker’s likelihood of future success.
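Both forms of criterion-related evidence are conventionally summarized as a validity coefficient: the correlation between test scores and scores on the criterion measure. Below is a minimal sketch, assuming invented scores and Python 3.10+ for statistics.correlation.

# Illustrative sketch: a concurrent-validity coefficient, i.e., the Pearson
# correlation between a teacher-made unit test and an independent measure of
# the same criterion (e.g., a commercially produced test). Scores are invented.

from statistics import correlation  # available in Python 3.10+

unit_test_scores = [82, 74, 91, 65, 88, 70]
independent_scores = [79, 72, 94, 60, 85, 75]

r = correlation(unit_test_scores, independent_scores)
print(f"Concurrent validity coefficient: {r:.2f}")  # nearer 1.0 = stronger evidence

For predictive validity the logic is the same, except that the criterion scores (e.g., later course grades) are collected after the test rather than at the same time.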
Construct-Related Evidence
• A construct is any theory, hypothesis, or model that attempts to
explain observed phenomena in our universe of perceptions.
Constructs may or may not be directly or empirically measured—their
verification often requires inferential data. Proficiency and
communicative competence are examples of linguistic constructs; self-
esteem and motivation are psychological constructs. Virtually every
issue in language learning and teaching involves theoretical constructs.
In the field of assessment, construct validity asks, "Does this test actually tap into the theoretical construct as it has been defined?"
Tests are, in a manner of speaking, operational definitions of constructs
in that their test tasks are the building blocks of the entity measured.
Consequential Validity (Impact)
• Consequential validity encompasses all the consequences of a test,
including such considerations as its accuracy in measuring intended
criteria, its effect on the preparation of test-takers, and the (intended
and unintended) social consequences of a test’s interpretation and use.

• Consequential validity thus encompasses the many consequences of assessment before and after a test administration. The impact of test-taking and the use of test scores can be seen at both a macro level (the effect on society and educational systems) and a micro level (the effect on individual test-takers).
• At the macro level, the wholesale use of standardized tests for such gatekeeping purposes as college admission "deprive[s] students of crucial opportunities to learn and acquire productive language skills," causing test consumers to be "increasingly disillusioned with EFL testing."

• As high-stakes assessment has gained ground in the past two decades, one aspect of consequential validity has drawn special attention: the effect of test-preparation courses and manuals on performance.
• At the micro level, specifically the classroom instructional level,
another important consequence of a test falls into the category of
washback. Waugh and Gronlund (2012) encourage teachers to
consider the effect of assessments on students’ motivation,
subsequent performance in a course, independent learning, study
habits, and attitude toward schoolwork.
Face Validity
• Face validity refers to the degree to which a test looks right and appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of the examinees who take it, the administrative personnel who decide on its use, and other psychometrically unsophisticated observers.
Teachers can increase students' perception of a test's fairness by using:
• formats that are expected and well constructed, with familiar tasks
• tasks that can be accomplished within an allotted time limit
• items that are clear and uncomplicated
• directions that are crystal clear
• tasks that have been rehearsed in their previous course work
• tasks that relate to their course work (content validity)
• a level of difficulty that presents a reasonable challenge


AUTHENTICITY
• Authenticity is “the degree of correspondence of the characteristics
of a given language test task to the features of a target language task”
and an agenda for identifying those target language tasks and for
transforming them into valid test items.
• Authenticity is not a concept that easily lends itself to empirical definition, operationalization, or measurement. After all, who can certify whether a task or language sample is "real-world" or not? Often such judgments are subjective, and yet authenticity is a concept that has occupied the attention of numerous language-testing experts.
• Moreover, many test types fail to simulate real-world tasks.

• AN AUTHENTIC TEST . . .
• contains language that is as natural as possible
• has items that are contextualized rather than isolated
• includes meaningful, relevant, interesting topics
• provides some thematic organization to items, such as through a story line or episode
• offers tasks that replicate real-world tasks
WASHBACK
• A facet of consequential validity is “the effect of testing on teaching
and learning”, known in the language assessment field as washback.

• Washback effect may refer to both the promotion and the inhibition
of learning, thus emphasizing what may be referred to as beneficial
versus harmful (or negative) washback.
• A TEST THAT PROVIDES BENEFICIAL WASHBACK . . .
• positively influences what and how teachers teach
• positively influences what and how learners learn
• offers learners a chance to adequately prepare
• gives learners feedback that enhances their language development
• is more formative in nature than summative
• provides conditions for peak performance by the learner
