
Principles of Language Assessment

Azwinatul Hikmah 22178003


Homanfil Atori N.W 22178008
Khairun Nisa Simanjuntak 22178011
PRINCIPLES OF LANGUAGE ASSESSMENT
TABLE OF CONTENTS

1. Practicality
2. Reliability
3. Validity
4. Authenticity
5. Washback
PRACTICALITY

Practicality refers to the logistical, down-to-earth, administrative issues involved in making, giving, and scoring an assessment instrument.
CRITERIA FOR A PRACTICAL TEST

A practical test:
● stays within budgetary limits
● can be completed by the test-taker within appropriate time constraints
● has clear directions for administration
● appropriately utilizes available human resources
● does not exceed available material resources
● considers the time and effort involved to both design and score
RELIABILITY

A reliable test is consistent and dependable. If you give the same test to the same student or matched students on two different occasions, the test should yield similar results.

The principle of reliability includes the following. A reliable test:
● has consistent conditions across two or more administrations
● gives clear directions for scoring/evaluation
● has uniform rubrics for scoring/evaluation
● lends itself to consistent application of rubrics by the scorer
● contains items/tasks that are unambiguous to the test-taker
FOUR TYPES OF RELIABILITY ARE DISCUSSED IN THE SUBSEQUENT SECTIONS.

1. Student-Related Reliability

The most common learner-related issue in reliability is caused by temporary illness, fatigue, a "bad day," anxiety, and other physical or psychological factors, which may make an observed score deviate from one's "true" score.
2. Rater Reliability

● Lumley (2002) provided some helpful hints to ensure inter-rater reliability.
● Rater-reliability issues are not limited to contexts in which two or more scorers are involved.
● Intra-rater reliability is an internal factor, a common occurrence for classroom teachers.
3. Test Administration Reliability

We once witnessed the administration of a test of aural comprehension in which an audio player was used to deliver items for comprehension, but because of street noise outside the building, students sitting next to open windows could not hear the recording clearly.
4. Test Reliability

Unreliability in the test itself typically occurs with subjective tests with open-ended responses (e.g., essay responses) that require a judgment on the part of the teacher to determine correct and incorrect answers.
VALIDITY
Samuel Messick (1989), who is widely recognized as an expert on validity, defined it as "an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment."
A valid test

● measures exactly what it proposes to measure
● does not measure irrelevant or "contaminating" variables
● relies as much as possible on empirical evidence (performance)
● involves performance that samples the test's criterion (objective)
● offers useful, meaningful information about a test-taker's ability
● is supported by a theoretical rationale or argument
TYPES OF EVIDENCE

Content-Related Evidence

If a test actually samples the subject matter about which conclusions are to be drawn, and if it requires the test-taker to perform the behavior measured, it can claim content-related evidence of validity, often popularly referred to as content-related validity (e.g., Hughes, 2003; Mousavi, 2009).
Criterion-Related Evidence

A second form of evidence of the validity of a test may be found in what is called criterion-related evidence, also referred to as criterion-related validity, or the extent to which the "criterion" of the test has actually been reached.
Construct-Related Evidence

Tests are, in a manner of speaking, operational definitions of constructs in that their test tasks are the building blocks of the entity measured (see Chapelle, 2016; McNamara, 2006; and Weir, 2005).
3. Consequential Validity (Impact)

Consequential validity encompasses all the consequences of a test, including such considerations as its accuracy in measuring intended criteria, its effect on the preparation of test-takers, and the (intended and unintended) social consequences of a test's interpretation and use.
4. Face Validity

Face validity refers to the degree to which a test looks right, and appears to measure the knowledge or abilities it claims to measure (Mousavi, 2009).
AUTHENTICITY

Bachman and Palmer (1996) defined authenticity as "the degree of correspondence of the characteristics of a given language test task to the features of a target language task" and then suggested an agenda for identifying those target language tasks and for transforming them into valid test items.

AN AUTHENTIC TEST
● contains language that is as natural as possible
● has items that are contextualized rather than isolated
● includes meaningful, relevant, interesting topics
● provides some thematic organization to items, such as through a story line or episode
● offers tasks that replicate real-world tasks
WASHBACK

Alderson and Wall (1993) considered washback an important enough concept to define a washback hypothesis that essentially elaborated on how tests influence both teaching and learning.

A TEST THAT PROVIDES BENEFICIAL WASHBACK
● positively influences what and how teachers teach
● positively influences what and how learners learn
● offers learners a chance to adequately prepare
● gives learners feedback that enhances their language development
● is more formative in nature than summative
● provides conditions for peak performance by the learner
APPLYING PRINCIPLES TO CLASSROOM TESTING
● Are the Test Procedures Practical?
● Is the Test Itself Reliable?
● Can You Ensure Rater Reliability?
● Does the Procedure Demonstrate Content Validity?
● Has the Impact of the Test Been Carefully Accounted for?
● Are the Test Tasks as Authentic as Possible? (Multiple-choice tasks tend to be decontextualized.)
● Does the Test Offer Beneficial Washback to the Learner?
MAXIMIZING BOTH PRACTICALITY AND WASHBACK
Validity
● Validity is defined as the degree to which a test or assessment measures what it is supposed to measure.
● In test validation, we are not examining the validity of the test content or even of the test scores themselves, but rather the validity of the way we interpret or use the information gathered through the testing procedure.
● In examining validity, we look beyond the reliability of the test scores themselves and consider the relationships between test performance and other types of performance in other contexts.
● Validity is a unitary concept.
● For example, if a researcher examines one particular research study and comes up with the same conclusions, then the research study is internally valid. In contrast, with external validity, the results and conclusions can be generalized to other situations or to other subjects.
Validity

Validity includes:
● Internal Validity
● External Validity
● Content Validity
● Face Validity
● Construct Validity
● Concurrent Validity
● Predictive Validity
Reliability & Validity
Content Relevance & Content Coverage (Content Validity)

Two aspects of content validation:
1. Content relevance
Requires the specification of the behavioral domain in question and the attendant specification of the task or test domain.
2. Content coverage
The extent to which the tasks required in the test adequately represent the domain of behavior in question.
Content Validity
● Content validity relates to the ability of an instrument to measure the content (concept) that must be measured. This means that a measuring instrument is able to reveal the content of a concept or variable to be measured.
● Content validity relates to the process of logical analysis.
● For example, a test in the science field of study must be able to reveal the content of that field of study, and a motivation measure must be able to measure all aspects related to the concept of motivation.
Face Validity
Face validity indicates whether a measuring device or research instrument appears, on the face of it, to assess what you want to measure; this type of validity refers more to the form and appearance of the instrument. Three meanings of face validity:
1. Validity by assumption
2. Validity by definition
3. Validity by appearance
It applies to measurements of individual abilities such as honesty, intelligence, aptitude, and skill.
Construct Validity
● Construct validity is related to the ability of a measuring instrument to measure the meaning of a concept.
● Construct validity is seen as a unifying concept, and construct validation as a process that combines all the evidentiary bases for validity.
● For example, a speaking test is meant to measure productive oral mastery, which is the construct of speaking. This construct includes fluency, pronunciation, content, organization, grammar, and diction. When a speaking test measures all of these, we can say that the test is valid by construct.
Criterion Validity
1. Concurrent Validity

Information on concurrent criterion relatedness is used in language testing. The information typically takes two forms:
(1) examining differences in test performance among groups of individuals at different levels of language ability, and
(2) examining correlations among various measures of a given ability.
For example, suppose the TOEFL is taken to be a valid proficiency test. We construct another proficiency test and administer it to our students, who have also taken the TOEFL. The results of our test are compared with the TOEFL results using the product-moment correlation formula. If there is a high correlation between the two tests, the test that we made has concurrent validity (with the TOEFL).
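As a minimal sketch (not part of the original slides) of the product-moment comparison described above, the code below computes a Pearson correlation between two paired sets of scores. The score values and the rough 0.7 threshold are hypothetical, chosen only for illustration.

```python
from statistics import mean, stdev


def pearson_r(x, y):
    """Product-moment correlation between two paired score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("Need two score lists of equal length (n >= 2)")
    mx, my = mean(x), mean(y)
    # Sample covariance (n - 1 denominator) to match statistics.stdev
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))


# Hypothetical scores: our classroom proficiency test vs. TOEFL (same students)
our_test = [62, 71, 55, 80, 90, 66, 73, 85]
toefl = [480, 520, 450, 560, 600, 500, 530, 580]

r = pearson_r(our_test, toefl)
print(f"Product-moment correlation r = {r:.2f}")
# A high positive r (say, above roughly 0.7) would be taken as evidence that
# our test has concurrent validity with respect to the TOEFL criterion.
```

The same computation applies to the predictive-validity example in the next section, with entry-test scores correlated against end-of-program grades instead of TOEFL scores.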
2. Predictive Validity

● To examine the predictive utility of test scores, we need to collect data demonstrating a relationship between scores on the test and later course performance.

● For example, suppose we have a program to train teachers at the S-2 (master's) level, and we make a test whose purpose is to predict whether participants will be successful in their S-2 study. The test is administered at the beginning of the S-2 program. At the end of the program we score the participants' success, and these scores are compared with the scores on the test administered at the beginning. If the comparison shows a correlation between the two sets of scores, that is, participants who score well on the entry test also earn good grades at the end of the program, then we can conclude that the entry test has predictive validity.
Evidence supporting construct validity

1. Correlational Evidence
Correlational evidence comes from statistical procedures that examine the relationship between variables, or measures.

2. Experimental Evidence
● Individuals are randomly assigned into two or more groups, and each group is given a different treatment.
● At the end of the treatment, observations are made to investigate the differences between the groups.
Test Bias

● A biased test is an assessment that measures a student's skills and knowledge disproportionately, or that penalizes a group of students because of racial, ethnic, socioeconomic, or gender differences.

● This can happen when assessments rely on particular cultural contexts, racial stereotypes, or gender biases.
Cultural Background
● Tests based on a majority culture end up measuring cultural experiences and backgrounds that come from that culture.
● Minority-group test-takers are measured unfairly because of a lack of familiarity with constructs from the majority group.

Knowledge Background
● The study examined the performance of individuals with different content specializations on reading tests.
● The results showed that students' performance was heavily influenced by their previous background knowledge, such as their language skills.

Cognitive Characteristics
● There is no evidence as yet relating performance on language tests to other characteristics such as inhibition, extroversion, aggression, attitude, and motivation, which have been mentioned with regard to second language learning.
● This is not to say that these factors do not affect performance on language tests.
The consequential or ethical basis of validity
● Refers to the impact of a test on the test-takers.
● When a teacher determines that the final exam should be conducted over the internet, the consequence is that the test-takers must be prepared to use an internet-based format.
● Otherwise, the students' test results will not be valid, because a test-taker may be hindered by an inability to use the internet.
The consequential or ethical basis of validity
Messick (1980, 1988b) has identified four areas to be considered in the ethical use and interpretation of test results.

● The first consideration is that of construct validity, or the evidence that supports the particular interpretation we wish to make.
● A second area of consideration is that of the value systems that inform the particular test use.
● A third consideration is that of the practical usefulness of the test.
● The fourth area of concern in determining appropriate test use is that of the consequences to the educational system or society of using test results for a particular purpose.
Reliability
Introduction
A fundamental concern in the development and use of language tests is to
identify potential sources of error in a given measure of communicative
language ability and to minimize the effect of these factors on that measure.
We must be concerned about errors of measurement, or unreliability,
because we know that test performance is affected by factors other than the
abilities we want to measure.
For example, we can all think of factors such as poor health, fatigue, lack of
interest or motivation, and test-wiseness, that can affect individuals’ test
performance, but which are not generally associated with language ability,
and thus not characteristics we want to measure with language tests.
Factors that Affect Language Test Scores
Measurement specialists have long recognized that the examination of reliability depends upon our ability to distinguish the effects (on test scores) of the abilities we want to measure from the effects of other factors.

If we wish to estimate how reliable our test scores are, we must begin with a set of definitions of the abilities we want to measure, and of the other factors that we expect to affect test scores (Stanley 1971: 362).

The effects of these various factors on a test score are illustrated on the next slide.
Classical true score measurement theory
When we investigate reliability, it is essential to keep in mind the distinction between unobservable abilities and observed test scores.
The language abilities we are interested in measuring are abstract, and thus we can never directly observe, or know, in any absolute sense, an individual's 'true' score for any ability.

True score and error score

Classical true score (CTS) measurement theory consists of a set of assumptions about the relationships between actual, or observed, test scores and the factors that affect these scores.
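For reference, the conventional CTS decomposition (not written out on the slide) expresses an observed score as the sum of a true score and an error score, with the corresponding split of observed-score variance:

```latex
% Classical true score model: observed score x is the sum of a
% true score x_t and an error score x_e, assumed uncorrelated:
x = x_t + x_e
% so observed-score variance decomposes into true-score and
% error-score variance:
\sigma_x^2 = \sigma_t^2 + \sigma_e^2
```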
Parallel tests
Another concept that is part of CTS theory is that of parallel tests. In order for two tests to be considered parallel, we assume that they are measures of the same ability, that is, that an individual's true score on one test will be the same as his true score on the other.
Reliability as the correlation between parallel
tests
The definitions of true score and error score variance given above are abstract, in the sense
that we cannot actually observe the true and error scores for a given test. These definitions
thus provide no direct means for determining the reliability of observed scores. This is
illustrated in Figure 6.3.
Reliability and measurement error as proportions of observed score variance

Given the means of estimating reliability through computing the correlation between parallel tests,
we can derive a means for estimating the measurement error, as well. If an individual's observed
score on a test is composed of a true score and an error score, the greater the proportion of true
score, the less the proportion of error score, and thus the more reliable the observed score.
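The slide's statement can be written compactly in the conventional CTS notation (added here for reference, not taken from the original): reliability is the proportion of observed-score variance that is true-score variance, and the error proportion is its complement.

```latex
% Reliability of a test x, estimated via its parallel form x':
r_{xx'} = \frac{\sigma_t^2}{\sigma_x^2}
% proportion of observed-score variance attributable to error:
\frac{\sigma_e^2}{\sigma_x^2} = 1 - r_{xx'}
```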

Sources of error and approaches to estimating reliability

In any given test situation, there will probably be more than one source of measurement error. If, for
example, we give several groups of individuals a test of listening comprehension in which they
listen to short dialogues or passages read aloud and then select the correct answer from among four
written choices, we assume that test takers' scores on the test will vary according to their different
levels of listening comprehension ability.
Internal consistency

Internal consistency is concerned with how consistent test takers’ performances on the different
parts of the test are with each other. Inconsistencies in performance on different parts of tests can be
caused by a number of factors, including the test method facets discussed.

Split-half reliability estimates

One approach to examining the internal consistency of a test is the split-half method, in which we
divide the test into two halves and then determine the extent to which scores on these two halves
are consistent with each other.
The Spearman-Brown split-half estimate

Once the test has been split into halves, it is rescored, yielding two scores, one for each half, for each test taker. In one approach to estimating reliability, we then compute the correlation between the two sets of scores. This gives us an estimate of how consistent the halves are; however, we are interested in the reliability of the whole test.
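The step from the half-test correlation to a full-length reliability estimate is usually made with the standard Spearman-Brown correction, shown below for reference (the formula itself is not printed on the slide):

```latex
% Spearman-Brown correction: full-test reliability estimated from the
% correlation r_{hh'} between the two halves of the test:
r_{xx'} = \frac{2\, r_{hh'}}{1 + r_{hh'}}
```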

Rater consistency

In test scores that are obtained subjectively, such as ratings of compositions or oral interviews, a source of error is inconsistency in these ratings. In the case of a single rater, we need to be concerned about the consistency within that individual's ratings, or with intra-rater reliability.
Intra-rater reliability

When an individual judges or rates the adequacy of a given sample of language performance, whether it is written or spoken, that judgment will be based on a set of criteria of what constitutes an 'adequate' performance. If the rater applies the same set of criteria consistently in rating the language performance of different individuals, this will yield a reliable set of ratings.

Stability (test-retest reliability)

As indicated above, for tests such as cloze and dictation we cannot appropriately estimate the internal consistency of the scores because of the interdependence of the parts of the test. There are also testing situations in which it may be necessary to administer a test more than once.
Equivalence (parallel forms reliability)

Another approach to estimating the reliability of a test is to examine the equivalence of scores obtained from alternate forms of a test. Like the test-retest approach, this is an appropriate means of estimating the reliability of tests for which internal consistency estimates are either inappropriate or not possible.

Summary of classical true score approaches to reliability

The three approaches to estimating reliability that have been developed within the CTS measurement model are concerned with different sources of error. The particular approach or approaches that we use will depend on what we believe the sources of error are in our measures, given the particular type of test, administrative procedures, types of test takers, and the use of the test.
Problems with the classical true score model

In many testing situations these apparently straightforward procedures for estimating the effects of different sources of error are complicated by the fact that the different sources of error may interact with each other, even when we carefully design our reliability study. In the previous example, distinguishing lack of equivalence from interviewer inconsistency may be problematic. Suppose we had four sets of questions.

Generalizability theory

A broad model for investigating the relative effects of different sources of variance in test scores has been developed by Cronbach and his colleagues (Cronbach et al. 1963; Gleser et al. 1965; Cronbach et al. 1972). This model, which they call generalizability theory (G-theory), is grounded in the framework of factorial design and the analysis of variance. It constitutes a theory and set of procedures for specifying and estimating the relative effects of different factors on observed test scores, and thus provides a means for relating the uses or interpretations to be made of test scores to the way test users specify and interpret different factors as either abilities or sources of error.
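As a point of reference (this formula is not given on the slides), the summary index typically reported from a G-study is the generalizability coefficient, which plays a role analogous to a CTS reliability coefficient:

```latex
% Generalizability coefficient: universe-score (person) variance as a
% proportion of person variance plus relative error variance:
E\rho^2 = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_\delta^2}
```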
Universes of generalization and universes of measures

When we want to develop or select a test, we generally know the use or uses for which it is intended, and may also have an idea of what abilities we want to measure. In other words, we have in mind a universe of generalization, a domain of uses or abilities (or both) to which we want test scores to generalize.

Populations of persons

In addition to defining the universe of possible measures, we must define the group, or population of persons, about whom we are going to make decisions or inferences. The way in which we define this population will be determined by the degree of generalizability we need for the given testing situation. If we intend to use the test results to make decisions about only one specific group, then that group defines our population of persons.
Universe score

If we could obtain measures for an individual under all the different conditions specified in the universe of possible measures, his average score on these measures might be considered the best indicator of his ability. A universe score x_p is thus defined as the mean of a person's scores on all measures from the universe of possible measures (this universe of possible measures being defined by the facets and conditions of concern for a given test use).

Standard error of measurement: interpreting individual test scores within classical true score and generalizability theory

The approaches to estimating reliability that have been developed within both CTS theory and G-theory are based on group performance, and provide information for test developers and test users about how consistent the scores of groups of individuals are on a given test. However, reliability and generalizability coefficients provide no direct information about the accuracy of individual test scores.
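The quantity that addresses this gap is the standard error of measurement (SEM). The slide does not give the formula, so the standard CTS expression is added here for reference; it lets an individual observed score be interpreted with a confidence band.

```latex
% Standard error of measurement, from the observed-score standard
% deviation s_x and the reliability estimate r_{xx'}:
SEM = s_x \sqrt{1 - r_{xx'}}
% e.g., an approximate 95% confidence band around an observed score x:
x \pm 1.96 \times SEM
```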
Item response theory

A major limitation to CTS theory is that it does not provide a very satisfactory basis
for predicting how a given individual will perform on a given item. There are two
reasons for this. First, CTS theory makes no assumptions about how an individual’s
level of ability affects the way he performs on a test.

The unidimensionality assumption

Item response theory is based on stronger, or more restrictive assumptions than is CTS theory, and
is thus able to make stronger predictions about individuals’ performance on individual items, their
levels of ability, and about the characteristics of individual items. In order to incorporate
information about test takers’ levels of ability, IRT must make an assumption about the number of
abilities being measured.
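As a concrete illustration of how an IRT model links ability to item performance, the widely used two-parameter logistic model is shown below; this is a standard formulation added for reference and is not taken from the slides themselves.

```latex
% Two-parameter logistic (2PL) IRT model: probability that a test taker
% with ability \theta answers item i correctly, given item
% discrimination a_i and difficulty b_i:
P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}
```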
Questions from the Audience

● Rahma Kamanda Sari: What are some issues that could affect the validity of assessment?
● Nabilah Rachmadhani: What is meant by "budgetary limit" in practicality? Is test preparation also included in the "budgetary limit," or only the test itself?
● Lathifa Azhari: Can you give us some examples of real-world tasks in English language teaching at school?
● Risnanda: Can you please explain more about the examples of macro- and micro-level aspects?
● Nindi Oktriyani: If an assessment does not meet the five principles, can it be said that the assessment failed, or should there be a reassessment?
THANK
YOU!
