Testing, Assessing, Evaluating, and Teaching: Evaluation and Measurement
The terms measurement, test, and evaluation are often used synonymously, but they are not the same thing, and it is important to understand the distinctions among them. The figure below illustrates how they are related.
[Figure: Venn diagram showing the overlap of evaluation, measurement, and tests, with five numbered areas.]
From the figure above, we can identify five areas that explain the distinctions among the three terms:
Area 1: Evaluation that does not involve either tests or measurement.
e.g.: The use of qualitative descriptions of students' performance for diagnosing learning problems.
Area 2: A non-test measurement used for evaluation.
e.g.: A teacher's ranking used for assigning grades.
Area 3: A test used for purposes of evaluation.
e.g.: The use of an achievement test to determine students' progress.
Area 4: Non-evaluative uses of tests and measurement.
e.g.: The use of a proficiency test as a criterion in second language acquisition research.
Area 5: A non-test measurement that is not used for evaluation.
e.g.: Assigning code numbers to subjects in second language research according to native language.
In everyday use, the word test often carries connotations of anxiety, unpleasantness, or self-doubt. According to Brown (2004), a test is a method of measuring a person's ability, knowledge, or performance in a given domain. As a method, a test must be explicit and structured, such as multiple-choice questions accompanied by prescribed correct answers. Some tests measure general ability, while others focus on very specific competencies or objectives. Most language tests measure one's ability to perform language, that is, to speak, write, read, or listen to a subset of language.
Kinds of tests
There are many kinds of tests, each with a specific purpose and a particular criterion to be measured (Brown, 2007). Below you will find descriptions of five test types that are in common use in language curricula.
Proficiency Tests
A proficiency test is not intended to be limited to any one course, curriculum, or single
skill in the language. Proficiency tests have traditionally consisted of standardized
multiple-choice items on grammar, vocabulary, reading comprehension, oral
comprehension, and sometimes a sample of writing. Typical examples of standardized
proficiency tests are the Test of English as a Foreign Language (TOEFL) and the
International English Language Testing System (IELTS).
Diagnostic Tests
A diagnostic test is designed to diagnose a particular aspect of a language. A diagnostic
test in pronunciation might have the purpose of determining which phonological features
of English are difficult for a learner and should therefore become a part of the curriculum.
Placement Tests
A placement test is used to place a student into an appropriate level or section of a language curriculum or school.
Achievement Tests
An achievement test is related directly to classroom lessons, units, or even a total curriculum. It is limited to particular material covered in the curriculum within a particular time frame and is offered after a course has covered the objectives in question.
Aptitude Tests
A language aptitude test is designed to measure a person’s capacity or general ability to
learn a foreign language and to be successful in that undertaking. Aptitude tests are
considered to be independent of a particular language. Two standardized aptitude tests
were once in popular use – the Modern Language Aptitude Test (MLAT) and the
Pimsleur Language Aptitude Battery (PLAB). Both are English language tests and
require students to perform such tasks as memorizing numbers and vocabulary, listening
to foreign words, and detecting spelling clues and grammatical patterns.
MEASUREMENT
Bachman (1990) states that measurement (in the social sciences) is the process of quantifying the characteristics of persons according to explicit procedures and rules. This definition includes three distinguishing features: quantification, characteristics, and explicit rules and procedures. Quantification involves the assigning of numbers. Characteristics can be physical or mental. In testing, we are almost always interested in quantifying mental attributes and abilities. Mental attributes include aptitude, intelligence, motivation, field dependence/independence, attitude, native language, fluency in speaking, and achievement in reading, while abilities refer to performance on a set of mental tasks. The third distinguishing characteristic of measurement is that quantification must be done according to explicit rules and procedures.
Measurement specialists have defined four types of measurement scales: nominal, ordinal, interval, and ratio. A nominal scale comprises numbers that are used to name the classes or categories of a given attribute. An ordinal scale comprises the numbering of different levels of an attribute that are ordered with respect to each other. An interval scale is a numbering of different levels in which the distances, or intervals, between the levels are equal. A ratio scale additionally has an absolute zero point, so that comparisons between levels can be made in terms of ratios.
Properties of the four types of measurement scales:

Property                Nominal   Ordinal   Interval   Ratio
Distinctiveness            +         +         +         +
Ordering                   -         +         +         +
Equal intervals            -         -         +         +
Absolute zero point        -         -         -         +
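To make these scale types concrete, here is a small Python sketch (illustrative only; the students, category codes, and scores are invented and not from the source) showing how language-test data at each scale level might be coded:

# Hypothetical examples of the four measurement scales applied to language-test data.

# Nominal: numbers merely name categories (here, each test taker's native language).
native_language = {"student_01": 1, "student_02": 2}  # 1 = Indonesian, 2 = Thai

# Ordinal: numbers order levels, but the distances between levels are not equal
# (a teacher's ranking of speaking performance).
speaking_rank = {"student_01": 1, "student_02": 2}  # 1st, 2nd, ...

# Interval: equal distances between levels, but no absolute zero point
# (a standardized score scale; a score of 0 does not mean "no ability").
scaled_score = {"student_01": 520, "student_02": 560}

# Ratio: equal intervals plus a true zero point, so ratios are meaningful
# (number of words read correctly in one minute).
words_per_minute = {"student_01": 80, "student_02": 160}

# Only with a ratio scale can we meaningfully say one value is "twice" another:
print(words_per_minute["student_02"] / words_per_minute["student_01"])  # -> 2.0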
As test developers and test users, we all sincerely want our tests to be the best measures
possible. In order to measure a given language ability, we must be able to specify what it is, and
this specification is generally at two levels. First, at the theoretical level, we need to specify the ability in relation to, or in contrast to, other language abilities and other factors that may affect test performance. Second, at the operational level, we need to specify the instances of language performance that we are willing to interpret as indicators, or tokens, of the ability we wish to
measure. In addition to the limitations related to the underspecification of factors that affect test
performance, there are characteristics of the processes of observation and quantification that
limit our interpretations of test results. These derive from the fact that all measures of mental
ability are necessarily indirect, incomplete, imprecise, subjective, and relative.
The limitations discussed above restrict our ability to make such inferences. A major
concern of language test development, therefore, is to minimize the effects of these limitations.
To accomplish this, the development of language tests needs to be based on a logical sequence of procedures linking the putative ability, or construct, to the observed performance. This sequence includes three steps: (1) identifying and defining the construct theoretically; (2) defining the construct operationally; and (3) establishing procedures for quantifying observations (Thorndike and Hagen, 1977, in Bachman, 1990).
Those general steps in measurement provide a framework both for the development of
language tests and for the interpretation of language test results, in that they provide the essential
linkage between the unobservable language ability or construct we are interested in measuring
and the observation of performance, or the behavioral manifestation, of that construct in the form
of a test score. As an example of the application of these steps to language test development,
consider how a construct such as pragmatic competence would first be defined theoretically. The steps in measurement discussed above also relate to virtually every concern regarding the interpretation of test results:
defining the construct theoretically provides the basis for evaluating the validity of the uses of test scores;
defining the construct operationally is also related to test validity, in that the observed relationships among different measures of the same theoretical construct provide the basis for investigating concurrent relatedness.
EVALUATION
Evaluation can be defined as the systematic gathering of information for the purpose of
making decisions. The probability of making the correct decision in any given situation is a
function not only of the ability of the decision maker, but also of the quality of the information
upon which the decision is based. Evaluation does not necessarily entail testing. It is only when
the results of tests are used as the basis for making a decision that evaluation is involved.
Definitions of the terms testing, assessing, and teaching
Before differentiating these three terms, consider the figure below:
[Figure: concentric circles showing tests as a subset of assessment, and assessment as a subset of teaching.]
The figure above shows that tests are a subset of assessment, and assessment itself is a subset of teaching. Tests can be useful devices, but they are only one among many procedures and tasks that teachers can ultimately use to assess students in teaching and learning activities. Teaching sets up the practice games of language learning: the opportunities for learners to listen, think, take risks, set goals, and process feedback from teachers, and then to recycle through the skills that they are trying to master.
Another distinction concerns the function of an assessment. Two functions are commonly identified in the literature: formative and summative assessment. Most of our classroom assessment is formative assessment: evaluating students in the process of forming their competences and skills with the goal of helping them to continue that growth process. Summative assessment, by contrast, aims to measure, or summarize, what a student has grasped, and typically occurs at the end of a course or unit of instruction. A summation of what a student has
learned implies looking back and taking stock of how well that student has accomplished
objectives, but does not necessarily point the way to future progress. Final exams in a course and
general proficiency exams are examples of summative assessment.
From a historical perspective, two major approaches to language testing were debated in the 1970s and early 1980s, and they still prevail today, even if in mutated form: discrete-point and integrative testing. Discrete-point tests are constructed on the assumption that language can be broken down into its component parts and that those parts can be tested successfully. These components are the skills of listening, speaking, reading, and writing, and various units of language (discrete points) of phonology/graphology, morphology, lexicon, syntax, and discourse. Integrative testing, by contrast, treats language competence as a unified set of interacting abilities that cannot be tested separately. Two types of tests have historically been claimed to be examples
of integrative tests: cloze tests and dictation. A cloze test is a reading passage (perhaps 150 to
300 words) in which roughly every sixth or seventh word has been deleted; the test taker is
required to supply words that fit into those blanks. Dictation is a familiar language-teaching technique that evolved into a testing technique. Supporters argue that dictation is an integrative
test because it taps into grammatical and discourse competencies required for other modes of
performance in a language. Success on a dictation requires careful listening, reproduction in
writing of what is heard, efficient short-term memory, and to an extent, some expectancy rules to
aid the short-term memory.
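As an illustration of the cloze procedure described above, the following Python sketch (not taken from the source; the passage and deletion interval are illustrative) deletes roughly every seventh word of a passage and keeps the deleted words as an answer key:

def make_cloze(passage: str, nth: int = 7) -> tuple[str, list[str]]:
    """Delete every nth word of a passage, returning the gapped text and the answer key."""
    words = passage.split()
    answers = []
    for i in range(nth - 1, len(words), nth):
        answers.append(words[i])
        words[i] = "_____"
    return " ".join(words), answers

# Example with a short illustrative passage (real cloze passages run about 150-300 words).
text = ("Language testing has changed considerably over the last few decades as "
        "teachers and researchers have looked for tasks that reflect real world use.")
gapped, key = make_cloze(text, nth=7)
print(gapped)
print(key)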
Whether focusing on testing or assessing, a finite number of principles can be named that
serve as guidelines for the design of a new test or assessment and for evaluating the efficacy of
an existing procedure. The term test is used as a generic term for both test and formal
assessment, since all the principles apply to both (Brown, 2007). There are five basic principles
for designing effective tests and assessments:
Practicality
A practical test stays within financial limitations and time constraints, and is relatively easy to administer, score, and interpret.
Reliability
A reliable test is consistent and dependable (one simple way to quantify this consistency is sketched after this list of principles).
Validity (content, face, and construct)
The validity of a test deals with the degree to which the test actually measures what it is intended to measure.
Authenticity
In a test, authenticity may be presented in the following ways:
The language used in the test is as natural as possible
Items are contextualized
Topics and situations are interesting, enjoyable, and humorous
Some thematic organization is provided, such as through a story line
Tasks represent, or closely approximate, real-world tasks
Washback
The feedback should wash back to students in the form of useful diagnoses of strengths and weaknesses.
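The Reliability principle above can be made more concrete with a small illustration. The following Python sketch (not from the source; the scores are invented) quantifies test-retest reliability as the Pearson correlation between two administrations of the same test:

# Minimal sketch: test-retest reliability as the Pearson correlation between two
# administrations of the same test. The scores below are invented for illustration.
from statistics import correlation  # requires Python 3.10+

first_sitting = [55, 62, 70, 48, 80, 66]
second_sitting = [58, 60, 73, 50, 78, 68]

reliability = correlation(first_sitting, second_sitting)
print("Estimated test-retest reliability:", round(reliability, 2))  # values near 1.0 indicate consistency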
Current issues in classroom testing
By the mid 1980s, the language-testing field had abandoned arguments about the unitary
trait hypothesis and had begun to focus on designing communicative language-testing tasks.
Communicative testing presented challenges to test designers: test constructors began to identify the kinds of real-world tasks that language learners were called upon to perform. Weir (1990), cited in Brown (2004), reminded his readers that “to measure language proficiency ... account must now be taken of: where, when, how, with whom, and why language is to be used, and on what topics, and with what effect.” The assessment field thus became more and more concerned with the
authenticity of tasks and the genuineness of texts.
These developments coincided with broader conceptions of intelligence, which went beyond traditional verbal and logical-mathematical abilities to include:
interpersonal intelligence
intrapersonal intelligence
spatial intelligence
musical intelligence
bodily-kinesthetic intelligence
contextual intelligence
emotional intelligence
These new conceptualizations of intelligence have not been universally accepted by the
academic community. Nevertheless, their intuitive appeal infused the decade of the 1990s with a
sense of both freedom and responsibility in our testing agenda. Coupled with parallel educational
reforms, they helped to free us from relying exclusively on timed, discrete-point, analytical tests
in measuring language. We were prodded to cautiously combat the potential tyranny of
“objectivity” and its accompanying impersonal approach. But we also assumed the responsibility
for tapping into whole language skills, learning processes, and the ability to negotiate meaning.
Recent years have seen a burgeoning of assessment in which the test-taker performs
responses on a computer. Some computer-based tests (also known as “computer-assisted” or “web-based” tests) are small-scale “home-grown” tests available on websites. Others are
standardized, large-scale tests in which thousands or even tens of thousands of test-takers are
involved. Students receive prompts (or probes, as they are sometimes referred to) in the form of
spoken or written stimuli from the computerized test and are required to type (or in some
cases, speak) their responses. Almost all computer-based test items have fixed, closed-ended responses. Computer-based testing, with or without computer-adaptive testing (CAT) technology, offers these advantages:
classroom-based testing
self-directed testing on various aspects of a language (vocabulary, grammar, discourse, one or all of the four skills, etc.)
practice for upcoming high-stakes standardized tests
some individualization, in the case of CATs (a simplified sketch of how a CAT adapts to a test-taker's responses appears after the list of disadvantages below)
large-scale standardized tests that can be administered easily to thousands of test-takers at many different stations, then scored electronically for rapid reporting of results.
Of course, some disadvantages are present in our current predilection for computerizing testing. Among them:
Lack of security and the possibility of cheating are inherent in classroom-based, unsupervised computerized tests.
Occasional “home-grown” quizzes that appear on unofficial websites may be mistaken for validated assessments.
The multiple-choice format preferred for most computer-based tests contains the usual potential for flawed item design.
Open-ended responses are less likely to appear because of the need for human scorers, with all the attendant issues of cost, reliability, and turn-around time.
The human interactive element (especially in oral production) is absent.
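To make the individualization of CATs mentioned above more concrete, the following Python sketch is entirely illustrative: the item bank, difficulty levels, and simple up/down selection rule are invented and do not represent any real testing system. It shows only the basic adaptive loop, in which the ability estimate moves up or down after each response and the next item is chosen to match it.

# Highly simplified computer-adaptive testing loop (illustrative only).
# Items are tagged with a difficulty level from 1 (easy) to 5 (hard); after each
# response the ability estimate moves up or down, and the next item matches it.

item_bank = {
    1: "Choose the correct article: ___ apple.",
    2: "Select the correct past tense form of 'go'.",
    3: "Pick the sentence with correct word order.",
    4: "Choose the best paraphrase of the given sentence.",
    5: "Identify the writer's implied attitude in the passage.",
}

def run_cat(responses_correct, start_level=3):
    """Return the final ability estimate after a sequence of responses."""
    level = start_level
    for correct in responses_correct:
        print("Presenting item at level", level, ":", item_bank[level])
        level = min(5, level + 1) if correct else max(1, level - 1)
    return level

# Example run: correct, correct, wrong, correct -> items at levels 3, 4, 5, 4; final estimate 5.
print("Final estimated level:", run_cat([True, True, False, True]))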
REFERENCES
Brown, H. D. 2004. Language Assessment: Principles and Classroom Practices. White Plains, NY: Pearson Education.
Presented in:
Advanced Assessment in English Language Teaching
(Class discussion)
Lecturer:
Fachrurrazy, MA, PhD