
TESTING, ASSESSING, EVALUATING, AND TEACHING

Definition of terms: measurement, test, and evaluation

The terms measurement, test, and evaluation are often used synonymously, but they do not
mean the same thing, and it is important to understand the distinctions among them. Consider
the figure below:

[Figure: a diagram of the three overlapping terms Measurement, Test, and Evaluation, divided
into five numbered areas (1-5).]
The figure above can be divided into five areas that explain the distinctions among the three
terms:
 Area 1 : Evaluation that does not involve either tests or measurement.
e.g.: the use of qualitative descriptions of student performance for diagnosing
learning problems.
 Area 2 : A non-test measurement used for evaluation.
e.g.: a teacher's ranking used for assigning grades.
 Area 3 : A test used for purposes of evaluation.
e.g.: the use of an achievement test to determine students' progress.
 Area 4 : Non-evaluative uses of tests and measurement.
e.g.: the use of a proficiency test as a criterion in second language acquisition
research.
 Area 5 : A non-test measurement that is not used for evaluation.
e.g.: assigning code numbers to subjects in second language research according to
native language.

From the distinction above, it is clear that:

 Not all measurements are tests

 Not all tests are evaluative

 Not all evaluation involves either measurement or tests


TESTS

In everyday use, the word test often connotes something unpleasant, evoking feelings of
anxiety or self-doubt. According to Brown (2004), a test is a method of measuring a person's
ability, knowledge, or performance in a given domain. As a method, a test must be explicit and
structured, such as multiple-choice questions accompanied by prescribed correct answers. Some
tests measure general ability, while others focus on very specific competences or objectives.
Most language tests measure one's ability to perform language, that is, to speak, write, read,
or listen to a subset of language.

According to Bachman (1990), a test is a measurement instrument designed to elicit a
specific sample of an individual's behavior. From this definition, it is clear that a test is
one type of measurement and that there are many other kinds of instruments for measuring
something. The distinction between test and measurement is thus very clear, and the two terms
cannot be substituted for one another.

Kinds of tests

There are many kinds of tests, each with a specific purpose and a particular criterion to be
measured (Brown, 2007). Below are descriptions of five test types in common use in language
curricula.
 Proficiency Tests
A proficiency test is not intended to be limited to any one course, curriculum, or single
skill in the language. Proficiency tests have traditionally consisted of standardized
multiple-choice items on grammar, vocabulary, reading comprehension, oral
comprehension, and sometimes a sample of writing. Typical examples of standardized
proficiency tests are the Test of English as a Foreign Language (TOEFL) and the
International English Language Testing System (IELTS).
 Diagnostic Tests
A diagnostic test is designed to diagnose a particular aspect of a language. A diagnostic
test in pronunciation, for instance, might have the purpose of determining which phonological
features of English are difficult for a learner and should therefore become part of the
curriculum.
 Placement Tests
A placement test is intended to place a student into an appropriate level or section of a
language curriculum or school.
 Achievement Tests
An achievement test is related directly to classroom lessons, units, or even a total
curriculum. It is limited to the particular material covered in a curriculum within a
particular time frame and is offered after a course has covered the objectives in
question.
 Aptitude Tests
A language aptitude test is designed to measure a person’s capacity or general ability to
learn a foreign language and to be successful in that undertaking. Aptitude tests are
considered to be independent of a particular language. Two standardized aptitude tests
were once in popular use – the Modern Language Aptitude Test (MLAT) and the
Pimsleur Language Aptitude Battery (PLAB). Both are English language tests and
require students to perform such tasks as memorizing numbers and vocabulary, listening
to foreign words, and detecting spelling clues and grammatical patterns.
MEASUREMENT

Bachman (1990) states that measurement (in the social sciences) is the process of
quantifying the characteristics of persons according to explicit procedures and rules. This
definition includes three distinguishing features: quantification, characteristics, and explicit
rules and procedures. Quantification involves the assigning of numbers. Characteristics can be
physical or mental. In testing, we are almost always interested in quantifying mental attributes
and abilities. Mental attributes include aptitude, intelligence, motivation, field
dependence/independence, attitude, native language, fluency in speaking, and achievement in
reading, while abilities refer to performance on a set of mental tasks. The third distinguishing
feature of measurement is that quantification must be done according to explicit rules and
procedures.
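
To make these three features concrete, here is a minimal Python sketch, offered as an
illustration only (the answer key and responses are invented): quantification according to an
explicit rule, where each response is compared with a prescribed key and the number of matches
becomes the number assigned to the test taker.

    # Minimal sketch: measurement as quantification by an explicit rule.
    # The answer key and responses are hypothetical examples.
    ANSWER_KEY = {"item1": "b", "item2": "d", "item3": "a"}  # the explicit rule

    def score_test(responses):
        """Assign a number: one point for each response matching the key."""
        return sum(1 for item, key in ANSWER_KEY.items()
                   if responses.get(item) == key)

    print(score_test({"item1": "b", "item2": "c", "item3": "a"}))  # prints 2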

If we are to interpret the score on a given test as an indicator of an individual's ability,
that score must be both reliable and valid. Reliability has to do with the consistency of
measures across different times, test forms, raters, and other characteristics of the
measurement context.
The primary concerns in examining the reliability of test scores are:
1. to identify the different sources of error, and
2. to use appropriate empirical procedures for estimating the effect of these sources of
error on test scores.
Validity, meanwhile, refers to the extent to which the inferences or decisions we make on the
basis of test scores are meaningful, appropriate, and useful (American Psychological
Association, 1985, cited in Bachman, 1990). In examining validity, we must also be concerned
with the appropriateness and usefulness of the test score for a given purpose. Reliability and
validity are both essential to the use of tests. Reliability is a quality of test scores
themselves, while validity is a quality of test interpretation and use. Therefore, a test score
that is not reliable cannot be valid.
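
The empirical procedures for estimating error are not specified above; as one common
possibility, the following minimal Python sketch computes Cronbach's alpha, a standard
internal-consistency reliability estimate (the score matrix is invented for illustration).

    # Minimal sketch of one reliability estimate, Cronbach's alpha.
    # Rows are test takers, columns are items (1 = correct, 0 = incorrect);
    # the matrix is a hypothetical example.
    from statistics import pvariance

    scores = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [0, 1, 1, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
    ]

    k = len(scores[0])                                    # number of items
    item_vars = [pvariance(col) for col in zip(*scores)]  # per-item variance
    total_var = pvariance([sum(row) for row in scores])   # total-score variance

    alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
    print(f"Cronbach's alpha = {alpha:.2f}")              # prints 0.41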

Measurement specialists have defined four types of measurement scales: nominal, ordinal,
interval, and ratio. A nominal scale comprises numbers that are used to name the classes or
categories of a given attribute. An ordinal scale comprises the numbering of different levels
of an attribute that are ordered with respect to each other. An interval scale is a numbering
of different levels in which the distances, or intervals, between the levels are equal. A ratio
scale additionally has an absolute zero point, so that comparisons in terms of ratios become
meaningful.
                            Type of scale
Property                Nominal   Ordinal   Interval   Ratio
Distinctiveness            +         +         +         +
Ordering                   -         +         +         +
Equal intervals            -         -         +         +
Absolute zero point        -         -         -         +
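
As an informal illustration of the table (the attributes and values below are hypothetical),
this Python sketch shows which comparisons each scale type supports:

    # Minimal sketch: the comparisons each scale type supports.
    # Attribute names and values are hypothetical examples.
    native_language = {"s1": 1, "s2": 2}    # nominal: numbers only name categories
    class_rank      = {"s1": 1, "s2": 3}    # ordinal: order matters, gaps do not
    test_score      = {"s1": 40, "s2": 80}  # interval: equal gaps, no true zero
    words_per_min   = {"s1": 40, "s2": 80}  # ratio: true zero, ratios meaningful

    print(native_language["s1"] == native_language["s2"])  # nominal: equality only
    print(class_rank["s1"] < class_rank["s2"])             # ordinal: ordering
    print(test_score["s2"] - test_score["s1"])             # interval: differences
    print(words_per_min["s2"] / words_per_min["s1"])       # ratio: "twice as fast"

Note that "twice as fast" is meaningful for words per minute, but "twice the score" is not
meaningful for the interval-scaled test score, because its zero point is arbitrary.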

As test developers and test users, we all sincerely want our tests to be the best measures
possible. In order to measure a given language ability, we must be able to specify what it is,
and this specification generally occurs at two levels. First, at the theoretical level, we need
to specify the ability in relation to, and in contrast with, other language abilities and other
factors that may affect test performance. Second, at the operational level, we need to specify
the instances of language performance that we are willing to interpret as indicators, or
tokens, of the ability we wish to measure. In addition to the limitations related to the
underspecification of factors that affect test performance, there are characteristics of the
processes of observation and quantification that limit our interpretations of test results.
These derive from the fact that all measures of mental ability are necessarily indirect,
incomplete, imprecise, subjective, and relative.

The limitations discussed above restrict our ability to make such inferences. A major
concern of language test development, therefore, is to minimize the effects of these
limitations. To accomplish this, the development of language tests needs to be based on a
logical sequence of procedures linking the putative ability, or construct, to the observed
performance. This sequence includes three steps: (1) identifying and defining the construct
theoretically, (2) defining the construct operationally, and (3) establishing procedures for
quantifying observations (Thorndike and Hagen, 1977, cited in Bachman, 1990).

Those general steps in measurement provide a framework both for the development of
language tests and for the interpretation of language test results, in that they provide the
essential linkage between the unobservable language ability, or construct, we are interested
in measuring and the observation of performance, or the behavioral manifestation, of that
construct in the form of a test score. As an example of the application of these steps to
language test development, consider a theoretical definition of pragmatic competence such as
the one Bachman (1990) presents. The steps in measurement also relate to virtually all
concerns regarding the interpretation of test results:

 Defining the construct theoretically provides the basis for evaluating the validity of
the uses of test scores.

 Defining the construct operationally is also related to test validity, in that the
observed relationships among different measures of the same theoretical construct
provide the basis for investigating concurrent relatedness.

 Establishing procedures for quantifying observations is directly related to
reliability, in that the precision of the scales we use and the consistency with
which they are applied across different test administrations, test forms, scorers,
and groups of test takers will affect the results of tests.

EVALUATION

Evaluation can be defined as the systematic gathering of information for the purpose of
making decisions. The probability of making the correct decision in any given situation is a
function not only of the ability of the decision maker, but also of the quality of the information
upon which the decision is based. Evaluation does not necessarily entail testing. It is only when
the results of tests are used as the basis for making a decision that evaluation is involved.
Definition of terms: testing, assessing, and teaching

Before differentiating these three terms, consider the figure below:

[Figure: three nested circles, with TESTS inside ASSESSMENT inside TEACHING.]

The figure shows that tests are a subset of assessment, and assessment itself is a subset
of teaching. Tests can be useful devices, but they are only one among many procedures and
tasks that teachers can ultimately use to assess students in teaching and learning activities.
Teaching sets up the practice games of language learning: the opportunities for learners to
listen, think, take risks, set goals, and process feedback from teachers, and then recycle
through the skills that they are trying to master.

One way of distinguishing among tests, assessment, and teaching is to distinguish between
informal and formal assessment. Informal assessment can take a number of forms, starting with
incidental, unplanned comments and responses, along with coaching and other impromptu feedback
to the students. Formal assessments, on the other hand, are exercises or procedures
specifically designed to tap into a storehouse of skills and knowledge. They are systematic,
planned sampling techniques constructed to give teacher and student an appraisal of student
achievement. We can therefore say that all tests are formal assessments, but not all formal
assessment is testing.

Another distinction deals with the function of an assessment. Two functions are commonly
identified in the literature: formative and summative assessment. Most classroom assessment is
formative assessment: evaluating students in the process of forming their competences and
skills, with the goal of helping them to continue that growth process. Summative assessment,
in contrast, aims to measure, or summarize, what a student has grasped, and typically occurs
at the end of a course or unit of instruction. A summation of what a student has learned
implies looking back and taking stock of how well that student has accomplished objectives,
but does not necessarily point the way to future progress. Final exams in a course and general
proficiency exams are examples of summative assessment.
From a historical perspective, two major approaches to language testing were debated in the
1970s and early 1980s and still prevail today, even in mutated form: discrete-point and
integrative testing. Discrete-point tests are constructed on the assumption that language can
be broken down into its component parts and that those parts can be tested successfully. These
components are the skills of listening, speaking, reading, and writing, and various units of
language (discrete points) of phonology/graphology, morphology, lexicon, syntax, and
discourse. Two types of tests have historically been claimed to be examples of integrative
tests: cloze tests and dictation. A cloze test is a reading passage (perhaps 150 to 300 words)
in which roughly every sixth or seventh word has been deleted; the test taker is required to
supply words that fit into those blanks. Dictation is a familiar language-teaching technique
that evolved into a testing technique. Supporters argue that dictation is an integrative test
because it taps into the grammatical and discourse competencies required for other modes of
performance in a language. Success on a dictation requires careful listening, reproduction in
writing of what is heard, efficient short-term memory, and, to an extent, some expectancy
rules to aid the short-term memory.
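
The description of cloze-test construction above translates directly into a procedure. Here is
a minimal Python sketch, with the passage and the deletion interval chosen arbitrarily for
illustration:

    # Minimal sketch of cloze-test construction: delete roughly every nth
    # word from a passage. The passage below is invented for illustration.
    def make_cloze(passage, n=7):
        """Replace every nth word with a blank; return the test and the answers."""
        words = passage.split()
        deleted = []
        for i in range(n - 1, len(words), n):
            deleted.append(words[i])
            words[i] = "______"
        return " ".join(words), deleted

    passage = ("Language testing involves gathering evidence about what a learner "
               "can do with the language and interpreting that evidence carefully.")
    test, answers = make_cloze(passage, n=7)
    print(test)     # the passage with blanks
    print(answers)  # the words the test taker must supply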

Principles of language assessment

Whether focusing on testing or assessing, a finite number of principles can be named that
serve as guidelines for designing a new test or assessment and for evaluating the efficacy of
an existing procedure. The term test is used here as a generic term for both tests and formal
assessments, since all the principles apply to both (Brown, 2007). There are five basic
principles for designing effective tests and assessments:
 Practicality
A practical test stays within the means of financial limitations and time constraints,
and is easy to administer, score, and interpret.
 Reliability
A reliable test is consistent and dependable.
 Validity (content, face, and construct)
The validity of a test deals with the degree to which the test actually measures what it
is intended to measure.
 Authenticity
In a test, authenticity may be achieved in the following ways:
 The language used in the test is as natural as possible.
 Items are contextualized.
 Topics and situations are interesting, enjoyable, and humorous.
 Some thematic organization is provided for items, such as through a story line.
 Tasks represent real-world tasks.
 Washback
Feedback should wash back to students in the form of useful diagnoses of strengths
and weaknesses.
Current issues in classroom testing

By the mid-1980s, the language-testing field had abandoned arguments about the unitary
trait hypothesis and had begun to focus on designing communicative language-testing tasks.
Communicative testing presented challenges to test designers, as test constructors began to
identify the kinds of real-world tasks that language learners were called upon to perform.
Weir (1990), cited in Brown (2004), reminded his readers that “to measure language proficiency
... account must now be taken of: where, when, how, with whom, and why language is to be used,
and on what topics, and with what effect.” The assessment field thus became more and more
concerned with the authenticity of tasks and the genuineness of texts.

Instead of just offering paper-and-pencil selective-response tests of a plethora of separate
items, performance-based assessment of language typically involves oral production, written
production, open-ended responses, integrated performance (across skill areas), group
performance, and other interactive tasks. Such assessment is time-consuming and therefore
expensive, but those extra efforts pay off in the form of more direct testing, because
students are assessed as they perform actual or simulated real-world tasks. In technical
terms, higher content validity is achieved because learners are measured in the process of
performing the targeted linguistic acts.
In an English language-teaching context, performance-based assessment means that you may
have a difficult time distinguishing between formal and informal assessment. If you rely a little
less on formally structured tests and a little more on evaluation while students are performing
various tasks, you will be taking some steps toward meeting the goals of performance-based
testing.

The design of communicative, performance-based assessment rubrics continues to challenge
both assessment experts and classroom teachers. Such efforts to improve various facets of
classroom testing are accompanied by some stimulating issues, all of which are helping to
shape our current understanding of effective assessment. Intelligence was once viewed strictly
as the ability to perform (a) linguistic and (b) logical-mathematical problem solving. This
“IQ” concept of intelligence has permeated the Western world and its way of testing for almost
a century. However, research on intelligence by psychologists like Howard Gardner, Robert
Sternberg, and Daniel Goleman has begun to turn the psychometric world upside down. Standard
theories of intelligence, on which standardized IQ (and other) tests are based, were expanded
to include seven different components, among others (Brown, 2007). The seven components are:

 interpersonal intelligence
 intrapersonal intelligence
 spatial intelligence
 musical intelligence
 bodily-kinesthetic intelligence
 contextual intelligence
 emotional intelligence

These new conceptualizations of intelligence have not been universally accepted by the
academic community. Nevertheless, their intuitive appeal infused the decade of the 1990s with a
sense of both freedom and responsibility in our testing agenda. Coupled with parallel educational
reforms, they helped to free us from relying exclusively on timed, discrete-point, analytical tests
in measuring language. We were prodded to cautiously combat the potential tyranny of
“objectivity” and its accompanyingimpersonal approach. But, we also assumed the responsibility
for tapping into whole language skills, learning processes, and the ability to negotiate meaning.

Recent years have seen a burgeoning of assessment in which the test taker performs
responses on a computer. Some computer-based tests (also known as “computer-assisted” or
“web-based” tests) are small-scale, “home-grown” tests available on websites. Others are
standardized, large-scale tests in which thousands or even tens of thousands of test takers
are involved. Students receive prompts (or probes, as they are sometimes called) in the form
of spoken or written stimuli from the computerized test and are required to type (or, in some
cases, speak) their responses. Almost all computer-based test items have fixed, closed-ended
responses. Computer-based testing, with or without computer-adaptive testing (CAT) technology,
offers these advantages:

 classroom-based testing
 self-directed testing on various aspects of a language (vocabulary, grammar,
discourse, one or all of the four skills, etc.)
 practice for upcoming high-stakes standardized tests
 some individualization, in the case of CATs (a sketch of adaptive item selection
follows this list)
 large-scale standardized tests that can be administered easily to thousands of
test takers at many different stations, then scored electronically for rapid
reporting of results.
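
As promised above, here is a deliberately simplified Python sketch of the individualization a
CAT can offer: serve the item whose difficulty lies closest to the current ability estimate,
then nudge the estimate according to the response. The item bank, the responses, and the fixed
step size are all invented; operational CATs estimate ability with item response theory rather
than a fixed step.

    # Simplified sketch of computer-adaptive item selection. Item difficulties,
    # responses, and the fixed update step are hypothetical; real CATs use
    # item response theory to select items and estimate ability.
    item_bank = {"i1": -1.0, "i2": -0.5, "i3": 0.0, "i4": 0.5, "i5": 1.0}
    responses = {"i1": True, "i2": True, "i3": True, "i4": False, "i5": False}

    ability, unused, step = 0.0, set(item_bank), 0.4
    for _ in range(3):
        item = min(unused, key=lambda i: abs(item_bank[i] - ability))  # closest item
        unused.remove(item)
        ability += step if responses[item] else -step                  # crude update
        print(f"served {item}; ability estimate now {ability:+.1f}")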
Of course, some disadvantages are present in our current predilection for computerized
testing. Among them:
 Lack of security and the possibility of cheating are inherent in classroom-based,
unsupervised computerized tests.
 Occasional “home-grown” quizzes that appear on unofficial websites may be
mistaken for validated assessments.
 The multiple-choice format preferred for most computer-based tests contains the
usual potential for flawed item design.
 Open-ended responses are less likely to appear because of the need for human
scorers, with all the attendant issues of cost, reliability, and turnaround time.
 The human interactive element (especially in oral production) is absent.
REFERENCES

Bachman, L.F. 1990. Fundamental considerations in language testing. Oxford: Oxford
University Press.

Brown, H.D. 2004. Language assessment: Principles and classroom practices. White Plains,
NY: Pearson Education.

Brown, H.D. 2007. Teaching by principles: An interactive approach to language pedagogy.
White Plains, NY: Pearson Education.
TESTING, ASSESSING, EVALUATING, AND TEACHING

Presented in:
Advanced Assessment in English Language Teaching
(Class discussion)

Lecturer:
Fachrurrazy, MA, PhD

GRADUATE PROGRAM IN ENGLISH LANGUAGE TEACHING


STATE UNIVERSITY OF MALANG
FEBRUARY 2013
