Mid-Term Scope DPE 104
As teachers, we are continually faced with the challenge of assessing the progress of our
students as well as our own effectiveness as teachers. Assessment decisions could substantially
improve student performance, guide the teachers in enhancing the teaching-learning process and
assist policy makers in improving the educational system. At the same time, however, poor
assessment procedures could adversely affect the students, teachers and administrators. Assessment
of learning is a tricky business, indeed, for it requires measuring concepts, ideas and abstract
constructs, quite unlike the assessment of physical quantities, which can be done with an appropriate degree of accuracy. In assessment of learning, we deal with intangibles and attempt to characterize
them in a manner that would be widely understood.
Not too long ago, assessment of learning was confined to techniques and procedures for
determining whether or not cognitive knowledge (memorization of facts and theories) was
successfully acquired. Thus, assessment was essentially confined to pencil-paper testing of the
cognitive levels of learning (Bloom, 1954). In the past two decades, however, educators and
educationists recognized that not only are we expected to know facts and figures in today’s society,
but we are also expected to function effectively in the modern world, interact with other people,
and adjust to situations. Until the early to late 1990s, such expectations were not matched with appropriate assessment methods that could identify the successful acquisition of skills other than cognitive skills. Consequently, the traditional assessment method of pencil-and-paper testing identified potentially high performing students who have nonetheless not been successful in coping with the demands of modern society.
The most common method of assessing student learning is through tests (teacher-made or
standardized). Despite some criticisms leveled against using tests in determining if students are
learning or if schools are successful, these tests will continue to be used in the foreseeable future
(Shepard, 2000). Test results provide an easy and easily understood means of informing the student about his or her progress or the school about its performance. Standardized tests, in particular, provide
clear targets to aim for when teachers and administrators want improvement (Jason, 2003). Tests,
coupled with other observational performance-based techniques, provide a powerful combination
for an objective and precise assessment procedure.
The first step towards elevating a field of study into a science is to take measurements of the
quantities and qualities of interest in the field. In the Physical Sciences, such measurements are
quite easily understood and well-accepted. For instance, to measure the length of a piece of string, we compare it with a standard ruler or meter stick; to find the weight of an object, we compare the heaviness of the object with a standard kilogram or pound, and so on. Sometimes, we can measure
physical quantities by combining directly measurable quantities to derived quantities. For example,
to find the area of a rectangular piece of paper, we simply multiply the lengths of the sides of the
paper. In the field of educational measurement, however, the quantities and qualities of interest are more abstract: they cannot be seen, touched or directly observed, which makes the measurement process in education much more difficult.
For instance, knowledge of the subject matter is often measured through standardized test
results. In this case, the measurement procedure is testing. The same concept can be measured in
another way. We can ask a group of experts to rate a student's (or teacher's) knowledge of the subject matter; in this case, knowledge is measured through perceptions.
Objective measurements are measurements that do not depend on the person or individual
taking the measurements. Regardless of who is taking the measurement, the same measurement
values should be obtained when using an objective assessment procedure. In contrast, subjective
measurements often differ from one assessor to the next even if the same quantity or quality is
being measured.
For the variable X = class participation, we can let I1, I2, …, In denote indicators of a student's participation in each of n class recitations (1 if the student participated, 0 otherwise) and let X = (I1 + I2 + … + In)/n, the sum of the I's divided by the number of recitations. Thus, if there were n = 10 recitations and the student participated in 5 of these 10, then X = 5/10 or 50%.
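As a minimal sketch in Python (the data and the function name are invented only for this illustration, not part of the original text), the computation looks like this:

# Minimal sketch: computing the class participation variable X from its
# indicators I1, I2, ..., In (1 = participated in that recitation, 0 = did not).

def class_participation(indicators):
    # Proportion of recitations in which the student participated.
    return sum(indicators) / len(indicators)

# A student who participated in 5 of n = 10 recitations:
recitations = [1, 0, 1, 0, 1, 1, 0, 0, 1, 0]
print(class_participation(recitations))  # 0.5, i.e. 50%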
Indicators are the building blocks of educational measurement upon which all other forms of measurement are built. A group of indicators constitutes a variable. A group of variables forms a construct or a factor. The variables which form a factor correlate highly with each other but have low correlations with variables in another group.
1.2 Assessment
Once measurements are taken of an educational quantity or quality of interest, then the next
step is to assess the status of the educational phenomenon. For example, suppose that the quantity
of interest is the level of Mathematics achievement of Grade VI pupils in the district. The proposed measurement is a test in Mathematics for Grade VI pupils in the district. Suppose the District Office has decided on a target for the achievement test results; the school officials can then assess whether their Grade VI pupils are within a reasonable range of this target, i.e. whether they are above or below the achievement level target.
Summative Role. An assessment may be done for summative purposes as in the illustration
given above for grade VI mathematics achievement. Summative assessment tries to determine the
extent to which the learning objectives for a course (like Grade VI Mathematics) are met and why.
Diagnostic Role. Assessment may also be done for diagnostic purposes. In this case, we are
interested in determining the gaps in learning. Thus, on the topic of sentence construction, a diagnostic
examination may reveal the difficulties encountered by the students in matching subject and verb
or identifying subject and predicate, in vocabulary, etc. This function of assessment is akin to a medical doctor performing laboratory tests to determine a patient's illness or disease.
Evaluation models are important in the context of education. Evaluation implies that measurements and assessments of an educational characteristic have been done and that it is now desired to pass a value judgment on the educational outcome. In evaluating an outcome, we consider the objectives of the educative process and analyze whether the outputs and outcomes satisfy these objectives; if they do not, then we need to find the possible reasons for our failure to meet such objectives. The
possible reasons can, perhaps, be identified from the context, inputs, process and outputs of the
educational system. Figure 1 illustrates these ideas:
CONTEXT → INPUTS → PROCESS → OUTPUT → OUTCOME

Figure 1. A Systems Model for Evaluation
Evaluation provides a tool for determining the extent to which an educational process or program is effective and, at the same time, indicates directions for remediating parts of the curriculum that do not contribute to successful student performance. To this end, evaluation enhances organizational efficiency by providing focus for teacher and administrator efforts and by allowing resources to be directed to areas of greatest need.
Improving student performance is inextricably linked to improvement in the inputs and processes
that shape the effectiveness of teaching and learning. Evaluation, therefore, is of greatest interest to
both teachers and administrators, who plan and orchestrate the entire range of learning activities.
According to Brainard (1996), effective program evaluation is a systematic process that focuses
on program improvement and renewal and discovering peaks of program excellence. Program
evaluation needs to be viewed as an important ongoing activity, one that goes beyond research or
simple fact-finding to inform decisions about the future shape of the program under study. Program
evaluation contributes to quality services by providing feedback from program activities and outcomes
to those who can introduce changes in the program or who decide which services are to be carried out
effectively.
Program evaluation need not be limited to evaluation of educational processes and systems.
Program evaluation is also important in many programs of government agencies. For instance,
agencies of the government undertake programs with the assistance of foreign funding agencies to
target specific social concerns such as poverty reduction or governance and empowerment at the local government level. Such programs normally involve millions in funding, and it is very important that proper
evaluation be undertaken to ensure that resources invested in these programs are not wasted. In such
cases, the method called PERT (Program Evaluation Review Technique) is an indispensable
quantitative evaluation tool.
SUMMARY
Evaluation
- is the process of gathering and interpreting evidence regarding the problems and progress of individuals in achieving desirable educational goals.
Functions of Evaluation
- Prediction
- Diagnosis
- Research
Teaching, Learning and Evaluation are three interdependent aspects of the educative process (Gronlund, 1981). This interdependence is clearly seen when the main purpose of instruction is conceived in terms of helping pupils achieve a set of learning outcomes which include changes in the intellectual, emotional or physical domains. Instructional objectives, or in other words desired changes in the pupils, are brought about by planned learning activities, and pupils' progress is evaluated by tests and other devices.
This integration of evaluation into the teaching-learning process can be seen in the following
stages of the process:
- Evaluation should take into consideration the limitations of the particular educational
situations.
Measurement
- is a part of the educational evaluation process whereby some tools or instruments are used to
provide a quantitative description of the progress of students towards desirable educational
goals.
Test or Testing
Types of Evaluation
- Placement
- Formative
- Diagnostic
- Summative
(These types show that evaluation is integrated with the various phases of instruction)
Placement
Formative
- Evaluation provides the student with feedback regarding his or her success or failure in attaining instructional objectives.
It identifies the specific learning errors that need to be corrected and provides reinforcement for
successful performance as well.
For the teacher, formative evaluation provides information for making instruction and remedial
work more effective.
Diagnostic
- Evaluation is used to detect students' learning difficulties which are not revealed by formative tests or corrected by remedial instruction and other instructional adjustments.
Since it discloses the underlying causes of learning difficulties, diagnostic tests are therefore more
comprehensive and detailed.
Summative
- Evaluation is concerned with what students have learned. This implies that the instructional
activity has for the most part been completed and that little correction of learning deficiencies
is possible.
1. Clarifying objectives
2. Identifying variables that affect learning
3. Providing relevant instructional activities to achieve objectives
4. Determining the extent to which the objectives are achieved.
Assessment can be made precise, accurate and dependable only if the targets to be achieved are clearly stated and feasible. To this end, we consider learning targets involving knowledge,
reasoning, skills, products and effects. Learning targets need to be stated in behavioral terms which
denote something which can be observed through the behavior of the students. Thus, the objective
“to understand the concept of buoyancy” is not stated in behavioral terms. It is not clear how one
measures "understanding". On the other hand, if we restate the target as "to determine the volume of water displaced by a given submerged object", then we can easily measure the extent to which a
student understands “buoyancy”.
As early as the 1950s, Bloom (1954) proposed a hierarchy of educational objectives at the cognitive level. These are:
Level 1. KNOWLEDGE which refers to the acquisition of facts, concepts and theories. Knowledge
of historical facts like the date of the EDSA revolution, discovery of the Philippines or of
scientific concepts like the scientific name of milkfish, the chemical symbol of argon etc.
all fall under knowledge.
Knowledge forms the foundation of all other cognitive objectives, for without knowledge it is not possible to move up to the next higher level of thinking skills in the hierarchy of educational objectives.
Level 2. COMPREHENSION refers to the grasp of the meaning of the material learned and the ability to explain or interpret it in one's own words.
EXAMPLE: The Spaniards ceded the Philippines to the Americans in 1898 (knowledge of facts). In effect, the Philippines declared independence from Spanish rule only to be ruled by yet another foreign power, the Americans (comprehension).
Level 3. APPLICATION refers to the transfer of knowledge from one field of study to another or
from one concept to another concept in the same discipline.
EXAMPLE: The classic experiment of Pavlov on dogs showed that animals can be
conditioned to respond in a certain way to certain stimuli. The same principle can be
applied in the context of teaching and learning on behavior modification for school
children.
Level 4. ANALYSIS refers to the breaking down of a concept or idea into its components and
explaining the concept as a composition of these components.
EXAMPLE: Poverty in the Philippines, particularly at the barangay level, can be traced back to the low income levels of families in such barangays and the propensity for large households, with an average of about 5 children per family. (Note: Poverty is analyzed in the context of income and number of children.)
Level 5. SYNTHESIS refers to the opposite of analysis and entails putting together the
components in order to summarize the concept.
EXAMPLE: The field of geometry is replete with examples of synthetic lessons. From
the relationship of the parts of a triangle for instance, one can deduce that the sum of the
angles of a triangle is 180 degrees. (Padua, Roberto and Rosita G. Santos. (1997)
“Educational Evaluation and Measurement” Quezon City: Katha Publishing) pp. 21-22.
Level 6. EVALUATION AND REASONING refers to valuing and judgment, or placing a "worth" on a concept or principle.
Skills refer to specific activities or tasks that a student can do proficiently, e.g. skills in coloring or language skills. Skills can be clustered together to form specific competencies that characterize a student's ability, so that a program of study can be designed to optimize his/her innate abilities.
Abilities can be roughly categorized into: cognitive, psychomotor and affective abilities. For
instance, the ability to work well with others and to be trusted by every classmate (affective ability) is
an indication that the student can most likely succeed in work that requires leadership abilities. On the
other hand, other students are better at doing things alone, like programming and web designing
(cognitive ability) and, therefore, they would be good at highly technical individualized work.
Products, outputs and projects are tangible and concrete evidence of a student's ability. A clear target for products and projects needs to specify the level of workmanship expected, e.g. expert, skilled or novice level. A novice-level output, for instance, can be characterized by the indicator "at most four (4) imperfections noted", etc.
Once the learning targets are clearly set, it is necessary to determine an appropriate assessment procedure or method. We discuss the general categories of assessment methods or
instruments below.
Written-response instruments include objective tests (multiple choice, true-false, matching or short answer), essays, examinations and checklists. Objective tests are appropriate for assessing the various levels of the hierarchy of educational objectives. Multiple choice tests in particular can be constructed in such a way as to test higher-order thinking skills. Essays, when properly planned, can test the student's grasp of the higher level cognitive skills, particularly in the areas of application, analysis, synthesis and judgment. However,
when the essay question is not sufficiently precise and when the parameters are not properly defined,
there is a tendency for the students to write irrelevant and unnecessary things just to fill in blank
spaces. When this happens, both the teacher and the students will experience difficulty and frustration.
In the second essay question, the assessment foci are narrowed down to: (a) the main characters
of the event, and (b) the roles of each character in the revolution leading to the ouster of the incumbent
President at that time. It becomes clear what the teacher wishes to see and what the students are
supposed to write.
A teacher is often tasked to rate products. Examples of products that are frequently rated in
education are book reports, maps, charts, diagrams, notebooks, essays and creative endeavors of all
sorts. An example of a product rating scale is the classic “handwriting” scale used in the California
Achievement Test, Form W (1957). There are prototype handwriting specimens of pupils and students
(of various grades and ages). The sample handwriting of a student is then moved along the scale until
the quality of the handwriting sample is most similar to the prototype handwriting. To develop a
product rating scale for the various products in education, the teacher must collect prototype products over his/her years of experience.
One of the most frequently used measurement instruments is the checklist. A performance
checklist consists of a list of behaviors that make up a certain type of performance (e.g. using a
microscope, typing a letter, solving a mathematics problem and so on). It is used to determine
whether or not an individual behaves in a certain (usually desired) way when asked to complete a
particular task. If a particular behavior is present when an individual is observed, the teacher places a
check opposite it on the list.
The ancient Greeks used oral questioning extensively as an assessment method. Socrates himself, considered the epitome of a teacher, was said to have handled his classes solely through questioning
and oral interactions.
Oral questioning is an appropriate assessment method when the objectives are: (a) to assess the
student’s stock knowledge and/or (b) to determine the student’s ability to communicate ideas in
coherent verbal sentences. While oral questioning is indeed an option for assessment, several factors
need to be considered when using this option. Of particular significance are the student’s state of mind
and feelings, anxiety and nervousness in making oral presentations which could mask the student’s
true ability.
A tally sheet is a device often used by teachers to record the frequency of student behaviors,
activities or remarks. How many high school students follow instructions during fire drill, for
example? How many instances of aggression or helpfulness are observed when elementary students
are observed in the playground? In Mr. Sual's elementary statistics class, how often do students ask
questions about inference? Observational tally sheets are most useful in answering these kinds of
questions.
Observation and self-reports are useful supplementary assessment methods when used in
conjunction with oral questioning and performance tests. Such methods can offset the negative impact
on the students brought about by their fears and anxieties during oral questioning or when performing
an actual task under observation. However, since there is a tendency to overestimate one's own capability, it
may be useful to consider weighing self-assessment and observational reports against the results of
oral questioning and performance tests.
The quality of the assessment instrument and method used in education is very important since
the evaluation and judgments that the teacher gives on a student are based on the information he
obtains using these instruments. Accordingly, teachers follow a number of procedures to ensure that
the assessment process is valid and reliable.
Validity had traditionally been defined as the instrument’s ability to measure what it purports to
measure. We shall learn in this section that the concept has recently been modified to accommodate
a number of concerns regarding the scope of this traditional definition. Reliability, on the other hand,
is defined as the instrument’s consistency.
2.3.1 Validity
Validity, in recent years, has been defined as referring to the appropriateness, correctness,
meaningfulness and usefulness of the specific conclusions that a teacher reaches regarding the
teaching-learning situation. Content-validity refers to the content and format of the instrument. How
appropriate is the content? How comprehensive? Does the instrument logically get at the intended
variable or factor? How adequately does the sample of items or questions represent the content to be
assessed? Is the format appropriate? The content and format must be consistent with the definition of
the variable or factor to be measured. Some criteria for judging content validity are given as follows:
1. Do students have adequate experience with the type of task posed by the item?
2. Did the teachers cover sufficient material for most students to be able to answer the item correctly?
3. Does the item reflect the degree of emphasis received during instruction?
With these as a guide, a content validity table may be constructed in two (2) forms as provided
below:
Based on Form B, adjustments in the number of items that relate to a topic can be made
accordingly.
While content validity is important, there are other types of validity that one needs to verify. Face validity refers to the outward appearance of the test; it is the lowest form of test validity. A more important type of validity is called criterion-related validity. In criterion-related validity, the test item is judged against a specific criterion, e.g. relevance to a particular topic such as conservation. The degree to which the item measures the criterion is said to constitute its criterion validity.
Criterion validity can also be measured by correlating the test with a known valid test (as a criterion).
Finally, a test needs to possess construct validity. A “construct” is another term for a factor, and we
already know that a group of variables that correlate highly with each other form a factor. It follows
that an item possesses construct validity if it loads highly on a given construct or factor. A technique
called factor analysis is required to determine the construct validity of an item. Such technique is
beyond the scope of this book.
2.3.2 Reliability
The reliability of an assessment method refers to its consistency. It is also a term that is
synonymous with dependability or stability.
Stability or internal consistency as reliability measures can be estimated in several ways. The
Split-half method involves scoring two halves (usually, odd items versus even items) of a test
separately for each person and then calculating a correlation coefficient for the two sets of scores. The
coefficient indicates the degree to which the two halves of the test provide the same results and hence,
describes the internal consistency of the test. The reliability of the whole test is then calculated using what is known as the Spearman-Brown prophecy formula:

Reliability of the whole test = (2 x r_half) / (1 + r_half)

where r_half is the correlation between the scores on the two halves.
The Kuder-Richardson formulas are the more frequently employed formulas for determining internal consistency, particularly KR20 and KR21. We present the latter formula since KR20 is more difficult to calculate and usually requires a computer program:

KR21 = [K / (K - 1)] x [1 - M(K - M) / (K x Variance)]

where K = number of items on the test, M = mean of the test scores, and Variance = variance of the test scores.
The mean of a set of scores is simply the sum of the scores divided by the number of scores; the variance is given by:

Variance = [sum of (score - M)^2] / N

where N is the number of scores.
Reliability of a test may also mean the consistency of test results when the same test is
administered at two different time periods. This is the test-retest method of estimating reliability. The
estimate of test reliability is then given by the correlation of the two test results.
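To make these computations concrete, here is a minimal Python sketch (the scores are hypothetical and the helper functions are written only for this illustration) of the split-half coefficient with the Spearman-Brown correction, KR-21, and a test-retest correlation:

# Illustrative sketch of the reliability estimates discussed above.
# All examinee scores below are hypothetical.

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def pearson_r(xs, ys):
    # Pearson product-moment correlation between two sets of scores.
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def spearman_brown(r_half):
    # Reliability of the whole test from the correlation between its two halves.
    return 2 * r_half / (1 + r_half)

def kr21(k, total_scores):
    # Kuder-Richardson formula 21: k = number of items, total_scores = total test scores.
    m, var = mean(total_scores), variance(total_scores)
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * var))

# Hypothetical data for five examinees on a 20-item test.
odd_half  = [8, 6, 9, 5, 7]       # scores on the odd-numbered items
even_half = [7, 6, 10, 4, 8]      # scores on the even-numbered items
totals    = [15, 12, 19, 9, 15]   # total scores (odd + even)
retest    = [14, 13, 18, 10, 16]  # total scores on a second administration

r_half = pearson_r(odd_half, even_half)
print("Split-half correlation:", round(r_half, 2))
print("Spearman-Brown reliability:", round(spearman_brown(r_half), 2))
print("KR-21 reliability:", round(kr21(20, totals), 2))
print("Test-retest reliability:", round(pearson_r(totals, retest), 2))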
2.3.3 Fairness
An assessment procedure needs to be fair. This means many things. First, students need to know
exactly what the learning targets are and what method of assessment will be used. If students do not
know what they are supposed to be achieving, then they could get lost in the maze of concepts being
discussed in class. Likewise, students have to be informed how their progress will be assessed in order
to allow them to strategize and optimize their performance.
Third, fairness also implies freedom from teacher-stereotyping. Some examples of stereotyping
include: boys are better than girls in Mathematics or girls are better than boys in language. Such
stereotyped images and thinking could lead to unnecessary and unwanted biases in the way that
teachers assess their students.
The term “ethics” refers to questions of right and wrong. When teachers think about ethics, they
need to ask themselves if it is right to assess a specific knowledge or investigate a certain question. Are
there some aspects of the teaching-learning situation that should not be assessed? Here are some
situations in which assessments may not be called for:
Requiring students to answer checklists of their sexual fantasies;
Asking elementary pupils to answer sensitive questions without consent of their parents;
Testing the mental abilities of pupils using an instrument whose validity and reliability are
unknown;
When a teacher thinks about ethics, the basic question to ask in this regard is: “Will any physical
or psychological harm come to any one as a result of the assessment or testing?” Naturally, no teacher
would want this to happen to any of his/her students.
Test results and assessment results are confidential. Such results should be known only by the
student concerned and the teacher. Results should be communicated to the students in such a way that
other students would not be in possession of information pertaining to any specific member of the
class.
The third ethical issue in assessment is deception. Should students be deceived? There are
instances in which it is necessary to conceal the objective of the assessment from the students in order
to ensure fair and impartial results. When this is the case, the teacher has a special responsibility to (a)
determine whether the use of such techniques is justified by the educational value of the assessment,
(b) determine whether alternative procedures are available that do not make use of concealment, and
(c) ensure that students are provided with sufficient explanation as soon as possible.
Finally, the temptation to assist certain individuals in class during assessment or testing is ever
present. In this case, it is best if the teacher does not administer the test himself if he believes that such
a concern may, at a later time, be considered unethical.
Analysis. The students must be able to break down a given sentence into its subject and
predicate.
Synthesis. The students must be able to formulate rules to be followed regarding subject-verb
agreement.
Deciding on the type of objective test. The test objectives dictate the kind of objective tests that
will be designed and constructed by the teacher. For instance, for the first four (4) levels, we may want
to construct a multiple-choice type of test while for application and judgment, we may opt to give an
essay test or a modified essay test.
Preparing a table of specifications (TOS). A table of specifications or TOS is a test map that
guides the teacher in constructing a test. The TOS ensures that there is a balance between items that
test lower level thinking skills and those which test higher order thinking skills (or alternatively, a
balance between easy and difficult items) in the test. The simplest TOS consists of four (4) columns: (a) level of objective to be tested, (b) statement of objective, (c) item numbers where such an objective is being tested, and (d) number of items and percentage out of the total for that particular objective. A
prototype table is shown below:
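Since the prototype table itself did not reproduce here, the short Python sketch below captures the same structure as one possible reconstruction consistent with the discussion that follows; the objective statements and the item numbers for levels other than knowledge and synthesis are assumptions made only for illustration:

# Hypothetical sketch of a simple table of specifications (TOS).
# Each row: level of objective, statement of objective, item numbers, weight in points.
tos = [
    ("Knowledge",     "recall of facts and concepts",         [1, 3, 5, 7, 9],      5),
    ("Comprehension", "explain concepts in one's own words",  [2, 4, 6, 8, 10],     5),
    ("Analysis",      "break a concept into its components",  [11, 13, 15, 17, 19], 5),
    ("Synthesis",     "combine components into a whole",      [12, 14, 16, 18, 20], 5),
    ("Application",   "apply a principle to a new situation", ["essay"],            10),
]

total_points = sum(weight for _, _, _, weight in tos)
for level, objective, items, weight in tos:
    share = 100 * weight / total_points
    print(f"{level:13s} items {items} -> {weight} points ({share:.1f}%)")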
In the table of specification above, we see that there are five items that deal with knowledge and
these items are items 1, 3, 5, 7, 9. Similarly, from the same table we see that five items represent
synthesis, namely: 12, 14, 16, 18, 20. The first four levels of Bloom’s taxonomy are equally
represented in the test while application (tested through essay) is weighted equivalent to ten (10) points
or double the weight given to any of the first four levels. The table of specifications guides the teacher
in formulating the test. As we can see, the TOS also ensures that each of the objectives in the hierarchy of educational objectives is well represented in the test. As such, the resulting test constructed by the teacher will be more or less comprehensive. Without the table of specifications, the tendency of the test maker is to focus too much on facts and concepts at the knowledge level.
Constructing the test items. The actual construction of the test items follows the TOS. As a general rule, it is advised that the number of items constructed for the draft be double the desired number of items. For instance, if there are five (5) knowledge level items to be included in the final test form, then at least ten (10) knowledge level items should be included in the draft. The subsequent
test try-out and item analysis will most likely eliminate many of the constructed items in the draft
(either they are too difficult, too easy or non-discriminatory), hence, it will be necessary to construct
more items than will actually be included in the final test form.
Item analysis and try-out. The test draft is tried out on a group of pupils or students. The purpose of this try-out is to determine: (a) the item characteristics through item analysis, and (b) the characteristics of the test itself: validity, reliability, and practicality.
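As a rough illustration of what such an item analysis typically computes (a hypothetical sketch: the response data and the 27% upper/lower-group convention are assumptions for the example, not prescriptions from the text), the difficulty and discrimination indices of a single item might be obtained as follows:

# Hypothetical sketch of basic item analysis for one test item.
# responses: 1 = answered correctly, 0 = answered incorrectly,
# arranged from the highest-scoring examinee to the lowest-scoring one.

def item_difficulty(responses):
    # Proportion of examinees answering the item correctly.
    return sum(responses) / len(responses)

def item_discrimination(responses, group_fraction=0.27):
    # Difference in difficulty between the upper and lower scoring groups.
    n = max(1, round(len(responses) * group_fraction))
    upper, lower = responses[:n], responses[-n:]
    return item_difficulty(upper) - item_difficulty(lower)

# Responses of 10 examinees to one item, ranked by total test score (highest first).
item_1 = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]

print("Difficulty index:", item_difficulty(item_1))          # 0.6 (moderately easy)
print("Discrimination index:", item_discrimination(item_1))  # positive: item discriminates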
Obviously, the answer is FALSE because 100 years from 1898 is not 2000 but
1998.
Rule 2. Avoid using the words "always", "never", "often" and other adverbs that tend to make a statement either always true or always false.
Example: Christmas always falls on a Sunday because it is a Sabbath day. _______
Statements that use the word "always" are almost always false. A test-wise student can easily guess his way through a test like this and get high scores even if he does not know anything about the subject matter being tested.
Rule 3: Avoid long sentences as these tend to be "true". Keep sentences short.
Example: Tests need to be valid, reliable and useful, although it would require a great amount of time and effort to ensure that tests possess these characteristics. ______
Notice that the statement is true. However, we are also not sure which part of the sentence is deemed true by the student. It is just fortunate that in this case, all parts of the above sentence are true. The following example illustrates what can go wrong in long sentences:
Example: Tests need to be valid, reliable and useful since it takes very little time, money and effort to construct tests with these characteristics. ______
The first part of the sentence is true but the second part is debatable and may, in fact, be false. Thus, a "true" response could be judged correct, and so could a "false" response.
Rule 4. Avoid trick statements with some minor misleading word, spelling anomaly, misplaced phrase, etc. A test-wise student who does not know the subject matter may detect this strategy and thus get the answer correct.
Example: True or False. The Principle of our school is Mr. Albert P. Panadero.
The principal's name may actually be correct, but since the word "Principle" is misspelled, the entire sentence takes on a different meaning and the answer would be false! This is an example of a tricky but utterly useless item.
Rule 5. Avoid quoting verbatim from reference materials or textbooks. This practice sends the
wrong signal to the students that it is necessary to memorize the textbook word for word
and thus, acquisition of higher level thinking skills is not given due importance.
Rule 6. Avoid specific determiners or give-away qualifiers. Students quickly learn that strongly worded statements are more likely to be false than true, for example, statements with "never", "no", "all" or "always". Moderately worded statements are more likely to be true than false, for example, statements with "many", "often", "sometimes", "generally", "frequently" or "some". Both kinds of give-away words should be avoided.
Rule 7. With true or false questions, avoid a grossly disproportionate number of either true or
false statements or even patterns in the occurrence of true and false statements.
Example:
Much of the process of photosynthesis takes place in the:
a. bark
b. leaf
c. stem
The qualifier "much" is vague and could have been replaced by a more specific qualifier like "90% of the photosynthetic process" or some similar phrase that would be more precise.
3. Avoid complex or awkward word arrangements. Also, avoid use of negatives in the stem as this may
add unnecessary comprehension difficulties.
Example:
(Poor) As President of the Republic of the Philippines, Corazon Cojuangco Aquino would stand
next to which President of the Philippines Republic subsequent after Corazon C. Aquino?
4. Do not use negatives or double negatives as such statements tend to be confusing. It is best to use
simpler sentences rather than sentences that would require expertise in grammatical construction.
Example:
(Poor) Which of the following will not cause inflation in the Philippine economy?
(Better) Which of the following will cause inflation in the Philippine economy?
Poor: What does the statement “Development patterns acquired during the formative years are
NOT Unchangeable” imply?
A.
B.
C.
D.
Better: What does the statement “Development patterns acquired during the formative years are
changeable” imply?
A.
B.
C.
D.
5.) Each item stem should be as short as possible; otherwise you risk testing more for reading and
comprehension skills.
6.) Distracters should be equally plausible and attractive.
Example:
The short story "May Day's Eve" was written by which Filipino author?
a. Jose Garcia Villa
b. Nick Joaquin
c. Genoveva Edrosa Matute
d. Robert Frost
e. Edgar Allan Poe
If the distracters had all been Filipino authors, the value of the item would be greatly increased. In this
particular instance, only the first three carry the burden of the entire item since the last two can be
essentially disregarded by the students.
7.) All multiple choice options should be grammatically consistent with the stem.
8.) The length, explicitness, or degree of technicality of alternatives should not be the determinants of
the correctness of the answer. The following is an example of this rule:
Example:
If the three angles of two triangles are congruent, then the triangles are:
a. congruent whenever one of the sides of the triangles are congruent
b. similar
c. equiangular and therefore, must also be congruent
d. equilateral if they are equiangular
The correct choice, "b", may be obvious from its length and explicitness alone. The other choices are long and tend to over-explain why they must be the correct choices, leading the students to think that they are, in fact, not the correct answers!
9.) Avoid stems that reveal the answer to another item.
10.) Avoid alternatives that are synonymous with others or those that include or overlap others.
Example:
What causes ice to transform from solid state to liquid state?
a. Change in temperature
b. Changes in pressure
c. Change in the chemical composition
d. Change in heat levels
The options a and d are essentially the same. Thus, a student who spots these identical choices would right away narrow down the field of choices to a, b and c. The last distracter would play no significant role in increasing the value of the item.
11.) Avoid presenting sequenced items in the same order as in the text.
12.) Avoid use of assumed qualifiers that many examinees may not be aware of.
13.) Avoid use of unnecessary words or phrases which are not relevant to the problem at hand (unless
such discriminating ability is the primary intent of the evaluation). The item’s value is
particularly damaged if the unnecessary material is designed to distract or mislead. Such items
test the student’s reading comprehension rather than knowledge of the subject matter.
Example: The side opposite the thirty degree angle in a right triangle is equal to half the length of
the hypotenuse. If the sine of a 30-degree angle is 0.5 and the hypotenuse is 5, what is the length of the side
opposite the 30-degree angle?
a. 2.5
b. 3.5
c. 5.5
d. 1.5
The sine of the 30-degree angle is really quite unnecessary since the first sentence already gives the
method for finding the length of the side opposite the thirty-degree angle. This is a case of a
teacher who wants to make sure that no student in his class gets the wrong answer!
14.) Avoid use of non-relevant sources of difficulty such as requiring a complex calculation when only
knowledge of a principle is being tested.
Note in the previous example, knowledge of the sine of a 30-degree angle would have led some
students to use the sine formula for calculation even if a simpler approach would have sufficed.
15.) Avoid extreme specificity requirements in responses.
16.) Include as much of the item as possible in the stem. This allows less repetition and shorter choice
options.
17.) Use the “None of the above” option only when the keyed answer is totally correct. When choice of
the “best” response is needed, “none of the above” is not appropriate, since the implication has
already been made that the correct response may be partially inaccurate.
18.) Note that use of “all of the above” may allow credit for partial knowledge. In a multiple option
item, (allowing only one option choice) if a student only knew that two (2) options were correct,
he could then deduce the correctness of “all of the above”. This assumes you are allowed only
one correct choice.
19.) Having compound response choices may purposefully increase difficulty of an item.
20.) The difficulty of a multiple choice item may be controlled by varying the homogeneity or degree of
similarity of responses. The more homogeneous, the more difficult the item.
Example:
(Less Homogeneous)
Thailand is located in:
a. Southeast Asia
b. Eastern Europe
c. South America
d. East Africa
e. Central America
(More Homogeneous)
Thailand is located next to:
a. Laos and Kampuchea
b. India and China
c. China and Malaya
d. Laos and China
e. India and Malaya
A B
___1. Magellan a. First President of the Republic
___2. Mabini b. National Hero
___3. Rizal c. Discovered the Philippines
___4. Lapu-Lapu d. Brain of Katipunan
___5. Aguinaldo e. The great painter
f. Defended Limasawa Island
Normally, column B will contain more items than column A to prevent guessing on the part
of the students. Matching type items, unfortunately, often test lower order thinking skills
(knowledge level) and are unable to test higher order thinking skills such as application and
judgment skills.
A variant of the matching type items is the data sufficiency and comparison type of test
illustrated below:
Example: Write G if the item on the left is greater than the item on the right; L if the item on the
left is less than the item on the right; E if the item on the left equals the item on the right and D if
the relationship cannot be determined.
A B
1. Square root of 9 ______ a. -3
2. Square of 25 ______ b. 615
3. 36 inches ______ c. 3 meters
4. 4 feet ______ d. 48 inches
5. 1 kilogram ______ e. 1 pound
The data sufficiency test above can, if properly constructed, test higher order thinking skills.
Each item goes beyond simple recall of facts and, in fact, requires the student to make decisions.
Another useful device for testing lower order thinking skills is the supply type of test. Like the multiple choice test, the items in this kind of test consist of a stem and a blank where the
students would write the correct answer.
Example: The study of life and living organisms is called______________.
Supply type tests depend heavily on the way that the stems are constructed. These tests allow
for one and only one answer and, hence, often test only the students’ knowledge. It is, however,
possible to construct supply type of tests that will test higher order thinking as the following
example will show:
Example: Write an appropriate synonym for each of the following. Each blank corresponds to
a letter:
Metamorphose: _ _ _ _ _ _
Flourish: _ _ _ _
The appropriate synonym for the first is CHANGE with six (6) letters while the appropriate synonym for the second is GROW with four (4) letters. Notice that these questions require not
only mere recall of words but also understanding of these words.
3.6 Essays
Essays, classified as non-objective tests, allow for the assessment of higher order thinking skills.
Such tests require students to organize their thoughts on a subject matter in coherent sentences in order
to inform an audience. In essay tests, students are requested to write one or more paragraphs on a
specified topic.
Essay questions can be used to measure attainment for a variety of objectives. Stecklein (1955)
has listed 14 types of abilities that can be measured by essay items:
1. Comparisons between two or more things
2. The development and defense of an opinion
3. Questions of cause and effect
4. Explanations of meanings
5. Summarizing of information in a designated area
6. Analysis
7. Knowledge of relationships
8. Illustrations of rules, principles, procedure, and applications
9. Applications of rules, laws, and principles to new situations
10. Criticisms of the adequacy, relevance, or correctness of a concept, idea, or information
11. Formulation of new questions and problems
12. Reorganization of facts
13. Discriminations between objects, concepts, or events
14. Inferential thinking
Note that all these involve the higher-level skills mentioned in Bloom’s Taxonomy.
The following are rules of thumb which facilitate the grading of essay papers:
Rule 1: Phrase the direction in such a way that students are guided on the key concepts to be
included.
Example: Write an essay on the topic: “Plant Photosynthesis” using the following keywords and
phrases: chlorophyll, sunlight, water, carbon dioxide, oxygen, by-product, stomata.
Note that the students are properly guided in terms of the keywords that the teacher is
looking for in this essay examination. An essay such as the one given below will get a
score of zero (0). Why?
Plant Photosynthesis
Nature has its own way of ensuring the balance between food producers and
consumers. Plants are considered producers of food for animals. Plants produce food
for animals through a process called photosynthesis. It is a complex process that
combines various natural elements on earth into the final product which animals can
consume in order to survive. Naturally, we all need to protect plants so that we will
continue to have food on our table. We should discourage burning of grasses, cutting of
trees and illegal logging. If the leaves of plants are destroyed, they cannot perform
photosynthesis and animals will also perish.
Rule 2: Inform the students on the criteria to be used for grading their essays. This rule allows
the students to focus on relevant and substantive materials rather than on peripheral
and unnecessary facts and bits of information.
Example: Write an essay on the topic: “Plant Photosynthesis” using the keywords indicated. You
will be graded according to the following criteria: (a) coherence, (b) accuracy of
statements, (c) use of keywords, (d) clarity and (e) extra points for innovative
presentation of ideas.
Rule 3: Put a time limit on the essay test.
Rule 4: Decide on your essay grading system prior to getting the essays of your students.
Rule 5: Evaluate all of the students' answers to one question before proceeding to the next
question.
Scoring or grading essay tests question by question, rather than student by student,
makes it possible to maintain a more uniform standard for judging the answers to each
questions. This procedure also helps offset the halo effect in grading. When all of the
answers on one paper are read together, the grader’s impression of the paper as a whole
is apt to influence the grades he assigns to the individual answers. Grading question by
question, of course, prevents the formation of this overall impression of the student’s
paper. Each answer is more apt to be judged on its own merits when it is read and
compared with other answers to the same question, than when it is read and compared
with other answers by the same student.
Rule 6: Evaluate answers to essay questions without knowing the identity of the writer. This is
another attempt to control personal bias during scoring. Answers to essay questions
should be evaluated in terms of what is written, not in terms of what is known about the
writers from other contacts with them. The best way to prevent our prior knowledge
from influencing our judgment is to evaluate each answer without knowing the identity
of the writer. This can be done by having the students write their names on the back of
the paper or by using code numbers in place of names.
Rule 7: Whenever possible, have two or more persons grade each answer. The best way to
check on the reliability of the scoring of essay answers is to obtain two or more
independent judgments. Although this may not be a feasible practice for routine
classroom testing, it might be done periodically with a fellow teacher (one who is
equally competent in the area). Obtaining two or more independent ratings becomes
especially vital where the results are to be used for important and irreversible decisions,
such as in the selection of students for further training or for special awards. Here the
pooled ratings of several competent persons may be needed to attain a level of reliability
that is commensurate with the significance of the decision being made.
Some teachers use cumulative criteria, i.e. adding the weights given to each criterion, as the basis for grading, while others use the reverse. In the latter method, each
student begins with a score of 100. Points are then deducted every time a teacher
encounters a mistake or when a criterion is missed by the student in his essay.
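A minimal Python sketch of the two grading schemes just described, with criterion weights and deductions invented purely for illustration:

# Hypothetical sketch of the two essay grading schemes described above.

# Cumulative scheme: add up the points earned on each criterion.
criterion_points = {"coherence": 20, "accuracy of statements": 25,
                    "use of keywords": 18, "clarity": 22}
cumulative_grade = sum(criterion_points.values())
print("Cumulative grade:", cumulative_grade)  # 85

# Reverse scheme: start from 100 and deduct points for every mistake
# or missed criterion encountered while reading the essay.
deductions = [5, 3, 2, 5]  # points taken off per mistake found
reverse_grade = 100 - sum(deductions)
print("Reverse grade:", reverse_grade)        # 85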
In 1956, Benjamin Bloom headed a group of educational psychologists who developed a classification
of levels of intellectual behavior important in learning. This became a taxonomy including three overlapping domains: the cognitive, affective and psychomotor.
Cognitive learning is demonstrated by knowledge recall and the intellectual skills: comprehending
information, organizing ideas, analyzing and synthesizing data, applying knowledge, choosing among
alternatives in problem-solving, and evaluating ideas or actions. This domain on the acquisition and
use of knowledge is predominant in the majority of courses. Bloom identified six levels within the
cognitive domain, from the simple recall or recognition of facts, as the lowest level, through
increasingly more complex and abstract mental levels, to the highest order which is classified as
evaluation. Verb examples that represent intellectual activity on each level are listed here, and each
level is linked to questions appropriate to the level.
1. Knowledge: arrange, define, duplicate, label, list, memorize, name, order, recognize, relate,
recall, repeat, reproduce, state.
2. Comprehension: classify, describe, discuss, explain, express, identify, indicate, locate,
recognize, report, restate, review, select, translate.
3. Application: apply, choose, demonstrate, dramatize, employ, illustrate, interpret, operate,
practice, schedule, sketch, solve, use, write.
4. Analysis: analyze, appraise, calculate, categorize, compare, contrast, criticize, differentiate,
discriminate, distinguish, examine, experiment, question, test.
5. Synthesis: arrange, assemble, collect, compose, construct, create, design, develop, formulate,
manage, organize, plan, prepare, propose, set up, write.
6. Evaluation: appraise, argue, assess, attach, choose, compare, defend, estimate, judge, predict, rate, score, select, support, value, evaluate.
The affective domain covers behaviors such as enjoying, conserving, respecting, and supporting. Verbs applicable to the affective domain include
accepts, attempts, challenges, defends, disputes, joins, judges, praises, questions, shares, supports, and
volunteers.
KNOWLEDGE
o remembering;
o memorizing;
o recognizing;
o identification and
o recall of information
Who, what, when, where, how ...?
Describe
COMPREHENSION
o interpreting;
o translating from one medium to another;
o describing in one's own words;
o organization and selection of facts and ideas
Retell...
APPLICATION
o problem solving;
o applying information to produce some result;
o use of facts, rules and principles
How is...an example of...?
How is...related to...?
Why is...significant?
ANALYSIS
o subdividing something to show how it is put together;
o finding the underlying structure of a communication;
o identifying motives;
o separation of a whole into component parts
What are the parts or features of...?
Classify...according to...
Outline/diagram...
How does...compare/contrast with...?
SYNTHESIS
o creating a unique, original product that may be in verbal form or may be a physical
object;
o combination of ideas to form a new whole
What would you predict/infer from...?
What ideas can you add to...?
How would you create/design a new...?
What might happen if you combined...?
What solutions would you suggest for...?
EVALUATION
o making value decisions about issues;
o resolving controversies or differences of opinion;
o development of opinions, judgements or decisions
Do you agree...?
What do you think about...?
What is the most important...?
Place the following in order of priority...
How would you decide about...?
What criteria would you use to assess...?
Since the work was produced by higher education, the words tend to be a little bigger than we
normally use. Domains can be thought of as categories. Trainers often refer to these three
domains as KSA (Knowledge, Skills, and Attitude). This taxonomy of learning behaviors can
be thought of as "the goals of the training process." That is, after the training session, the
learner should have acquired new skills, knowledge, and/or attitudes.
The committee also produced an elaborate compilation for the cognitive and affective
domains, but none for the psychomotor domain. Their explanation for this oversight was that they had little experience in teaching manual skills at the college level (I guess they
never thought to check with their sports or drama department).
This compilation divides the three domains into subdivisions, starting from the simplest
behavior to the most complex. The divisions outlined are not absolutes and there are other
systems or hierarchies that have been devised in the educational and training world.
However, Bloom's taxonomy is easily understood and is probably the most widely applied
one in use today.
Cognitive (1)
The cognitive domain involves knowledge and the development of intellectual skills. This
includes the recall or recognition of specific facts, procedural patterns, and concepts that
serve in the development of intellectual abilities and skills. There are six major categories,
which are listed in order below, starting from the simplest behavior to the most complex. The
categories can be thought of as degrees of difficulties. That is, the first one must be mastered
before the next one can take place.
Affective (2)
This domain includes the manner in which we deal with things emotionally, such as feelings,
values, appreciation, enthusiasms, motivations, and attitudes. The five major categories are
listed from the simplest behavior to the most complex:
Psychomotor (3)
The psychomotor domain includes physical movement, coordination, and use of the motor-
skill areas. Development of these skills requires practice and is measured in terms of speed,
precision, distance, procedures, or techniques in execution. The seven major categories are
listed from the simplest behavior to the most complex:
Dave's:(4)
o Imitation: Observing and patterning behavior after someone else. Performance may be of low
quality. Example: Copying a work of art.
o Manipulation: Being able to perform certain actions by following instructions and practicing.
Example: Creating work on one's own, after taking lessons, or reading about it.
o Precision: Refining, becoming more exact. Few errors are apparent. Example: Working and
Harrow's:(5)
References
1. Bloom, B. S. (1956). Taxonomy of Educational Objectives, Handbook I: The Cognitive Domain. New York: David McKay Co Inc.
3. Simpson, E. J. (1972). The Classification of Educational Objectives in the Psychomotor Domain. Washington, DC: Gryphon House.
4. Dave, R. H. (1975). Developing and Writing Behavioural Objectives. (R. J. Armstrong, ed.) Educational Innovators Press.
There are certain characteristics of a good measuring instrument that make it useful; otherwise it may
not serve its purpose well.
These characteristics are:
1. Validity. The validity of a test is the degree of accuracy by which it measures what it aims to
measure. For instance, if a test aims to measure proficiency in solving linear systems in algebra,
and it does measure proficiency in solving linear systems in algebra, then it is valid. But if the test
measures only proficiency in solving linear equations then the test is not valid. The degree of
validity of a test is often expressed numerically as a coefficient of correlation with another test of
the same kind and of known validity. This is computed and explained in a later chapter.
a. Content validity. This refers to the relevance of the test items of a test to the subject matter or
situation from which they are taken. For instance, an achievement test in elementary algebra is to
be constructed. If all the items to be included in the test are all taken from elementary algebra,
then the test has a high content validity. However, if most of the items are taken from arithmetic,
then the test will have a very low content validity. This type of validity is also called “face
validity" or "logical validity".
b. Concurrent validity. This refers to the correspondence of the scores of a group in a test with the
scores of the same group in a similar test of already known validity used as a criterion. Suppose a
man constructs an intelligence test and he wants to know how valid his test is. He takes another
intelligence test of already known validity and uses this as the criterion. He gives the two tests, his test
and the criterion test, to the same group. Then he computes the coefficient of correlation between
the scores of the group in the two tests. If the coefficient of correlation between the two tests is
high, say .80, then the new test has high concurrent validity. (The degree of correlation is expressed numerically from -1.00 to 0 for negative correlation and from 0 to 1.00 for positive correlation. The correlation between two tests is high if the examinees getting relatively high scores in the first test also get relatively high scores in the second test, and those getting low scores in the first test also get relatively low scores in the second test.)
c. Predictive validity. This refers to the degree of accuracy of how a test predicts the level of
performance in a certain activity which it intends to foretell. Example: Intelligence tests usually
predict the level of performance in activities involving intellectual ability like school work. So, if
an individual scores high in an intelligence test and also gets high grades in school work, then the
intelligence test has a high predictive validity.
d. Construct validity. This refers to the agreement of test results with certain characteristics which the test aims to portray. Consider the following examples. If children with higher intellectual ability score higher on an intelligence test than children with lower intellectual ability, the intelligence test has high construct validity. Another example: suppose that in an intelligence test for high school students, the second-year students score generally higher than the first-year students, and the third-year students score generally higher than the second-year students; then the said intelligence test has high construct validity. Another example: true extroverts score higher for extroversion than true introverts in a test of personality if the test has high construct validity.
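The correlation coefficient mentioned in the discussion of concurrent validity can be computed directly. The following sketch, written in Python, correlates the scores of one group on a newly constructed test with their scores on a criterion test of known validity; the score lists are hypothetical and serve only to illustrate the computation.

# Minimal sketch: a validity coefficient computed as a Pearson correlation.
# The score lists below are hypothetical; in practice they would be the scores
# of the same group of examinees on the new test and on the criterion test.
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

new_test = [78, 65, 90, 55, 82, 70, 60, 88]        # scores on the newly constructed test
criterion_test = [75, 68, 92, 58, 80, 73, 55, 85]  # scores on the test of known validity

# A high positive coefficient indicates high concurrent validity.
r = pearson_r(new_test, criterion_test)
print(f"Validity coefficient (concurrent): {r:.2f}")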
2. Reliability. The reliability of a test is the degree of consistency of measurement that it gives. Suppose a test is given to an individual and, after the lapse of a certain length of time, the same test is again given to the same individual. If the scores in the two administrations of the test are identical or almost identical, the test is reliable. Or, if the test is given to a group twice and the means (averages) of the scores in the two test administrations are the same or almost the same, the test is reliable. Like validity, the degree of reliability of a test is numerically expressed as a coefficient of correlation.
There are ways of computing the degrees of validity and reliability, but they are complicated statistical methods within the scope of books in higher statistics, and so there is no intention of including them here.
Factors of reliability. There are factors that affect reliability, among which are:
a. Adequacy. Adequacy refers to the appropriate length of the test and the proper sampling of the test content. A test is adequate if it is long enough to contain a sufficient number of representative items of the behavior to be measured, so that it is able to give a true measurement. To make a test more reliable, make it longer and make its items a representative sample of the subject matter covered by the test.
b. Objectivity. A test is objective if it yields the same score no matter who checks it, or even if it is checked at different times. Suppose a teacher scores a paper and the number of correct responses is 80. Another teacher checks the same paper and the number of correct responses is also 80. After several days, the first teacher rechecks the same test paper and the number of correct responses is again 80. The test is objective. To make a test objective, make the responses to the items single symbols, words, or phrases.
c. Testing condition. This refers to the conditions of the examination room. If the room is too warm, poorly or unevenly lighted, poorly or unevenly ventilated, or noisy, the testees cannot score as well as when the room is properly lighted, ventilated, and quiet. The seats and writing surfaces of the testees should also be made as comfortable as possible to ensure good performance.
d. Test administration procedures. The manner of administering a test also affects its reliability. Explicit directions usually accompany a test and they should be followed strictly because these procedures are standardized. Directions should be clearly understood before starting the test. The testees may be allowed to ask questions for better understanding of the procedures before the start of the examination. Testees are no longer expected to ask questions during the test period because this will distract the others. Testing materials should be sufficient and available. If possible, the testees should have two pens so that if one runs out of ink, there is an immediate replacement.
Reliability is a factor of validity; that is, a test cannot be valid without being reliable. However, validity is not a factor of reliability, because a test can be reliable without being valid.
3. Usability. A test is usable if it is practical to administer, score, and interpret. Among the factors that affect usability are:
a. Administrability. There are tests that are easy to administer and there are tests that are hard to administer. Group tests are usually easy to administer because the directions are easy to follow. This increases the usability of such tests because they are more in demand. On the other hand, there are tests that are quite difficult to administer on account of the complexity of their directions, and this lessens the demand for these tests.
b. Scorability. This is another factor of usability. There are tests that are easy to score, and they are usually in demand. But there are tests, some of them personality tests, that are difficult to score on account of the different weights, some positive and some negative, given to the items, and the computations to arrive at the final scores are very complicated. This situation lessens the demand for these tests.
c. Economy. There are tests the answers to which are written on the test papers themselves, and so they cannot be used again. This makes these kinds of tests costly and limits their usability. There are also tests that utilize separate answer sheets so that the test booklets can be used again and again. Because these tests are cheaper, they are more in demand, enhancing their usability.
d. Comparability. This refers to the availability of norms with which the scores of testees are compared to determine the meanings of their scores. For instance, in an intelligence test of 75 items one obtains a score of 70. Comparing this with the norms, a score of 70 is equivalent to a percentile rank of 95. This means that the person obtaining the score of 70 scored higher than 95 percent of the population which the test is intended to cover. (A short sketch illustrating how a percentile rank is read off from a norm group follows the last of these factors below.)
e. Utility. A test is utile if it adequately serves the very purpose for which it is intended. If a test is intended to measure achievement in mathematics and it does measure achievement in mathematics, then the test has high utility. The test is usable.
As far as educational measurement is concerned, there are two general kinds of measuring instruments. They are:
1. Standard test. A standard or standardized test is one for which content has been selected and checked empirically, for which norms have been established, for which uniform methods of administering and scoring have been developed, and which may be scored with a relatively high degree of objectivity. (Good, 565) Some examples of standard tests are intelligence tests, aptitude tests, personality tests, and interest tests.
2. Teacher-made tests. Teacher-made tests are those made by teachers and administered to their students to determine the achievement of the latter in the subjects they are taking, for purposes of marking and promotion. Some examples of teacher-made tests are essay examinations and objective types of tests such as true-false, fill-in-the-blanks, multiple choice, etc.
Standard test and teacher-made test are very similar in function. Both are for measurement.
However, they differ in many respects. Among their differences are:
For standard tests:
(3) Standard tests are given to a large portion of the population for which they are intended, for the computation of norms.
(4) Standard tests are generally correlated with other tests of known validity and reliability, or with measures such as school marks, to determine their validity and reliability.
(5) Standard tests are generally highly objective.
For teacher-made tests:
(4) Teacher-made tests are not subjected to any statistical procedures to determine their validity and reliability.
(5) Teacher-made tests may be objective and may be essay, in which case scoring is subjective.
(6) Teacher-made tests have no norms unless the teacher computes the median, mean, and other measures for comparison and interpretation.
The more common ways of classifying standard tests are the following:
A. According to Function
respondent interprets and his interpretations will reveal his values, motives, and other
aspects of his personality.
d. Vocational and professional interest inventory. This is a test used to determine the extent to which a person's likes and dislikes relate to a given vocation or profession. (Good, 566) This test reveals the type of work or career a person is interested in, whether business, teaching, nursing, etc.
2. Educational test. This is an achievement test which aims to measure a person’s
knowledge, skills, abilities, understanding and other outcomes in subjects taught in school.
(Good, 556-557) Examples are achievement tests in mathematics, English, etc.
B. According to Construction
2. Unstructured test. In this test, the examinee is free to respond in any way he
likes, thinks, feels, or has experienced and there are no incorrect answers.
Examples are projective tests. These are also called unrestricted tests because
there are no restrictions imposed.
C. According to the Manner of Administration
1. Individual test. This test is administered to only one person at a time. Examples are personality tests that can be given to only one person at a time.
2. Group test. This is a test that can be given to more than one person at a time.
Intelligence tests are usually given to several persons at a time.
D. According to the Degree to Which Words Are Used in Test Items and in Pupil
Responses
1. Verbal test. A verbal test is of the paper-and-pencil test variety but questions may
be presented orally or in written form or objects may be presented for
identification. The answers, however, are given in words usually written but
sometimes given orally.
2. Nonverbal test. This is a test in which a minimum amount of language is used. The test, composed mostly of symbols, may be written or given orally, but the answers are given solely in numbers, graphical representations, or three-dimensional objects or materials. Some intelligence tests are nonverbal, and they are used with people with language difficulty.
3. Performance test. This test is also nonverbal, but the pupils may be required to use paper and pencil for responding, or to manipulate physical objects and materials. An example of this test is the arrangement of blocks. This is also used with persons with language difficulty.
2. Scaled tests. This is a test in which the items are of different difficulty and are arranged from easy to difficult. Examples are power tests. The process of determining the difficulty of test items and arranging them in an ascending order of difficulty is called scaling. "A scale is a series of objective samples or products of different difficulty or quality that have been arranged in a definite order, or position, usually in ascending order of difficulty or quality."
3. Standard tests have norms with which test results are compared and given meaning. Hence, interpretation of test results is easy.
4. Standard tests can be used again and again, provided they are not given to the same group twice; their validity and reliability will be affected by the effect of practice if they are given again and again to the same group.
5. Standard tests provide a comprehensive coverage of the basic knowledge, skills,
abilities and other traits that are generally considered as essential.
1. Since standard tests are for general use, their contents may not fully correspond to the expected outcomes of the instructional objectives of a particular school, subject, or course. This is especially true of standard achievement tests. Hence, very careful selection has to be done if standard tests are to be used for measurement.
2. Since standard tests are very objective, they may not be able to measure the ability to reason, explain, contrast, organize one's ideas, and the like.
3. Standard tests of the right kind for a purpose may be very scarce and hard to find.
TEACHER-MADE EXAMINATIONS
1. Oral examination. These are tests in which the answers are given in spoken words. The questions may be given in spoken words or in writing. Examples are oral recitations. (Good, 562) Another example is the oral defense of a thesis or dissertation in graduate studies.
2. Written examination. These are tests in which the answers are given in writing. The
questions may be given orally or in writing. Examples are essay and objective
examinations.
3. Performance examinations. These are examinations in which the responses are given by means of overt actions. (Good, 562) Examples are calisthenics in physical education, marching and assembling a gun in military training, planing in woodworking, making a dress, etc. The questions may be given orally or in writing.
• Where do I begin?
— Begin with your objectives: what did you want the students to know or be able to do in each of the lessons?
The teacher normally prepares a draft of the test. Such a draft is subjected to item
analysis and validation in order to ensure that the final version of the test would be
useful and functional. First, the teacher tries out the draft test on a group of students with characteristics similar to those of the intended test takers (the try-out phase). From the try-out
group, each item will be analyzed in terms of its ability to discriminate between those
who know and those who do not know and also its level of difficulty (item analysis
phase). The item analysis will provide information that will allow the teacher to
decide whether to revise or replace an item (item revision phase). Then, finally, the
final draft of the test is subjected to validation if the intent is to make use of the test as
a standard test for the particular unit or grading period. We shall be concerned with
these concepts in this Chapter.
There are two important characteristics of an item that will be of interest to the
teacher. These are: (a) item difficulty, and (b) discrimination index. We shall learn how
to measure these characteristics and apply our knowledge in making a decision about the
item in question.
The difficulty of an item is defined as the number of students who are able to answer the item correctly divided by the total number of students. Thus:
Item difficulty = number of students with correct answer / total number of students
The item difficulty is usually expressed as a percentage.
Example: What is the item difficulty index of an item if 25 out of 100 students are unable to answer it correctly?
Here, 75 of the 100 students answered the item correctly; hence, the difficulty index is 75/100 or 75%.
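The computation in the example above can be reproduced in a few lines of Python; the figures (25 of 100 students answering incorrectly) are taken from the example.

# Minimal sketch: item difficulty index = proportion of students answering correctly.
def difficulty_index(correct, total):
    return correct / total

total_students = 100
wrong = 25
correct = total_students - wrong
print(f"Difficulty index: {difficulty_index(correct, total_students):.0%}")  # 75%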
One problem with this type of difficulty index is that it may not actually indicate that
the item is difficult (or easy). A student who does not know the subject matter will
naturally be unable to answer the item correctly even if the question is easy. How do we
decide on the basis of this index whether the item is too difficult or too easy? The
following arbitrary rule is often used in the literature:
Difficult items tend to discriminate between those who know and those who do not
know the answer. Conversely, easy items cannot discriminate between these two groups
of students. We are therefore interested in deriving a measure that will tell us whether an
item can discriminate between these two groups of students. Such a measure is called an
index of discrimination.
An easy way to derive such a measure is to compare how difficult an item is for those in the upper 25% of the class with how difficult it is for those in the lower 25% of the class. If the upper 25% of the class found the item easy yet the lower 25% found it difficult, then the item can discriminate properly between these two groups. Thus:
Index of discrimination = DU – DL
where DU is the difficulty index computed for the upper 25% of the class and DL is the difficulty index computed for the lower 25%.
Example: obtain the index of discrimination of an item if the upper 25% of the class
had a difficulty index of 0.60 (i.e. 60% of the upper 25% got the correct answer) while
the lower 25% of the class had a difficulty index of 0.20.
Here, DU = 0.60 while DL = 0.20, thus index of discrimination = .60 - .20 = .40.
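The same computation can be written as a short Python function; the values below reproduce the example just given (DU = 0.60, DL = 0.20).

# Minimal sketch: index of discrimination = DU - DL,
# where DU and DL are the difficulty indices of the upper and lower 25% of the class.
def discrimination_index(du, dl):
    return du - dl

du, dl = 0.60, 0.20
print(f"Index of discrimination: {discrimination_index(du, dl):.2f}")  # 0.40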
Theoretically, the index of discrimination can range from -1.0 (when DU = 0 and DL = 1) to 1.0 (when DU = 1 and DL = 0). When the index of discrimination is equal to -1.0, this means that all of the lower 25% of the students got the correct answer while all of the upper 25% got the wrong answer. In a sense, such an item separates the two groups, but in the wrong direction, and the item itself is highly questionable. Why should the bright ones get the wrong answer and the poor ones get the right answer? On the other hand, if the index of discrimination is 1.0, this means that all of the upper 25% got the correct answer while all of the lower 25% failed to get it. This is a perfectly discriminating item and is the ideal item that should be included in the test. From these discussions, let us agree to discard or revise all items that have a negative discrimination index, for even though they separate the upper and lower 25% of the class, the content of the item itself may be highly dubious. As in the case of the index of difficulty, we have the following rule of thumb:
Example: Consider a multiple choice type of test of which the following data were
obtained:
Item 1 (number of students choosing each option):
              A      B      C      D
Total         0     40     20     20
Upper 25%     0     15      5      0
Lower 25%     0      5     10      5
The correct response is B. Let us compute the difficulty index and the index of discrimination:
Difficulty Index = no. of students getting correct response/total
= 40/100 = 40%, within range of a “good item”
The discrimination index can similarly be computed:
DU = no. of students in upper 25% with correct response/no. of students in the upper
25%
= 15/20 = .75 or 75%
DL = no. of students in lower 25% with the correct response/no. of students in the lower
25%
= 5/20 = .25 or 25%
Discrimination Index = DU – DL = .75 - .25 = .50 or 50%.
It is also instructive to note that the distracter A is not an effective distracter since this
was never selected by the students. Distracters C and D appear to have good appeal as
distracters.
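The whole multiple-choice example can likewise be worked through in code. The Python sketch below takes the option counts from the table, computes the difficulty and discrimination indices, and flags distracters that were never chosen. The class size of 100 and the group size of 20 follow the figures used in the text.

# Minimal sketch: item analysis for the multiple-choice item discussed above.
# Option counts are taken from the example table; the key (correct option) is "B".
item = {
    "key": "B",
    "total":    {"A": 0, "B": 40, "C": 20, "D": 20},  # whole class
    "upper_25": {"A": 0, "B": 15, "C": 5,  "D": 0},
    "lower_25": {"A": 0, "B": 5,  "C": 10, "D": 5},
}
class_size = 100   # total number of examinees, as used in the text
group_size = 20    # number of students in each of the upper and lower groups

key = item["key"]
difficulty = item["total"][key] / class_size
du = item["upper_25"][key] / group_size
dl = item["lower_25"][key] / group_size
discrimination = du - dl

# Distracters (wrong options) that nobody chose add nothing to the item.
unused = [opt for opt, n in item["total"].items() if opt != key and n == 0]

print(f"Difficulty index: {difficulty:.0%}")          # 40%
print(f"Discrimination index: {discrimination:.2f}")  # 0.50
print(f"Ineffective distracters: {unused}")           # ['A']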
ITEM ANALYSIS
In a normal classroom situation, test papers are usually returned to students to give them feedback about their standing in the test and their performance in the lessons covered. Sometimes, upon or after returning the test papers, a teacher explains the answers to the more difficult items, or to the entire test, to review and intensify the learning of the students.
However, while he expresses surprise over the students' inability to answer either difficult or easy questions, he sometimes fails to consider the nature of the items on the basis of the entire class's performance. He rarely goes into the actual percentage of the class that got an item right, or into whether the item discriminates between the bright and the poor students.
Difficulty Index
Difficulty index refers to the proportion of the students in the upper and lower groups who answered an item correctly. Therefore, the difficulty index of an item may be obtained by adding the proportions in the upper and lower groups who got the item right and dividing the sum by 2.
Table 1 below shows the index of difficulty of an item. This table should help the teacher in classifying items from the easiest to the most difficult ones.
TABLE 1
Another criterion that indicates the acceptability of an item is its discriminating power. Usually, a good item properly discriminates the bright students from the poor ones. To determine this, a discrimination index is computed.
The discrimination index refers to the proportion of the students in the upper group who got an item right minus the proportion of students in the lower group who got the item right.
Table 2
Index of Discrimination of an Item
Index range Discrimination of an item
Below 0.10 Questionable item
0.11-0.20 Not discriminating
0.21-0.30 Moderately discriminating
0.31-0.40 Discriminating
0.41-1.00 Very discriminating
Because the discrimination index reflects the degree to which an item and the test as a
whole are measuring a unitary ability or attribute, values of the coefficient will tend to be
lower for test measuring a wide range of content areas than for more homogeneous tests.
Item discrimination indices must always be interpreted in the context of the type of test
which is being analyzed. Items with low discrimination indices are often ambiguously
worded and should be examined. Items with negative indices should be examined to
determine why a negative value was obtained. For example, a negative value may
indicate that the item was mis-keyed, so that students who knew the material tended to
choose an unkeyed, but correct, response option.
Tests with high internal consistency consist of items with mostly positive
relationships with total test score. In practice, values of the discrimination index will
seldom exceed .50 because of the differing shapes of item and total score distributions.
ScorePak classifies item discrimination as “good” if the index is above .30; “fair” if it is
between .10 and .30; and “poor” if it is below .10.
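A small helper like the one below could apply the ScorePak cut-offs quoted above to a batch of items; the discrimination values in the example list are hypothetical.

# Minimal sketch: labelling discrimination indices with the ScorePak cut-offs
# quoted above (good above .30, fair between .10 and .30, poor below .10).
def classify_discrimination(d):
    if d > 0.30:
        return "good"
    if d >= 0.10:
        return "fair"
    return "poor"  # negative values also land here and should be examined

# Hypothetical discrimination indices for five items.
for item_no, d in enumerate([0.45, 0.28, 0.05, -0.12, 0.33], start=1):
    print(f"Item {item_no}: D = {d:+.2f} -> {classify_discrimination(d)}")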
A good item is one that has good discriminating ability and a sufficient level of difficulty (neither too difficult nor too easy). In the two tables presented for the levels of difficulty and discrimination, there is a small area of intersection where the two indices will coincide (between 0.56 and 0.67), which represents the good items in a test. (Source: Office of Educational Assessment, University of Washington, USA, http://www.washington.edu/oea/services/scanning_scoring/item_analysis.html)
At the end of the Item Analysis report, test items are listed according to their degrees of difficulty (easy, medium, hard) and discrimination (good, fair, poor). These distributions provide a quick overview of the test and can be used to identify items which are not performing well and which can perhaps be improved or discarded.
Summary
Index of Difficulty
P = (Ru + RL) / T x 100
Where:
P – percentage who answered the item correctly (index of difficulty)
Ru – the number in the upper group who answered the item correctly
RL – the number in the lower group who answered the item correctly
T – the total number who tried the item
Index of Discrimination
D = (Ru – RL) / (½T)
Example: D = (Ru – RL) / (½T) = (6 – 2) / 10 = 0.40
The discriminating power of an item is reported as a decimal fraction; maximum discriminating power is indicated by an index of 1.00. Maximum discrimination is usually found at the 50 percent level of difficulty.
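Translating the two summary formulas into Python makes the worked figures easy to check. The counts below (Ru = 6, RL = 2, and T = 20 students who tried the item) match the discrimination example above; the difficulty index was not worked out in the text but follows from the same counts.

# Minimal sketch of the two summary formulas:
#   index of difficulty       P = (Ru + RL) / T * 100
#   index of discrimination   D = (Ru - RL) / (T / 2)
def index_of_difficulty(ru, rl, t):
    return (ru + rl) / t * 100

def index_of_discrimination(ru, rl, t):
    return (ru - rl) / (t / 2)

ru, rl, t = 6, 2, 20   # values from the worked example above
print(f"P = {index_of_difficulty(ru, rl, t):.0f}%")     # 40%
print(f"D = {index_of_discrimination(ru, rl, t):.2f}")  # 0.40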
Validation
After performing the item analysis and revising the items which need revision, the
next step is to validate the instrument. The purpose of validation is to determine the
characteristics of the whole test itself, namely, the validity and reliability of the test.
Validation is the process of collecting and analyzing evidence to support the
meaningfulness and usefulness of the test.
A teacher who conducts test validation might want to gather different kinds of evidence. There are essentially three main types of evidence that may be collected: content-related evidence of validity, criterion-related evidence of validity, and construct-related evidence of validity. Content-related evidence of validity refers to the content and format of the instrument. How appropriate is the content? How comprehensive is it? Does it logically get at the intended variable? How adequately does the sample of items or questions represent the content to be assessed?
Suppose that a mathematics achievement test is constructed and the scores are categorized as high, average, and low. The criterion measure used is the final average grade of the students in high school: Very Good, Good, and Needs Improvement. The two-way table lists the number of students falling under each of the possible pairs of (test category, grade) as shown below:
Test score    Very Good    Good    Needs Improvement
High              20         10            5
Average           10         25            5
Low                1         10           14
The expectancy table shows that there were 20 students who got high test scores and were subsequently rated Very Good in terms of their final grades; 25 students got average scores and were subsequently rated Good in their finals; and, finally, 14 students obtained low test scores and were later graded as Needs Improvement. The evidence for this particular test tends to indicate that students getting high scores on it would later be graded Very Good; students getting average scores would later be rated Good; and students getting low scores on the test would later be graded as Needs Improvement.
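An expectancy table of this kind is simply a cross-tabulation of two categorical variables. The Python sketch below builds such a table from a small set of hypothetical (test category, final grade) pairs; with real data, the pairs would come from the class records.

# Minimal sketch: building an expectancy table as a cross-tabulation of
# (test-score category, final-grade category) pairs. The pairs below are a small
# hypothetical sample, not the 100 students of the example above.
from collections import Counter

pairs = [
    ("High", "Very Good"), ("High", "Very Good"), ("High", "Good"),
    ("Average", "Good"), ("Average", "Good"), ("Average", "Needs Improvement"),
    ("Low", "Needs Improvement"), ("Low", "Needs Improvement"), ("Low", "Good"),
]

counts = Counter(pairs)
rows = ["High", "Average", "Low"]
cols = ["Very Good", "Good", "Needs Improvement"]

print(f"{'':<10}" + "".join(f"{c:>20}" for c in cols))
for r in rows:
    print(f"{r:<10}" + "".join(f"{counts[(r, c)]:>20}" for c in cols))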
We will not be able to discuss the measurement of construct-related validity in this book, since the methods to be used require sophisticated statistical techniques falling in the category of factor analysis.
Reliability
Reliability refers to the consistency of the scores obtained: how consistent they are for each individual from one administration of an instrument to another and from one set of items to another. We have already given the formulas for computing the reliability of a test; for internal consistency, for instance, we could use the split-half method or the Kuder-Richardson formulas (KR-20 or KR-21).
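For reference, the following Python sketch computes KR-20 and KR-21 from a small matrix of dichotomously scored (right/wrong) item responses. The response matrix is hypothetical; the population variance of the total scores is used, which is a common convention, although some texts use the sample variance instead.

# Minimal sketch: Kuder-Richardson reliability estimates for 0/1-scored items.
# KR-20 uses per-item proportions correct; KR-21 assumes items of equal difficulty.
from statistics import pvariance, mean

def kr20(responses):
    """responses: list of per-student lists of 0/1 item scores, all the same length."""
    k = len(responses[0])                    # number of items
    totals = [sum(r) for r in responses]     # total score per student
    var_total = pvariance(totals)
    sum_pq = 0.0
    for i in range(k):
        p = mean(r[i] for r in responses)    # proportion answering item i correctly
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

def kr21(responses):
    k = len(responses[0])
    totals = [sum(r) for r in responses]
    m, var_total = mean(totals), pvariance(totals)
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * var_total))

# Hypothetical responses of six students to five items (1 = correct, 0 = wrong).
data = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
print(f"KR-20: {kr20(data):.2f}")
print(f"KR-21: {kr21(data):.2f}")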
Reliability Interpretation
.70 - .80 Good for a classroom test; in the range of most. There are probably a few items which could be improved.
1. Write questions and use formats that match the developmental levels of
your students.
2. Clearly describe what you expect the students to know for the test and
test over that information.
3. Create questions that match the content covered and the objectives
identified for your content.
5. Let older students know how you will be grading - grading scale, etc.
6. Balance tests with both objective and subjective items in order to meet
the various learning styles.
7. Include questions that require the use of both higher and lower level
thinking.
10. Return graded tests within one week's time so students can benefit from the feedback.
12. Give clear directions on how to take the various parts of the test.
13. Identify the point value of the various items so students have an idea of
how much time or effort to place on the various questions.
14. Stick with positive statements rather than “Which of the following is
NOT...”
15. With multiple choice items limit choices to about 3 for elementary
students and no more than 4 for secondary students.
1. Judge students’ performance against their own knowledge and not against their peers.
2. Use tests for improvement, for feedback to students, so they can know what their problems are and improve accordingly (formative rather than summative).
3. Use tests also to evaluate our own teaching, so we can find out what we had not
taught well and improve our teaching accordingly.
5. Test to improve knowledge and skills and not just for judgment.
6. Train teachers to ask students to explain what they meant by a given answer.
10. Use multiple testing methods to tap knowledge of a given skill (multiple-choice,
open-ended, true-false, etc.)
11. Test the skill directly and not via another skill.
12. Return the test with meaningful feedback and not just with a numerical score.
14. When students fail a test, let us not rule out the possibility that the teaching was bad or that the test was poorly constructed.
18. Remember to return the tests to the students in a short time after they were
administered.