CA 2 Assessment in Learning
ASSESSMENT IN LEARNING 1
o The Classical Test Theory (CTT)
Known as the true score theory.
Explains that variation in the performance of the examinees on a given measure is due to variation in
their abilities.
CTT also assumes that an examinee’s observed score in a given measurement is the sum of the
examinee’s true score and some degree of error in the measurement
Provides an estimation of an item's difficulty based on the frequency or number of examinees who
correctly answer the item.
Items answered correctly by fewer examinees are considered more difficult.
It provides an estimation of item discrimination based on whether examinees of higher or lower ability
answer a particular item correctly.
If an item can distinguish between examinees with higher ability (higher total test score) and lower
ability (lower total test score), then the item is considered to have good discrimination.
Test reliability can also be estimated using approaches from CTT (e.g. Kuder-Richardson 20,
Cronbach’s alpha). Item analysis based on CTT has been the dominant approach because of the
simplicity of calculating the statistics (e.g. item difficulty index, item discrimination index, item–total
correlation)
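The two CTT indices above can be computed directly from a 0/1-scored response matrix. The sketch below is illustrative only; the data, the 27% upper-lower split, and the function names are assumptions, not from the source.

```python
# Sketch of CTT item analysis on right/wrong (1/0) item scores.
# Toy data and helper names are illustrative, not from the source.

def item_difficulty(responses):
    """Proportion of examinees answering the item correctly (p-value).
    A lower p means a more difficult item."""
    return sum(responses) / len(responses)

def item_discrimination(item_scores, total_scores):
    """Upper-lower index: p(upper 27%) minus p(lower 27%),
    grouping examinees by their total test score."""
    n = len(total_scores)
    k = max(1, round(0.27 * n))
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_scores[i] for i in upper) / k
    p_lower = sum(item_scores[i] for i in lower) / k
    return p_upper - p_lower

# Hypothetical: 10 examinees' scores on one item, plus their totals.
item = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
totals = [50, 48, 45, 20, 44, 22, 47, 25, 21, 43]
print(item_difficulty(item))             # 0.6
print(item_discrimination(item, totals)) # 1.0
```

Here the item is answered correctly by all of the top scorers and none of the bottom scorers, so its discrimination index is at the maximum of 1.0.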
o The Item Response Theory (IRT)
Analyses test items by estimating the probability that an examinee answers an item correctly or
incorrectly.
It is assumed that the characteristics of an item can be estimated independently of the characteristics or
ability of the examinee and vice-versa.
It provides significantly more information on items and tests.
There are also different IRT Models (e.g. one-parameter model, three-parameter model)
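The IRT models named above can be sketched as probability functions. The formulas below are the standard one-parameter (Rasch) and three-parameter logistic models; the parameter values used are made-up examples.

```python
import math

def p_correct_1pl(theta, b):
    """One-parameter (Rasch) model: probability that an examinee of
    ability theta answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def p_correct_3pl(theta, a, b, c):
    """Three-parameter model: adds discrimination a and a
    lower asymptote c (pseudo-guessing)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# When ability equals item difficulty, the 1PL probability is 0.5.
print(p_correct_1pl(theta=0.0, b=0.0))            # 0.5
# With a guessing parameter of 0.2, the same examinee sits at 0.6.
print(p_correct_3pl(theta=0.0, a=1.0, b=0.0, c=0.2))
```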
What are the Different Types of Assessment in Learning?
- The most common types are:
o Formative Assessment
Refers to assessment activities that provide information to both teachers and learners on how they can
improve the teaching-learning process.
It is used at the beginning of and during instruction to assess the learners' understanding.
o Summative Assessment
Provides information on the quantity or quality of what students learned or achieved at the end of
instruction.
It also informs teachers of the effectiveness of their teaching strategies and how they can improve
their instruction in the future.
o Diagnostic Assessment
Aims to detect the learning problems or difficulties of the learners.
It can be done right after seeing signs of learning problems in teaching.
It can be also done at the beginning of the school year for a spirally-designed curriculum so that
corrective actions are applied.
o Placement Assessment
Is usually done at the beginning of the school year to determine what the learners already know and
what their needs are, which could inform the design of instruction.
The entrance examination given in schools is an example.
o Traditional Assessment
Use of conventional strategies or tools that provide information about the learning of students.
Examples: - Multiple Choice
- Essay Test
- Paper-Pencil Test
o Authentic Assessment
Assessment strategies or tools that allow learners to perform or create a meaningful product.
The authenticity of the assessment task is best described in terms of degree rather than the presence of
authenticity.
Allows performance that most closely resembles the real world.
What are the Different Principles in Assessing Learning?
The Core Principles in Assessing Learning:
1. Assessment should have a clear purpose.
o The assessment method chosen should serve a clear purpose.
o The interpretation of the data collected should be aligned with the purpose that is set.
o The assessment principle is congruent with the Outcomes-Based Education (OBE) principles of clarity of focus
and design down.
2. Assessment is NOT an end in itself.
o It serves as a means to enhance student learning.
o Collecting information about student learning, whether formative or summative, should lead to decisions that
will allow the improvement of the learners.
3. Assessment is an ongoing, continuous, and formative process
o Series of tasks and activities over time.
o Continuous feedback.
o Congruent with the OBE principle of expanded opportunity.
4. Assessment is Learner-Centered.
o Assessment of learners provides teachers with an understanding of how they can improve their teaching.
5. Assessment is both process and product-oriented.
o Gives equal importance to learner performance and the process learners engage in to perform or produce a product.
6. Assessment must be comprehensive and holistic.
o Assessment should be conducted in multiple periods to assess learning over time.
o It is congruent with the OBE principle of expanded opportunity.
7. Assessment requires the use of appropriate measures.
o Assessment tools must possess sound psychometric properties, including, but not limited to, validity and reliability.
8. Assessment should be as authentic as possible.
o Assessment tasks should range from the least authentic to the most authentic expected of a learner.
Affective Measure – personality, motivation, attitude, interest and disposition
These are processed by the school's guidance counselor to design interventions for the learners' academic, career,
and social-emotional development.
Why do we use Paper-and-Pencil and Performance-based types of assessments?
Paper-and-pencil assessments are cognitive tasks that require a single correct answer.
e.g. - Binary items (true or false)
- Short answer: Identification, Matching Type, Multiple Choice
- The items usually pertain to a specific cognitive skill: RU, AA, EC
Other examples of paper-and-pencil assessment:
o Identify the parts of the plants
o Label the parts of the microscope
o Complete the compound interest
o Classify the phase of matter
o Provide an appropriate verb in a sentence
o Identify the type of sentence
Performance-based assessments require the learner to perform; the skills applied are usually complex and
require integrated skills to arrive at a target response.
e.g. - an essay
- reporting in front of a class
- reciting a poem
- problem-solving
- creating a word problem
- a demonstration
- arriving at a product
- presenting information
Below are learning targets that need performance-based assessment:
o Varnish a wooden cabinet
o Draw a landscape using a paintbrush tool on the computer
o Solve a word problem involving multiplication of polynomials
o Deliver a speech
o Write an essay explaining how humans and plants benefit from each other
o Mount a plant specimen on a glass slide
How do we Distinguish Teacher-made from Standardized Tests?
Standardized Test – has fixed directions for administering and scoring
o Can be purchased with test manuals, booklets, and answer sheets.
o It was developed using a large sample of the target group, called the norm group, which is used to compare
the results of those who take the test.
e.g.
- Intelligence Test - Critical Thinking Test
- Achievement Test - Interest Test
- Aptitude Test - Personality Test
Teacher-made Test
o Non-standardized intended for classroom assessment
e.g. - quizzes, long tests, exams
- formative and summative test
* Can a teacher-made test become a standardized test? Yes
What Information is sought from the Achievement and Aptitude Test?
Achievement Test
o Measure what learners have learned after instruction.
o A measure of what a person has learned within a given time (Yaremko et al., 1982)
o A measure of accomplished skills (Atkinson, 1995; Kimball, 1989), reflecting both the traditional and
alternative views on the achievement of learners.
o Examples include the Wide Range Achievement Test, the California Achievement Test, and the Iowa Test of
Basic Skills.
Aptitude Test
o According to Longman (2005), Aptitudes are the characteristics that influence a person’s behavior that aid
goal attainment in a particular situation.
o It refers to the degree of readiness to learn and perform (Corno et al., 2002)
e.g.
Ability to comprehend instruction
Manage one’s time
Use previously acquired knowledge appropriately.
Make good inferences and generalizations
Manage one’s emotion
How do we Differentiate Speed from Power Test?
Speed Test – consists of easy items that need to be completed within a time limit
e.g. Typing Test – type as many words as possible within a limited amount of time
Power Test – consists of items with increasing levels of difficulty
e.g. mathematics tests developed by the National Council of Teachers of Mathematics
The Difference between a Norm-Reference from Criterion-Reference Test
There are two types of tests based on how the scores are interpreted:
Norm-referenced Test interprets results using the distribution of scores of a sample group.
The interpretation is based on the mean and standard deviation of the sample.
A norm is a standard based on a very large sample of examinees.
The distribution typically takes the shape of a bell curve, and the norm reports the percentage of
people with a particular score.
The norm serves as the basis for interpreting an individual test score.
Criterion-referenced Test has a given set of standards. The scores are compared to the given criterion.
e.g. - 50 Item Test:
- 40-50 Very High
- 30-39 High
- 20-29 Average
- 10-19 Low
- 0-9 Very Low
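A criterion-referenced interpretation like the one above is just a lookup against fixed cut-offs. A minimal sketch for the 50-item example (the function name is illustrative):

```python
def interpret_score(score):
    """Map a raw score on the 50-item test to the criterion bands:
    40-50 Very High, 30-39 High, 20-29 Average, 10-19 Low, 0-9 Very Low."""
    bands = [(40, "Very High"), (30, "High"), (20, "Average"),
             (10, "Low"), (0, "Very Low")]
    for cutoff, label in bands:
        if score >= cutoff:
            return label

print(interpret_score(42))  # Very High
print(interpret_score(18))  # Low
```

Unlike a norm-referenced interpretation, the result does not depend on how other examinees scored.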
Faulty: What is an ecosystem?
a. It is a community of living organisms in conjunction with the non-living components of their environment
that interact as a system. These biotic and abiotic components are linked together through nutrient cycles
and energy flows.
b. It is a place on Earth’s surface where life dwells.
c. It is an area where one or more individual organisms defend against competition from other organisms.
d. It is the biotic and abiotic surroundings of an organism or population.
e. It is the largest division of the Earth’s surface filled with living organisms.
Good: What is an ecosystem?
a. It is a place on the Earth’s surface where life dwells.
b. It is the biotic and abiotic surroundings of an organism or population.
c. It is the largest division of the Earth’s surface filled with living organisms.
d. It is a large community of living and non-living organisms in a particular area.
e. It is an area where one or more individual organisms defend against competition from other organisms.
3. Place options in a logical order (e.g. alphabetical, from shortest to longest).
Faulty: Which experimental gas law describes how the pressure of a gas tends to increase as the volume of the
container decreases (i.e., “The absolute pressure exerted by a given mass of an ideal gas is inversely
proportional to the volume it occupies.”)
a. Boyle’s Law d. Avogadro’s Law
b. Charles Law e. Faraday’s Law
c. Beer-Lambert Law
Good: Which experimental gas law describes how the pressure of a gas tends to increase as the volume of the
container decreases? (i.e., “The absolute pressure exerted by a given mass of an ideal gas is inversely
proportional to the volume it occupies.”)
a. Avogadro’s Law d. Charles Law
b. Beer-Lambert Law e. Faraday’s Law
c. Boyle’s Law
4. Place correct responses randomly to avoid a discernible pattern of correct answers.
5. Use None-of-the-above carefully and only when there is one absolutely correct answer, such as in spelling or math
items.
Faulty: Which of the following is a nonparametric statistic?
a. ANCOVA c. T-test
b. ANOVA d. None of the above
Good: Which of the following is a nonparametric statistic?
a. ANCOVA d. Mann-Whitney U
b. ANOVA e. T-test
c. Correlation
6. Avoid All of the Above as an option, especially if it is intended to be the correct answer.
Faulty: Who among the following has become the President of the Philippine Senate?
a. Ferdinand Marcos d. Quintin Paredes
b. Manuel Quezon e. All of the above
c. Manuel Roxas
Good: Who was the first ever President of the Philippine Senate?
a. Eulogio Rodriguez d. Manuel Roxas
b. Ferdinand Marcos e. Quintin Paredes
c. Manuel Quezon
7. Make all options realistic and reasonable.
General Guidelines in Writing Matching-Type Items
1. Clearly state in the directions the basis for matching the stimuli with responses.
Faulty: Directions: Match the following.
Good: Directions: Column I is a list of countries while Column II presents the continent where these countries are
located. Write the letter of the continent corresponding to the country on the line provided in Column I.
Item #1’s instruction is less preferred as it does not detail the basis for matching the stem and the response options.
2. Ensure that the stimuli are longer and the responses are shorter.
Faulty: Match the description of the flag to its country.
A B
Bangladesh A. Green background with a red circle in the center.
Indonesia B. One red strip on top and a white strip at the bottom.
Japan C. Red background with a white five-petal flower in the center.
Singapore D. Red background with a large yellow circle in the center
Thailand E. Red background with a large yellow pointed star in the center.
F. White background with a large red circle in the center.
Good: Match the description of the flag to its country.
A B
Green background with a red circle in the center. A. Bangladesh
One red strip on top and a white strip at the bottom. B. Hongkong
Red background with a white five-petal flower in the center. C. Indonesia
Red background with a large yellow-pointed star in the center. D. Japan
White background with a red circle in the center. E. Singapore
F. Vietnam
Item #2 is a better version because the descriptions are presented in the first column while the response options are
in the second column. The stems are also longer than the options.
3. For each item, include only topics that are related to one another and share the same foundation of information.
Faulty: Match the following:
A B
1. Indonesia A. Asia
2. Malaysia B. Bangkok
3. Philippines C. Jakarta
4. Thailand D. Kuala Lumpur
5. Year ASEAN was established E. Manila
F. 1967
Good: On the line to the left of each country in Column I, write the letter of the country’s capital presented in
Column II.
Column I Column II
1. Indonesia A. Bandar Seri Begawan
2. Malaysia B. Bangkok
3. Philippines C. Jakarta
4. Thailand D. Kuala Lumpur
E. Manila
Item #1 is considered an unacceptable item because its response options are not parallel and include different kinds
of information that can provide clues to the correct/wrong answers. On the other hand, item #2 details the basis
for matching and the response options only include related concepts.
4. Make the response options short, homogenous, and arranged in logical order.
Faulty: Match the chemical elements with their characteristics.
A B
Gold A. Au
Hydrogen B. Magnetic metal used in steel
Iron C. Hg
Potassium D. K
Sodium E. With lowest density
F. Na
Good: Match the chemical elements with their symbols.
A B
Gold A. Au
Hydrogen B. Fe
Iron C. H
Potassium D. Hg
Sodium E. K
F. Na
In Item #1, the response options are not parallel in content and length, and they are not arranged in a logical order.
5. Include response options that are reasonable and realistic and similar in length and grammatical form.
Faulty: Match the subjects with their course description
A B
History A. Studies the production and distribution of goods/services
Political Science B. Study of politics and power
Psychology C. Study of Society
Sociology D. Understands the role of mental functions in social behavior
E. Uses narratives to examine and analyze past events
Good: Match the subjects with their course description.
A B
1. Study of living things A. Biology
2. Study of mind and behavior B. History
3. Study of politics and power C. Political Science
4. Study of recorded events in the past D. Psychology
5. Study of Society E. Sociology
F. Zoology
Item #1 is less preferred because the response options are not consistent in terms of their length and grammatical
form.
6. Provide more response options than the number of stimuli.
Faulty: Match the following fractions with their corresponding decimal equivalents:
A B
1/4 A. 0.25
5/4 B. 0.28
7/25 C. 0.90
9/10 D. 1.25
Good: Match the following fractions with their corresponding decimal equivalents:
A B
1/4 A. 0.09
5/4 B. 0.25
7/25 C. 0.28
9/10 D. 0.90
E. 1.25
Item #1 is considered inferior to item #2 because it includes the same number of response options as that of the
stimuli, thus making it more prone to guessing.
General Guidelines in Writing True or False Items
True or false items are best used when a learner’s ability to judge or evaluate is one of the desired learning outcomes
of the course.
Variation
1. T-F Correction or Modified True-or-False Question. In this format, the statement is presented with a keyword
or phrase that is underlined, and the learner has to supply the correct word or phrase.
e.g. Multiple-Choice Test is authentic.
2. Yes-No Variation. In this format, the learner has to choose yes or no, rather than true or false.
e.g. The following are kinds of tests. Circle Yes if it is an authentic test and No if not.
Multiple Choice Test Yes No
Debates Yes No
End-of-the-Term Project Yes No
True or False Test Yes No
3. A-B Variation. In this format, the learner has to choose A or B, rather than true or false.
e.g. Indicate which of the following are traditional or authentic tests by circling A if it is a traditional test and
B if it is authentic.
Traditional Authentic
Multiple Choice Test A B
Debates A B
End-of-the-Term Project A B
True or False Test A B
Because true or false test items are prone to guessing, as learners are asked to choose between two options, utmost
care should be exercised in writing true or false items.
1. Include statements that are completely true or completely false.
Faulty: The presidential system of government, where the president is only the head of state or government, is
adopted by the United States, Chile, Panama, and South Korea.
Good: The presidential system, where the president is only the head of state or government, is adopted by Chile.
Item #1 is of poor quality because, while the description is right, the countries given are not all correct. While
South Korea has a presidential system of government, it also has a prime minister who governs alongside
the president.
2. Use simple and easy-to-understand statements.
Faulty: Education is a continuous process of higher adjustment for human beings who have evolved physically and
mentally, which is free and conscious of God, as manifested in nature around the intellectual emotional,
and humanity of man.
Good: Education is the process of facilitating learning or the acquisition of knowledge, skills, values, beliefs, and
habits.
Item #1 is somewhat confusing, especially for younger learners because there are many ideas in one statement.
3. Refrain from using negatives – especially double negatives.
Faulty: There is nothing illegal about buying goods through the internet.
Good: It is legal to buy things or goods through the internet.
Double negatives are sometimes confusing and could result in wrong answers, not because the learner does not
know the answer but because of how the test items are presented.
4. Avoid using absolutes such as “always” and “never”.
Faulty: The news and information posted on the CNN website is always accurate.
Good: The news and information posted on the CNN website is usually accurate.
Absolute words such as “always” and “never” restrict possibilities and make a statement apply 100 percent of the
time. They are also a hint for a “false” answer.
5. Express a single idea in each test item.
Faulty: If an object is accelerating, a net force must be acting on it, and the acceleration of an object is directly
proportional to the net force applied to the object.
Good: If an object is accelerating, a net force must be acting on it.
Item #1 combines two different ideas in a single statement, making it difficult to judge as entirely true or false.
Item #1 is prone to many and varied answers. For example, a student may answer the question based on the capital
of these countries or based on what continent they are located. Item #2 is preferred because it is more specific and
requires only one correct answer.
3. Avoid obvious clues to the correct response.
Faulty: Ferdinand Marcos declared martial law in 1972. Who was the president during the period?
Good: The president during the martial law years was .
Item #1 already gives a clue that Ferdinand Marcos was the president during this time because only the president
of a country can declare martial law.
4. Be sure that there is only one correct response.
Faulty: The government should start using renewable energy sources for generating electricity, such as
.
Good: The government should start using renewable resources of energy by using turbines called
.
Item #1 has many possible answers because the statement is very general (e.g., wind, solar, biomass, geothermal,
and hydroelectric). Item #2 is more specific and only requires one correct answer (i.e. wind).
5. Avoid grammatical clues to the correct response.
Faulty: A subatomic particle with a negative electric charge is called an .
Good: A subatomic particle with a negative electric charge is called a(n) .
The word “an” in item #1 provides a clue that the correct answer starts with a vowel.
6. If possible, put the blank at the end of a statement rather than at the beginning.
Faulty: is the basic building block of matter.
Good: The basic building block of matter is .
In item #1, learners may need to read the sentence to the end before they can recognize the problem, then re-read
it before answering. In item #2, learners can already identify the context of the problem by reading through the
sentence only once, without having to go back and re-read it.
General Guidelines in Writing Essay Test
Teachers generally choose and employ essay items over other forms of assessment.
Essays are the most preferred form of assessment for measuring learners' higher-order thinking skills:
o Understanding of the subject matter content
o Ability to reason with their knowledge of the subject
o Problem-solving and decision-making skills
There are two types of essay test:
1. Extended-Response Essay – requires a much longer, less constrained response.
2. Restricted-Response Essay – more focused, with limits on the scope of the answer.
The following are the general guidelines for constructing good essay questions.
1. Clearly define the intended learning outcome to be assessed by the essay test.
2. Refrain from using essay tests for intended learning outcomes that are better assessed by other kinds of assessment.
3. Clearly define and situate the task within a problem.
4. Present tasks that are fair, reasonable, and realistic to the students.
5. Be specific in the prompts about time allotment and criteria for grading the response.
General Guidelines in Problem-Solving Test Items
Problem-solving test items are used to measure the learner’s ability to solve problems that require quantitative
knowledge and competencies and/or critical thinking skills.
There are different variations of the quantitative problem-solving:
1. One answer choice – this type of question contains four or five options, and students are required to choose the
best answer.
e.g. What is the mean of the following score distribution: 32, 44, 56, 69, 75, 77, 95, 96?
a. 68 c. 72 e. 76
b. 69 d. 74
The correct answer is A (68).
2. All possible answer choices – This type of question has four or five options, and students are required to choose
all of the options that are correct.
e.g. Consider the following score distribution: 12,14, 14, 17, 24, 27, 28, 30. Which of the following is/are the
correct measure/s of central tendency? Indicate all possible answers.
a. Mean = 20 d. Median = 17
b. Mean = 22 e. Mode = 14
c. Median = 16
3. Type-in answer – This type of question does not provide options to choose from. Instead, the learners are asked
to supply the correct answer. The teacher should inform the learners at the start how their answers will be rated.
For example, the teacher may require just the correct answer or may require learners to present the step-by-step
procedures for coming up with their answers. On the other hand, for non-mathematical problem solving, such
as a case study, the teacher may present a rubric on how their answers will be rated.
e.g. Compute the mean of the following score distribution: 32, 44, 56, 69, 75, 77, 95, 96. Indicate your answer
in the blank provided.
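The answers to the worked examples above can be checked with Python's statistics module; a minimal sketch using the score distribution given for item 1 and item 3:

```python
import statistics

# Score distribution from the one-answer-choice and type-in examples.
scores = [32, 44, 56, 69, 75, 77, 95, 96]

mean = sum(scores) / len(scores)
print(mean)                       # 68.0 (answer A in the example above)
print(statistics.median(scores))  # 72.0
```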
3. Similarity of responses across items that measure the same characteristic.
There are different factors that affect the reliability of a measure:
o The number of items in a test – the more items a test has, the higher the likelihood of reliability.
o Individual differences of participants – fatigue, lack of concentration, perseverance, and innate ability.
o External environment – includes room temperature, noise level, exposure to the material, and quality of
instruction.
Different Ways to Establish Test Reliability
Different Ways to Establish Test Reliability
1. Test-retest
How it is done: Administer the test at one time to a group of examinees, then administer it again at another
time to the same group of examinees. Test-retest is applicable for tests that measure stable variables, such as
aptitude and psychomotor measures (e.g., a typing test, tasks in physical education).
Statistic used: Correlation – a statistical procedure where a linear relationship is expected between two
variables. You may use the Pearson Product Moment Correlation (Pearson r) because test data are usually on
an interval scale (refer to a statistics book for Pearson r).
2. Parallel Forms
How it is done: Applicable when there are two versions of the test. This is usually done when the test is
repeatedly used for different groups, such as entrance examinations and licensure examinations. Different
versions of the test are given to different groups of examinees.
Statistic used: Correlate the results of the first form and the second form. Significant and positive correlation
coefficients are expected.
3. Split-Half
How it is done: Administer the test to a group of examinees, then split the items into halves, usually using the
odd-even technique: sum the points of the odd-numbered items and correlate that sum with the sum of points
of the even-numbered items.
Statistic used: The correlation coefficient obtained using Pearson r, stepped up with the Spearman-Brown
formula, should be significant and positive to mean that the test has internal consistency reliability.
4. Test of Internal Consistency Using the Kuder-Richardson and Cronbach's Alpha Methods
How it is done: This technique works well when the assessment tool has a large number of items. It is also
applicable to scales and inventories (e.g., a Likert scale from "strongly agree" to "strongly disagree").
Statistic used: Cronbach's alpha or the Kuder-Richardson formula is used to determine the internal
consistency of the items. A Cronbach's alpha value of 0.60 and above indicates that the test items have
internal consistency.
5. Inter-rater Reliability
How it is done: Applicable when the assessment requires the use of multiple raters.
Statistic used: Kendall's coefficient of concordance is used to determine whether the ratings provided by
multiple raters agree with each other.
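The split-half procedure can be sketched in a few lines: correlate the odd-item and even-item subtotals, then step the half-test correlation up with the Spearman-Brown formula. The subtotal data below are hypothetical, not from the source.

```python
def pearson_r(x, y):
    """Pearson product-moment correlation from raw scores."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    num = n * sxy - sx * sy
    den = ((n * sxx - sx**2) * (n * syy - sy**2)) ** 0.5
    return num / den

def spearman_brown(r_half):
    """Step the half-test correlation up to full-test reliability."""
    return 2 * r_half / (1 + r_half)

# Hypothetical odd-item and even-item subtotals for 6 examinees.
odd  = [8, 6, 9, 4, 7, 5]
even = [7, 6, 9, 5, 8, 4]
r_half = pearson_r(odd, even)
print(round(spearman_brown(r_half), 2))  # 0.94
```

The Spearman-Brown step is needed because each half has only half the items, which by itself understates the reliability of the full-length test.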
The very basis of the statistical analysis used to determine reliability is linear regression.
1. Linear Regression – demonstrated when you have two measured variables, such as two sets of scores
on a test taken at two different times by the same participants.
o When the paired scores fall along a straight line, the two sets of scores are said to be correlated.
2. Computation of the Pearson r Correlation
o The index of the linear relationship is called the Correlation Coefficient.
o When the points in a scatterplot tend to fall along the linear line, the correlation is said to be strong.
o When the direction of the scatterplot is directly proportional, the correlation coefficient will have a positive
value.
o When the direction of the scatterplot is inverse, the correlation coefficient will have a negative value.
e.g. A teacher gave a 20-item spelling test of two-syllable words on Monday and again on Tuesday. The
test-retest reliability is computed using the Pearson r:
r = [N(∑XY) − (∑X)(∑Y)] / √{[N(∑X²) − (∑X)²][N(∑Y²) − (∑Y)²]}
Monday Test Tuesday Test
X Y X² Y² XY
10 20 100 400 200
9 15 81 225 135
6 12 36 144 72
10 18 100 324 180
12 19 144 361 228
4 8 16 64 32
5 7 25 49 35
7 10 49 100 70
8 13 64 169 104
∑X = 71   ∑Y = 122   ∑X² = 615   ∑Y² = 1,836   ∑XY = 1,056
∑X – Add all the X scores (Monday scores)
∑Y – Add all the Y scores (Tuesday scores)
X² – Square each of the X scores (Monday scores)
Y² – Square each of the Y scores (Tuesday scores)
XY – Multiply each X score by its paired Y score
∑X² – Add all the squared values of X
∑Y² – Add all the squared values of Y
∑XY – Add all the products of X and Y
Substitute the values in the formula (N = 9 pairs of scores):
r = [9(1,056) − (71)(122)] / √{[9(615) − (71)²][9(1,836) − (122)²]}
r = 842 / √[(494)(1,640)]
r ≈ 0.94
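Recomputing the coefficient directly from the tabulated Monday/Tuesday scores is a quick way to check the arithmetic. This is a Python sketch of the same raw-score formula, not part of the original worked example:

```python
# Test-retest reliability for the Monday/Tuesday spelling scores above.
x = [10, 9, 6, 10, 12, 4, 5, 7, 8]      # Monday scores
y = [20, 15, 12, 18, 19, 8, 7, 10, 13]  # Tuesday scores

n = len(x)                                   # 9 examinees
sx, sy = sum(x), sum(y)                      # 71, 122
sxy = sum(a * b for a, b in zip(x, y))       # 1056
sxx = sum(a * a for a in x)                  # 615
syy = sum(b * b for b in y)                  # 1836

r = (n * sxy - sx * sy) / (((n * sxx - sx**2) * (n * syy - sy**2)) ** 0.5)
print(round(r, 2))  # 0.94
```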
3. Difference Between a Positive and a Negative Correlation:
o Positive Correlation – the higher the scores in X, the higher the scores in Y.
o Negative Correlation – the higher the scores in X, the lower the scores in Y.
o When the same test is administered twice to the same group of participants, a positive correlation usually
indicates the reliability or consistency of the scores.
4. Determining the strength of a Correlation
o The strength of the correlation also indicates the strength of the reliability of the test. This is indicated by
the value of the Correlation Coefficient. The closer the value to 1.00 or -1.00, the stronger the correlation.
Below is the guide:
0.80 – 1.00 Very strong relationship
0.60 – 0.79 Strong relationship
0.40 – 0.59 Substantial / Marked relationship
0.20 – 0.39 Weak relationship
0.00 – 0.19 Negligible relationship
5. Determining the significance of the correlation.
o The correlation obtained between two variables may be due to chance.
o To determine that the correlation is not due to chance, it is tested for significance.
o Another statistical analysis used to determine the internal consistency of a test is CRONBACH'S
Alpha.
Student  Item 1  Item 2  Item 3  Item 4  Item 5  Total (X)  Score − Mean  (Score − Mean)²
A        5       5       4       4       1       19          2.8           7.84
B        3       4       3       3       2       15         −1.2           1.44
C        2       5       3       3       3       16         −0.2           0.04
D        1       4       2       3       3       13         −3.2          10.24
E        3       3       4       4       4       18          1.8           3.24
Mean of totals: X̄ = 16.2          ∑(Score − Mean)² = 22.8
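Cronbach's alpha for the five-item, five-student table above can be computed from the item variances and the variance of the total scores: alpha = k/(k − 1) × (1 − ∑item variances / total-score variance). A minimal sketch (note that this tiny toy table actually yields a very low alpha, well under the 0.60 threshold mentioned earlier):

```python
# Rows = students A-E from the table above, columns = items 1-5.
scores = [
    [5, 5, 4, 4, 1],   # Student A, total 19
    [3, 4, 3, 3, 2],   # Student B, total 15
    [2, 5, 3, 3, 3],   # Student C, total 16
    [1, 4, 2, 3, 3],   # Student D, total 13
    [3, 3, 4, 4, 4],   # Student E, total 18
]

def variance(values):
    """Sample variance (n - 1 denominator)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

k = len(scores[0])                    # number of items
items = list(zip(*scores))            # columns = per-item scores
totals = [sum(row) for row in scores] # 19, 15, 16, 13, 18

alpha = (k / (k - 1)) * (1 - sum(variance(i) for i in items) / variance(totals))
print(round(alpha, 2))  # 0.11
```

The ratio of item variance to total variance is the same whether sample or population variance is used, so the alpha value does not depend on that choice.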
Test Validity
o A measure is valid when it measures what it is supposed to measure.
o If a quarterly exam is valid, then its contents should directly measure the objectives of the curriculum.
Content Validity
Definition: The items represent the domain being measured.
Procedure: The items are compared with the objectives of the program. The items need to measure the
objectives directly (for achievement tests) or the definition (for scales). A reviewer conducts the checking.
Face Validity
Definition: The test is presented well, free of errors, and administered well.
Procedure: The test items and layout are reviewed and tried out on a small group of respondents. A manual
for administration can be made as a guide for the test administrator.
Predictive Validity
Definition: The measure should predict a future criterion. An example is an entrance exam predicting the
grades of the students after the first semester.
Procedure: A correlation coefficient is obtained where the X-variable is used as the predictor and the
Y-variable as the criterion.
Construct Validity
Definition: The components or factors of the test should contain items that are strongly correlated.
Procedure: The Pearson r can be used to correlate the items within each factor. There is also a technique
called factor analysis to determine which items are highly correlated enough to form a factor.
Concurrent Validity
Definition: Two or more measures are present for each examinee that measure the same characteristic.
Procedure: The scores on the measures should be correlated.
Convergent Validity
Definition: The components or factors of a test are hypothesized to have a positive correlation.
Procedure: Correlation is done for the factors of the test.
Divergent Validity
Definition: The components or factors of a test are hypothesized to have a negative correlation. An example
is the correlation between scores on tests of intrinsic and extrinsic motivation.
Procedure: Correlation is done for the factors of the test.
o One can distinguish the highest and lowest scores and the corresponding frequency for each score.
o The cumulative percentage in the last column calculates the percentage of the cumulative frequency.
o In the 6th row, the test score of 35 has a corresponding cumulative percentage of 13. This means that 13
percent of the class obtained a score below 35.
o Conversely, one can say that 87 percent of the scores are above 35.
Table 7.3. Frequency Distribution of Grouped Test Scores

Class Interval   Midpoint (X)   f     Cumulative       Cumulative
                                      Frequency (cf)   Percentage
75-79            77             3     100              100
70-74            72             0     97               97
65-69            67             2     97               97
60-64            62             8     95               95
55-59            57             8     87               87
50-54            52             17    79               79
45-49            47             18    62               62
40-44            42             21    44               44
35-39            37             13    23               23
30-34            32             9     10               10
25-29            27             0     1                1
20-24            22             1     1                1
Total (N)                       100
o The data presented in Tables 7.1 and 7.2 have been condensed as a result of grouping the scores.
o Table 7.3 illustrates a grouped frequency distribution of test scores.
o Consider the cumulative percentage in the row for the class interval 55-59, which is 87: we say that 87
percent of the students got a score below 60.
o In Table 7.3, the second column gives the midpoint (X) of each class interval.
o To compute the size of the class interval:

  i = (H − L) / C

  where i = size of the class interval
        H = highest test score
        L = lowest test score
        C = number of classes
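The class-interval computation can be sketched in a few lines. The values H = 79, L = 20, and C = 12 are those implied by the grouped tables in this chapter, and rounding up is a common convention so that every score fits in an interval:

```python
# Sketch of i = (H - L) / C for choosing the class-interval size.
import math

def class_interval_size(highest, lowest, classes):
    """i = (H - L) / C, rounded up so every score falls in an interval."""
    return math.ceil((highest - lowest) / classes)

i = class_interval_size(79, 20, 12)
print(i)  # → 5, matching the interval width used in Table 7.3
```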
o Transmutation Table for the Grading System.
If the total number of items is 100, then the passing mark is 50%; the computed score is converted to a grade using the transmutation table.
Transmutation Table
Graphical Presentation of Test Data
1. Histogram – a type of graph appropriate for quantitative data such as test scores.
2. Frequency Polygon – a visual representation of a distribution.
3. Cumulative Frequency Polygon – essentially a line graph drawn by plotting the actual lower or upper
limits of the class intervals on the X-axis and the respective cumulative frequencies of these class intervals on the Y-axis.
4. Bar Graph
4.1. Vertical Bar Graph – a graphical representation of data, quantities, or numbers using vertical bars or
strips.
4.2. Horizontal Bar Graph – a graph in the form of horizontal rectangular bars.
5. Pie Graph – a pie chart is easy to construct; an ordinary protractor can be used to mark off sectors proportional to each category.
Which Graph is the Best?
No one can give a definite answer to this question.
o We cannot say which graph is best in general.
o The histogram is the easiest to construct in many cases of quantitative data, but it may not be appealing if you want to
compare the performance of two or more groups.
o The bar graph works well with qualitative data and when you want to compare the performance of subgroups of
examinees.
o Frequency and percentage polygons are useful for treating quantitative data.
o The cumulative frequency and percentage polygons are valuable for determining the percentage of the
distribution that falls below or above a given point.
Formula:

  Mdn = Lower Limit of the Median Class + (Size of the Class Interval) × (N/2 − cumulative frequency below the median class) / (frequency of the median class)
Table 8.2 Frequency Distribution of Grouped Test Scores

Class Interval   Midpoint (X)   f     fX    Cumulative       Cumulative
                                            Frequency (cf)   Percentage
75-79            77             3     231   100              100
70-74            72             0     0     97               97
65-69            67             2     134   97               97
60-64            62             8     496   95               95
55-59            57             8     456   87               87
50-54            52             17    884   79               79
45-49            47             18    846   62               62
40-44            42             21    882   44               44
35-39            37             13    481   23               23
30-34            32             9     288   10               10
25-29            27             0     0     1                1
20-24            22             1     22    1                1
Total (N)                       100   ∑fX = 4720
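As a sketch, the mean (∑fX / N) and the grouped-data median formula above can be applied directly to the distribution in Table 8.2; the intervals below are those of the table, with midpoints computed as (lower + upper) / 2:

```python
# Mean and median of grouped scores, following Table 8.2.
# Each tuple is (lower limit, upper limit, frequency).
intervals = [
    (20, 24, 1), (25, 29, 0), (30, 34, 9), (35, 39, 13),
    (40, 44, 21), (45, 49, 18), (50, 54, 17), (55, 59, 8),
    (60, 64, 8), (65, 69, 2), (70, 74, 0), (75, 79, 3),
]

n = sum(f for _, _, f in intervals)                      # N = 100
mean = sum(((lo + hi) / 2) * f for lo, hi, f in intervals) / n

# Median: Mdn = LL + i * ((N/2 - cf_below) / f_median)
size = 5                 # size of each class interval
cf = 0
for lo, hi, f in intervals:
    if cf + f >= n / 2:                                  # median class found
        median = (lo - 0.5) + size * ((n / 2 - cf) / f)  # LL is the exact lower limit
        break
    cf += f

print(round(mean, 2), round(median, 2))  # → 47.2 46.17
```

The median class is 45-49 (the first interval whose cumulative frequency reaches N/2 = 50), so Mdn = 44.5 + 5 × (50 − 44)/18 ≈ 46.17.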
How is standard deviation applied in a normal distribution?
Standard Deviation – the most useful measure of variability in assessment and research.
o Normal Distribution – a symmetrical, bell-shaped distribution represented by the normal curve.
Figure 8.6 The Normal Curve
Figure 8.7 The Areas under the Normal Curve
1. The mean, median, and mode are all equal.
2. The curve is symmetrical. As such, the value in a specific area on the left is equal to the value of its
corresponding area on the right.
3. The curve changes from concave to convex and approaches the X-axis, but the tails do not touch the horizontal
axis.
4. The total area under the curve is equal to 1.
What are Normal Scores?
A score can be interpreted in terms of its placement relative to the mean and the variability of the distribution.
The raw score can be converted to a Z-score.
Z-score – the most useful score for expressing a raw score in relation to the mean and standard deviation.

  Z = (x − x̄) / s

  where x = raw score
        x̄ = mean
        s = standard deviation
        x − x̄ = deviation score (a negative deviation means the raw score is below the average)
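A minimal sketch of the Z-score formula; the raw scores, mean, and standard deviation used here are made-up values for illustration:

```python
# Z = (x - mean) / sd: express a raw score in standard-deviation
# units above or below the mean.
def z_score(raw, mean, sd):
    """Standardize a raw score against the group mean and SD."""
    return (raw - mean) / sd

print(round(z_score(55, 47.2, 10.0), 2))  # → 0.78  (above the mean)
print(round(z_score(40, 47.2, 10.0), 2))  # → -0.72 (below the mean)
```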
What are the purposes of Grading and Reporting Learners’ Test Performance?
Communicate the level of learning of the learners in specific course content.
Give feedback on what specific topic learners have mastered
Grades serve as a motivator for learners to study and do better
Give parents information about their children’s achievements.
What are the different methods of scoring tests and performance tasks?
1. Number Right Scoring (NR) – the test score is the sum of the scores for correct responses.
2. Negative Marking (NM) – assigns positive values to correct answers while penalizing incorrect
responses.
Both NR and NM methods of scoring multiple-choice tests are prone to guessing, which affects test validity and
reliability.
Other scoring methods were introduced:
o Partial Credit Scoring Methods – attempt to capture a learner’s degree of knowledge with respect to each
response option given.
o Multiple Answer Scoring Method – allows learners to select multiple answers for each item.
o Retrospective Correction for Guessing – treats omitted or unanswered items as incorrect; the correction
for guessing is applied afterward.
o Standard Setting – standards based on norm-referenced assessment are derived from the test performance
of a certain group of learners, while standards from criterion-referenced assessments are preset
from the very start by the teacher or the school in general.
o Holistic Scoring – involves giving a single, overall assessment score for an essay, writing composition, or
other performance-type assessment as a whole.
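The text mentions correction for guessing without giving a formula. One commonly used version (an assumption here, not taken from the text) is formula scoring, which subtracts a fraction of the wrong answers while leaving omitted items unpenalized:

```python
# Formula scoring sketch: corrected = right - wrong / (k - 1),
# where k is the number of options per item. Omitted items
# contribute nothing either way.
def corrected_score(right, wrong, options=4):
    """Right answers minus a penalty for presumed guessing."""
    return right - wrong / (options - 1)

# 40-item test: 28 right, 9 wrong, 3 omitted, 4 options per item.
print(corrected_score(28, 9, options=4))  # → 25.0
```

The rationale is that a pure guesser on a k-option item is right 1/(k − 1) times for every wrong answer, so the penalty cancels out the expected gain from blind guessing.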
The following is an example of a rubric for an oral presentation:

Rating / Grade     Characteristics
A (Exemplary)      Very organized. Has a clear opening statement that catches the audience’s interest. The content of the report is comprehensive and demonstrates substance and depth. Delivery is very clear and understandable. Uses slides/multimedia equipment effortlessly to enhance the presentation.
B (Satisfactory)   Mostly organized. Has an opening statement relevant to the topic. Covers important topics. Has an appropriate pace and no distracting mannerisms. Looks at slides to keep on track.
C (Emerging)       Has an opening statement relevant to the topic but does not give an outline of the speech; is somewhat disorganized. Lacks content and depth in the discussion of the topic. Delivery is fast and not clear; some items are not covered well. Relies heavily on slides and notes and makes little eye contact.
D (Unacceptable)   Has no opening statement regarding the focus of the presentation. Does not give adequate coverage of the topic. Often hard to understand, with a voice that is too soft or too loud and a pace that is too quick or too slow. Just reads slides; slides have too much text.
ANALYTIC SCORING:
Involves assessing each aspect of a performance task, such as:
o Essay writing          o Class debate
o Oral presentation      o Research paper
Grades are given by averaging the ratings across criteria.
Advantages:
o Its reliability
o It provides information about learners’ strengths and weaknesses.
Rubric for a Final Research Paper

Rating scale for every criterion: Expert (4) – at least indicators a to c are satisfied; Proficient (3) – any two of the given indicators are satisfied; Apprentice (2) – any one of the given indicators is satisfied; Novice (1) – none of the given indicators is satisfied.

1. Introduction
a. Clearly identifies and discusses the research focus/purpose
b. The research focus is clearly grounded in previous research / theoretically relevant literature
c. The significance of the study is clearly identified (and how it adds to previous research)
d. Others, please specify

2. Method – provides accurate and thorough information on the following:
a. Research method, design, and context
b. Data sources, collection procedure, and tools
c. Data analysis
d. Others, please specify

3. Results
a. Results are clearly explained at a comprehensive level and are well-organized
b. Tables/figures clearly and concisely convey the data
c. Statistical analyses use appropriate tests and are accurately interpreted
d. Others, please specify

4. Conclusions, Discussions, and Recommendations
a. Interpretations/analyses of results are thoughtful and insightful; are clearly informed by the study’s results; and thoroughly address how they supported, refuted, and/or informed the hypotheses/propositions
b. Discussions of how the study relates to and/or enhances the present scholarship in this area are adequate
c. Suggestions for further research in this area are insightful and thoughtful
d. Others, please specify

5. Documentation and Quality of Sources
a. Cites all data obtained from other sources
b. APA style is accurately used in both text and references
c. Sources are all scholarly and clearly relate to the research
d. Others, please specify

6. Spelling and Grammar
a. No errors in spelling
b. No errors in grammar
c. No errors in the use of punctuation marks
d. Others, please specify

7. Manuscript Format
a. The title page has proper APA formatting
b. Correct headings and subheadings are used consistently
c. Proper margins are observed
d. Others, please specify

Final Grade: (average of the seven criterion ratings)
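As a sketch of analytic scoring, the final grade for a rubric like the one above can be obtained by averaging the criterion ratings; the seven ratings below are hypothetical:

```python
# Analytic scoring sketch: average the 1-4 rating given to each
# criterion of the research-paper rubric (hypothetical ratings).
criteria_ratings = {
    "Introduction": 4,
    "Method": 3,
    "Results": 4,
    "Conclusions, Discussions, and Recommendations": 3,
    "Documentation and Quality of Sources": 4,
    "Spelling and Grammar": 2,
    "Manuscript Format": 4,
}

final_grade = sum(criteria_ratings.values()) / len(criteria_ratings)
print(round(final_grade, 2))  # → 3.43
```

Because each criterion keeps its own rating, the profile also shows strengths and weaknesses (here, spelling and grammar pull the grade down), which is the stated advantage of analytic over holistic scoring.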
Primary Trait Scoring
o Focuses on only one aspect (trait) of the performance
o Its narrow focus can be an advantage or a disadvantage
o It needs a detailed scoring guide
Multiple-Trait Scoring
o Similar to analytic scoring in its focus
o Focuses on specific features, such as:
o The ability to present arguments clearly
o Organization of one’s thoughts
o Correct grammar, punctuation, and spelling
What are the different types of test scores?
Grading methods communicate the teachers’ evaluative appraisal of learners' level of achievement or performance
in a test or task
Test scores can take the form of:
o Raw score - number of items answered correctly on a test
- may be useful if everyone knows the test coverage
o Percentage Score – interpreted as the percentage of the content, skills, or knowledge of which learners
have a solid grasp.
Most appropriate for teacher-made tests or criterion-referenced tests
Suitable for subjects in which a standard has been set.
o Criterion – Reference Grading System
Test scores are based on performance in specified learning goals
It is premised on the assumption that learners’ performance is independent of the performance of
the other learners in their group/class.
Types of Criterion-Referenced Scores or Grades:
Pass or Fail Grade
o Needs a standard or cut-off score
o Appropriate for comprehensive or licensure exams because there is no limit to
the number of examinees who pass or fail
o Advantages:
It takes the pressure off the learners to get a high letter or numerical grade
It gives the learner a clear-cut idea of their strengths/weaknesses
It allows learners to focus on true understanding
Letter Grade – one of the most commonly used grading systems.
o A, B, C, D, E or five-level grading scale
o A – highest level; E or F – lowest grade
21
o While letter grades are easy to use, what they mean exactly is not always clear to parents, learners, and
other stakeholders.
Plus (+) and Minus (-) letter Grade
(+)/(-) Letter Grades Interpretation
A+ Excellent
A Superior
A- Very Good
B+ Good
B Very Satisfactory
B- High Average
C+ Average
C Fair
C- Pass
D Conditional
E/F Failed
Categorical Grades

Exceeding    Meeting        Approaching   Emerging      Not Exceeding
Standards    Standards      Standards     Standards     Standards
Advanced     Intermediate   Basic         Novice        Below Basic
Exemplary    Accomplished   Developing    Beginning     Inadequate
Expert       Proficient     Competent     Apprentice    Novice
Master       Distinguished  Proficient    Intermediate  Novice
o Norm-Referenced Grading System
In grading, learners’ test scores are compared with those of peers.
Norm-referenced grading allows teachers to:
Compare learners’ test performance with that of other learners
Compare learners’ performance in one test (subtest) with another test.
Compare learners’ performance in one form of the test with another form of the test
administered at an earlier date.
Types of Norm-Referenced Scores
Developmental Score – transformed from raw scores; reflects the average performance
at a given age or grade level.
o Grade-Equivalent Score
Describes the test performance of a learner in terms of a grade level and
the months since the beginning of the school year; a decimal point
separates the grade and the month. A grade equivalent of 7.5 means the
learner performs like a Grade 7 student at the end of the fifth month of
the school year.
o Age-Equivalent Score – a learner’s score of 11-5 means that his or her age equivalent
is 11 years and 5 months.
Percentile Rank – a learner who obtained a percentile rank of 75 on a standardized
achievement test scored higher than 75 percent of the examinees in the norm group.
Stanine Score – expresses test results in nine equal steps.

Description      Stanine   Percentile Rank
Very High        9         96 and above
Above Average    8         90-95
                 7         77-89
Average          6         60-76
                 5         40-59
                 4         23-39
Below Average    3         11-22
                 2         4-10
Very Low         1         3 and below
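The stanine bands in the table above can be expressed as a simple lookup:

```python
# Map a percentile rank to its stanine using the bands in the table.
def stanine(percentile_rank):
    """Return the stanine (1-9) for a given percentile rank."""
    bands = [(96, 9), (90, 8), (77, 7), (60, 6), (40, 5),
             (23, 4), (11, 3), (4, 2), (0, 1)]
    for lower, s in bands:
        if percentile_rank >= lower:
            return s

print(stanine(75))  # → 6: percentile 75 falls in the 60-76 band
print(stanine(97))  # → 9: "96 and above"
```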
Standard Score – a raw score converted to a common scale of
measurement that provides a meaningful description of the individual score.
o Z-Score
o T-Score
T = 50 + 10z
A T-score of 50 is considered average
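A minimal sketch of the T-score conversion T = 50 + 10z:

```python
# T-score: rescale a z-score so the mean is 50 and the SD is 10,
# avoiding negative values for scores below the mean.
def t_score(z):
    return 50 + 10 * z

print(t_score(0.0))   # → 50.0 (exactly average)
print(t_score(1.5))   # → 65.0 (1.5 SDs above the mean)
print(t_score(-0.5))  # → 45.0 (half an SD below the mean)
```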
What are the General Guidelines for Grading Tests or Performance Tasks?
1. Stick to the purpose of the assessment.
o Determine the purpose of the test (formative, summative, or diagnostic).
2. Be guided by the learning outcomes. Learners should know what is included in the test.
3. Develop grading criteria – this saves time in the grading process.
4. Inform the learner what scoring methods are to be used.
5. Decide on what type of test scores to use.
What are the General Guidelines for Grading Essay Tests?
- Scoring essay responses can be made more rigorous by developing a scoring scheme.
1. Identify the criteria for rating the essay
2. Determine the type of rubric to use
3. Prepare the rubric
Point Values   Sample Performance Benchmarks
1              Needs Improvement   Beginning      Novice         Inadequate
2              Satisfactory        Developing     Apprentice     Developing
3              Good                Accomplished   Proficient     Proficient
4              Exemplary           Exceptional    Distinguished  Skilled
4. Evaluate essays anonymously.
5. Score one essay question at a time.
6. Be conscious of your own biases when evaluating a paper.
7. Review initial scores and comments before giving the final rating.
8. Get two or more raters for each essay.
9. Write comments.
What is the New Grading System of the Phil. K-12 Program?
The components are Written Work, Performance Task, and Quarterly Assessment; their weights vary by grade level and subject. For Senior High School, the component weights are:

Component              Core       Immersion / Research /    All Other Subjects /
                       Subjects   Business Simulation /     Immersion / Research /
                                  Exhibit / Performance     Exhibit / Performance
Written Work           30%        40%                       20%
Performance Task       50%        40%                       60%
Quarterly Assessment   20%        20%                       20%
Thank You.
God Bless Us All!