ITEM ANALYSIS
Item analysis is a process that examines student responses to individual test items (questions)
in order to assess the quality of those items and of the test as a whole. Item analysis is especially
valuable in improving items that will be used again in later tests, but it can also be used to
eliminate ambiguous or misleading items in a single test administration. In addition, item
analysis is valuable for increasing instructors' skill in test construction and for identifying
specific areas of course content that need greater emphasis or clarity.
Item analysis uses statistics and expert judgment to evaluate tests based on the quality of
individual items and item sets, as well as the relationship of each item to the other items. It
"investigates the performance of items considered individually either in relation to some external
criterion or in relation to the remaining items on the test" (Thompson & Levitov, 2000, p. 163),
and uses this information to improve item and test quality. Item analysis concepts are similar for
norm-referenced and criterion-referenced tests, but they differ in specific, significant ways.
Item analysis refers to a statistical technique that helps instructors identify the effectiveness of
their test items. In developing quality assessments, and effective multiple-choice test items in
particular, item analysis plays an important role in contributing to the fairness of the test and in
identifying content areas that may be problematic for students.
Generally, the process of item analysis works best when class sizes exceed 50 students. In such
cases, item analysis can help in identifying potential mistakes in scoring, ambiguous items, and
alternatives (distractors) that do not work. When performing item analysis, we examine the
following statistical information: item difficulty, item discrimination, and distractor
effectiveness.
ADVANTAGES OF ITEM ANALYSIS
1) It leads to the improvement of individual test items. The analysis of each item enables the
test constructor/user to know the effectiveness of each item, and item analysis provides
diagnostic information for determining the quality of the items.
2) In the view of Cronbach (1990: 178), statistical analysis of items spots questionable items.
When these items are reviewed or rewritten, they increase the validity of the test.
3) Item analysis makes it possible to shorten a test and at the same time increase its validity
and reliability. This is achieved because it helps us to choose items of suitable difficulty
level.
4) Item analysis leads to increased skill in test construction. Item analysis reveals ambiguity,
clues, ineffective distracters, and other technical defects that were missed during the
preparation of the test.
5) It enables us to estimate the difficulty of each item (the percentage of testees who got the
item right in the upper and lower groups).
6) It enables us to estimate the discriminating power of each item (the difference between the
number of testees in the upper and lower groups who got the item right).
The first three steps of this procedure merely provide a convenient tabulation of testees'
responses, from which we can readily obtain estimates of item difficulty, item discriminating
power, and the effectiveness of each distracter. This latter information can frequently be
obtained simply by inspecting the item analysis data.
ITEM DIFFICULTY
This indicates the proportion of students who got the item right. A high percentage indicates an
easy item/question and a low percentage indicates a difficult item. In general, items should have
difficulty values no less than 20% (probability of 0.2) and no greater than 80% (probability of
0.8). Very difficult or very easy items contribute little to the discriminating power of a test.
For items with one correct alternative worth a single point, the item difficulty is simply the
percentage of students who answer an item correctly. In this case, it is also equal to the item
mean. The item difficulty index ranges from 0 to 100; the higher the value, the easier the
question. When an alternative is worth other than a single point, or when there is more than one
correct alternative per question, the item difficulty is the average score on that item divided by
the highest number of points for any one alternative. Item difficulty is relevant for determining
whether students have learned the concept being tested. It also plays an important role in the
ability of an item to discriminate between students who know the tested material and those who
do not. The item will have low discrimination if it is so difficult that almost everyone gets it
wrong or guesses, or so easy that almost everyone gets it right.
Item difficulty is the percentage of people who answer an item correctly. It is the relative
frequency with which examinees choose the correct response (Thorndike, Cunningham,
Thorndike, & Hagen,2004). It has an index ranging from a low of 0 to a high of +1.00. Higher
difficulty indexes indicate easier items. An item answered correctly by 75% of the examinees has
an item difficulty level of .75. An item answered correctly by 35% of the examinees has an
item difficulty level of .35. Item difficulty is a characteristic of the item and the sample that takes
the test.
Item difficulty is calculated by using the following formula (Crocker & Algina, 2000):

Item difficulty = Number who answered the item correctly ÷ Total # tested
A matrix table is needed to begin item analysis. It is a two-dimensional table, with the students
arranged from the highest scorer to the lowest scorer down the rows on the left-hand side, and
the test items represented across the columns.
            Item 1   Item 2   ...   Item 20
Student 1     x
Student 2              x      ...      x
Student 3              x      ...      x
Student 4     x        x      ...
...                    x      ...      x
Student 40             x      ...
Going student by student, enter each student's answers into the cells of the chart. However,
record only the wrong answers (x); any empty cell will therefore signal a correct answer.
To find the item difficulty for each item (Item 1, for example), count the total number of students
who got the item correct, divide by the total number of students who took the test, and multiply
by 100% to get the percentage:

Item difficulty = (number who got the item correct ÷ number who took the test) × 100%

An acceptable range is between 20% (0.2) and 80% (0.8).
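The calculation above can be sketched in a few lines of Python; the function name and the sample figures (28 correct out of 40 testees) are illustrative, not taken from the text:

```python
def item_difficulty(num_correct, num_tested):
    """Item difficulty (p): proportion of testees who answered the item correctly."""
    return num_correct / num_tested

# Illustrative figures: 28 of 40 students got Item 1 right.
p = item_difficulty(28, 40)
print(p)                  # 0.7
print(f"{p * 100:.0f}%")  # 70%
print(0.2 <= p <= 0.8)    # True: within the acceptable 20%-80% range
```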
Another way of determining item difficulty from the table of matrix is to identify the upper 10
scorers and lowest 10 scorers on the test. Set aside the remainder.
Go back to the upper 10 students. Count how many of them got Item 1 correct (this would be all
the empty cells). Write that number at the bottom of the column for those 10. Do the same for the
other items. We will call these sums RU, where U stands for "upper."
Repeat the process for the 10 lowest students. Write those sums under the item columns for that
group. We will call these RL, where L stands for "lower."
Difficulty index is just the proportion of people who passed the item. Calculate it for each
item by adding the number correct in the top group (RU) to the number correct in the bottom
group (RL) and then dividing this sum by the total number of students in the top and bottom
groups (20).
Difficulty index = (RU + RL) ÷ 20
Your result can now be compared with the difficulty range of 0.2 to 0.8 to determine how
difficult or easy the item is. But we cannot analyze items with item difficulty alone; this is
where item discrimination power comes in.
ITEM DISCRIMINATION
The discrimination index compares performance on each item between an upper scoring group
and a lower scoring group (each of these groups consists of twenty-seven percent (27%) of the
total group of students who took the test and is based on the students' total scores for the test).
The discrimination index ranges between -1 and +1. The closer the index is to +1, the more
effectively the item distinguishes between the two groups of students. Sometimes an item will
discriminate negatively. Such an item should be revised or eliminated from scoring, as it
indicates that the lower performing students actually selected the key (the correct response)
more frequently than the top performers.
Discrimination power is concerned with establishing how well the correct option attracts only
those who know the material and fails to attract those who do not. In computing discrimination
power, the participants are categorized into three groups using a 27% margin, namely high
scorers, moderate scorers, and low scorers.
Using the table of matrix with, for instance, forty participants, each category will be 27% of 40:
27% of 40 = 10.8
This is rounded up to 11. Move to the table of matrix and count the first 11 students, then move
to the bottom and count the last 11. These represent the upper scoring group and the lower
scoring group respectively, since the students were initially arranged from the highest scorers to
the lowest scorers. In this analysis we do not bother about the middle scorers.
For discrimination power we are to analyze the correct options, using the following:
RU = number of those in the high scoring group who got the item correct
RL = number of those in the low scoring group who got the item correct
NU = number of those in the high scoring group
NL = number of those in the low scoring group

Discrimination Power = (RU - RL) ÷ ((NU + NL) ÷ 2)
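As a rough sketch, the formula can be computed as follows in Python; the function name and the sample counts are illustrative, not from the text:

```python
def discrimination_power(ru, rl, nu, nl):
    """Discrimination power: (RU - RL) / ((NU + NL) / 2)."""
    return (ru - rl) / ((nu + nl) / 2)

# Illustrative counts: upper and lower groups of 11 students each;
# 9 upper scorers and 4 lower scorers got the item correct.
d = discrimination_power(9, 4, 11, 11)
print(round(d, 2))  # 0.45
```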
The 27 percent is used because “this value will maximize differences in normal
distributions while providing enough cases for analysis” (Wiersma & Jurs, 2001, p. 145).
Comparing the upper and lower groups promotes stability by maximizing differences between
the two groups. The percentage of individuals included in the highest and lowest groups can
vary. Nunnally (2005) suggested 25 percent, while SPSS (2000) uses the highest and lowest one-
third.
Wood (2000) stated that "when more students in the lower group than in the upper group select
the right answer to an item, the item actually has negative validity. Assuming that the criterion
itself has validity, the item is not only useless but is actually serving to decrease the validity of
the test."
The higher the discrimination index, the better the item because high values indicate that the item
discriminates in favor of the upper group which should answer more items correctly. If more low
scorers answer an item correctly, it will have a negative value and is probably flawed.
A negative discrimination index occurs for items that are too hard or poorly written, which
makes it difficult to select the correct answer. On these items poor students may guess correctly,
while good students, suspecting that a question is too easy, may answer incorrectly by reading
too much into the question. Good items have a discrimination index of .40 and higher;
reasonably good items from .30 to .39; marginal items from .20 to .29, and poor items
less than .20 (Ebel & Frisbie, 2002).
DISTRACTOR ANALYSIS
Distractor analysis begins from what we consider to be an acceptable item difficulty value for
test items. If we are to assume that 0.7 is an appropriate item difficulty value, then we should
expect that the remaining 0.3 be about evenly distributed among the distractors. Let us take the
following test item as an example:
Let us assume that 100 students took the test. If we assume that A is the answer and the item
difficulty is 0.7, then 70 students answered correctly. What about the remaining 30 students and
the effectiveness of the three distractors? If all 30 selected D, the distractors B and C are useless
in their role as distractors. Similarly, if 15 students selected D and another 15 selected B,
then C is not an effective distractor and should be replaced. The ideal situation would be for each
of the three distractors to be selected by 10 students. Therefore, for an item which has an item
difficulty of 0.7, the ideal effectiveness of each distractor can be quantified as 10/100 or 0.1.
What would be the ideal value for distractors in a four option multiple choice item when the item
difficulty of the item is 0.4? Hint: You need to identify the proportion of students who did not
select the correct option.
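The even-split reasoning above can be expressed as a small helper function; the function name is an illustrative assumption:

```python
def ideal_distractor_value(difficulty, num_distractors):
    """Share of the wrong-answer proportion each distractor should ideally attract."""
    return (1 - difficulty) / num_distractors

# The worked example from the text: difficulty 0.7, three distractors.
print(round(ideal_distractor_value(0.7, 3), 2))  # 0.1
# The exercise: a four-option item (three distractors) with difficulty 0.4.
print(round(ideal_distractor_value(0.4, 3), 2))  # 0.2
```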
From a different perspective, the item discrimination formula can also be used in distractor
analysis. The concept of upper groups and lower groups would still remain, but the analysis and
expectation would differ slightly from the regular item discrimination that we have looked at
earlier. Instead of expecting a positive value, we should logically expect a negative value as
more students from the lower group should select distracters. Each distractor can have its own
item discrimination value in order to analyse how the distracters work and ultimately refine the
effectiveness of the test item itself. If we use the above item as an example, the item
discrimination concept can be used to assess the effectiveness of each distractor. If a class has
100 students, we can form upper and lower groups of 30 students each. Assume the following are
observed:
Distractor                No. of upper group      No. of lower group      Discrimination
                          students who selected   students who selected   value/index
A. It rained all day              20                      10              (20 - 10)/30 = 0.33
B. He was scolded                  3                       3              (3 - 3)/30 = 0
C. He hurt himself                 4                      16              (4 - 16)/30 = -0.4
D. The weather was hot             3                       1              (3 - 1)/30 = 0.07
The values in the last column of the table can once again be interpreted according to how we
examined item discrimination values, but with a twist. Alternative A is the key and a positive
value is the value that we would want. However, the value of 0.33 is rather low considering the
maximum value is 1. The value for distractor B is 0 and this tells us that the distractor did not
discriminate between the proficient students in the upper group and the weaker students in the
lower group. Hence, the effectiveness of this distractor is questionable. Distractor C, on the other
hand, seems to have functioned effectively. More students in the lower group than in the upper
group selected this distractor. As our intention in distractor analysis is to identify distractors that
would seem to be the correct answer to weaker students, then distractor C seems to have done its
job. The same cannot be said of the final distractor. In fact, the positive value obtained here
indicates that more of the proficient students selected this distractor. We should understand
from this that such a distractor is not doing its job and needs to be reviewed.
Distractor analysis can be a useful tool in evaluating the effectiveness of our distractors. It is
important to be mindful of the distractors used in a multiple choice format test: when distractors
are not effective, they are virtually useless, and there is a greater possibility that students will be
able to select the correct answer by guessing, since the plausible options have been reduced.
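A minimal sketch of this per-option discrimination calculation in Python, using the figures from the worked table (upper and lower groups of 30 students each); the function and variable names are illustrative:

```python
def distractor_discrimination(upper_count, lower_count, group_size):
    """Per-option discrimination: (upper - lower) / group size.
    Negative values are desirable for distractors; positive for the key."""
    return (upper_count - lower_count) / group_size

# (upper-group selections, lower-group selections) for each option
choices = {"A (key)": (20, 10), "B": (3, 3), "C": (4, 16), "D": (3, 1)}
for option, (u, l) in choices.items():
    print(option, round(distractor_discrimination(u, l, 30), 2))
# A (key) 0.33, B 0.0, C -0.4, D 0.07
```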
Some test analysts may desire more complex item statistics. Two correlations which are
commonly used as indicators of item discrimination are shown on the item analysis report. The
first is the biserial correlation, which is the correlation between a student's performance on an
item (right or wrong) and his or her total score on the test. This correlation assumes that the
distribution of test scores is normal and that there is a normal distribution underlying the
right/wrong dichotomy. The biserial correlation has the characteristic, disconcerting to some, of
having maximum values greater than unity. There is no exact test for the statistical significance
of the biserial correlation coefficient.
The point biserial correlation is also a correlation between student performance on an item (right
or wrong) and test score. It assumes that the test score distribution is normal and that the division
on item performance is a natural dichotomy. The possible range of values for the point biserial
correlation is +1 to -1. The Student's t test for the statistical significance of the point biserial
correlation is given on the item analysis report. Enter a table of Student's t values with N - 2
degrees of freedom at the desired percentile point; N, in this case, is the total number of students
appearing in the item analysis.
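For readers who want to compute the point biserial directly, a self-contained sketch is given below. The function name and the small data set are illustrative assumptions, and the population standard deviation of the total scores is used:

```python
import math

def point_biserial(item_correct, total_scores):
    """Point-biserial correlation between item performance (1 = right, 0 = wrong)
    and total test score: r_pb = (M1 - M0) / s * sqrt(p * (1 - p))."""
    n = len(total_scores)
    mean = sum(total_scores) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in total_scores) / n)
    right = [x for x, c in zip(total_scores, item_correct) if c == 1]
    wrong = [x for x, c in zip(total_scores, item_correct) if c == 0]
    p = len(right) / n
    m1 = sum(right) / len(right)   # mean score of those who got the item right
    m0 = sum(wrong) / len(wrong)   # mean score of those who got the item wrong
    return (m1 - m0) / sd * math.sqrt(p * (1 - p))

# Illustrative data: students who got the item right tend to score higher overall.
correct = [1, 1, 1, 0, 0, 1, 0, 1]
scores = [38, 35, 32, 20, 24, 30, 18, 36]
r = point_biserial(correct, scores)
print(round(r, 2))  # 0.92
```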
The mean scores for students who got an item right and for those who got it wrong are also
shown. These values are used in computing the biserial and point biserial coefficients of
correlation and are not generally used as item analysis statistics.
Item analysis data are not synonymous with item validity. An external criterion is
required to accurately judge the validity of test items. By using the internal criterion of
total test score, item analyses reflect internal consistency of items rather than validity.
The discrimination index is not always a measure of item quality. There is a variety of
reasons an item may have low discriminating power:(a) extremely difficult or easy items
will have low ability to discriminate but such items are often needed to adequately
sample course content and objectives;(b) an item may show low discrimination if the test
measures many different content areas and cognitive skills. For example, if the majority
of the test measures “knowledge of facts,” then an item assessing “ability to apply
principles” may have a low correlation with total test score, yet both types of items are
needed to measure attainment of course objectives.
Item analysis data are tentative. Such data are influenced by the type and number of
students being tested, instructional procedures employed, and chance errors. If repeated
use of items is possible, statistics should be recorded for each administration of each
item.
Item analysis is a completely futile process unless the results help instructors improve their
classroom practices and item writers improve their tests. Let us suggest a number of points of
departure in the application of item analysis data.
3) It is only used when the test involves a large population of students.
4) It requires the preparation of a large number of test items.
5) Generally, item statistics will be somewhat unstable for small groups of students. Perhaps
fifty students might be considered a minimum number if item statistics are to be stable.
Note that for a group of fifty students, the upper and lower groups would contain only
thirteen students each. The stability of item analysis results will improve as the group of
students is increased to one hundred or more. An item analysis for very small groups
must not be considered a stable indication of the performance of a set of items.
CONCLUSION
Item analysis is the process of examining test items to ascertain specifically whether each item
is functioning properly in measuring what the entire test is measuring. It begins after the test has
been administered and scored. It also involves a detailed and systematic examination of the
testees' responses to each item to determine the difficulty level and discriminating power of the
item.
This also includes determining the effectiveness of each option. The decision on the quality of an
item depends on the purpose for which the test was designed. However, for an item to effectively
measure what the entire test is measuring and provide valid and useful information, it should not
be too easy or difficult.
REFERENCES
Holt, Rinehart and Winston. (2003). Measurement and evaluation in education and psychology.
New York: Holt, Rinehart and Winston.
Wood, D. A. (2000). Test construction: Development and interpretation of achievement tests.
Columbus, OH: Charles E. Merrill Books, Inc.
Thorndike, R. M., Cunningham, G. K., Thorndike, R. L., & Hagen,E. P. (2004). Measurement
and evaluation in psychology and education (5th Ed.). New York: MacMillan.
SPSS. (2000). Item analysis. spss.com. Chicago: Statistical Package for the Social Sciences.
Educational Testing Service. (2005, August 10). What’s the DIF? Helping to ensure test question
fairness. research@ets.org. Princeton, NJ: The Educational Testing Service
Nunnally, J. C. (2005). Educational measurement and evaluation (4th Ed.). New York: McGraw-
Hill.
Wiersma, W. & Jurs, S. G. (2001). Educational measurement and testing (2nd Ed.). Boston,
MA: Allyn and Bacon.
Ebel, R. L., & Frisbie, D. A. (2002). Essentials of educational measurement. Englewood Cliffs,
NJ: Prentice-Hall.
Adedokun, J. A. (2012). Educational measurement, assessment, evaluation and statistics. Lagos:
New Hope Educational Publishers.
ATTITUDE SCALE QUESTIONNAIRE
SCHOOL………………………………………. LEVEL ……………………………………..
GENDER…………………………FACULTY/DEPARTMENT…………………………
TITLE: ATTITUDE TO ABORTION
10 POSITIVE STATEMENTS
1) ………………………………………………………………………………............
……………………………………………………………………………………….
2) ……………………………………………………………………………………….
…………………………………………………………………………………..........
3) ………………………………………………………………………………………….
…………………………………………………………………………………………
4) …………………………………………………………………………………………
…………………………………………………………………………………………
5) …………………………………………………………………………………………
…………………………………………………………………………………………
6) …………………………………………………………………………………………
…………………………………………………………………………………………..
7) ……………………………………………………………………………………………
…………………………………………………………………………………………….
8) …………………………………………………………………………………………….
…………………………………………………………………………………………….
9) ……………………………………………………………………………………………
……………………………………………………………………………………………..
10) ………………………………………………………………………………………………
………………………………………………………………………………………………
10 NEGATIVE STATEMENTS
1) ………………………………………………………………………………............
……………………………………………………………………………………….
2) ……………………………………………………………………………………….
…………………………………………………………………………………..........
3) ………………………………………………………………………………………….
…………………………………………………………………………………………
4) …………………………………………………………………………………………
…………………………………………………………………………………………
5) …………………………………………………………………………………………
…………………………………………………………………………………………
6) …………………………………………………………………………………………
…………………………………………………………………………………………..
7) ……………………………………………………………………………………………
…………………………………………………………………………………………….
8) …………………………………………………………………………………………….
…………………………………………………………………………………………….
9) ……………………………………………………………………………………………
……………………………………………………………………………………………..
10) ………………………………………………………………………………………………
………………………………………………………………………………………………
UNIVERSITY OF LAGOS
FACULTY OF EDUCATION
SCHOOL OF POSTGRADUATE STUDIES
COURSE TITLE:
ADVANCED MEASUREMENT AND EVALUATION
COURSE CODE
EDF 809
IJIYEMI OLUWASEUN MARGARET     100310046     MEASUREMENT AND EVALUATION
OLAMIJU KIKELOMO PRECIOUS      089034085     EDUCATIONAL PSYCHOLOGY
SESSION: 2015/2016
ATTITUDE TO ABORTION
Please kindly tick as appropriate.
AGE: 18 to 25 ( ), 26 to 30 ( ), 31 to 35 ( ), 36 to 40 ( ), 40 and above ( ).
SA (strongly agree), A (agree), U (undecided), D (disagree), SD (strongly disagree).
s/n statements SA A U D SD
1 Abortion helps to reduce population
2 It promotes illicit sexual activities in young people.
3 It helps to save the life of a medically at-risk mother.
4 It helps to roll away shame for pregnant young people.
5 It serves as a means of making ends meet for some doctors.
6 It can lead to severe damage to the womb.
7 It helps ladies to pursue happiness.
8 It prevents complications arising from pregnancy.
9 It helps to check family size.
10 In the event of rape, it helps to remove an unwanted child.
11 It reduces the number of children proposed by couples.
12 It can cause the death of the woman.
13 It can lead to barrenness when the womb is damaged.
14 In some religious beliefs, it is a sin.
15 It can lead to low self-esteem in an individual.
16 It can cause infections in the body.
17 It prevents the conception of bastard children.
18 It helps to prevent disgrace.
19 Abortion is dangerous to human life.
20 Abortion can be termed brutality.
21 Abortion is capital intensive.
22 Abortion breaks marriages, relationships and homes.
23 Abortion is a murderous act.
24 Abortion safeguards women's health.