Psychology 8a Module II
MODULE II:
BASIC CONCEPTS IN PSYCHOLOGICAL MEASUREMENT AND
STATISTICS APPLIED TO NORMS AND INTERPRETATION OF TESTS
Lesson Objectives:
At the end of this lesson, the student should be able to:
1. define and explain what measurement is;
2. determine the importance of the basic statistical concepts applied to
psychological testing;
3. discuss how norms are used as reference for scoring tests; and
4. identify the characteristics of a good test.
Defining Measurement
Measurement is the process of assigning numbers to objects in such a way that
specific properties of objects are faithfully represented by properties of numbers. This
definition can be refined slightly when applied to psychological measurement, which is
concerned with attributes of persons rather than attributes of objects. Psychological
measurement is the process of assigning numbers (e.g., test scores) to persons in such
a way that some attributes of the persons being measured are faithfully reflected by
some properties of the numbers.
Psychological measurement attempts to represent some attributes of persons in
terms of some properties of numbers. In other words, psychological tests do not attempt
to measure the total person but only some specific set of attributes of that person.
The foundation of psychological measurement is the assumption that individuals
differ in behavior, interests, preferences, perceptions and beliefs. The task of a
psychologist interested in measurement is to devise systematic procedures for
translating these differences into quantitative terms. In other words, psychological
measurement specialists are interested in assigning numbers to individuals that will
reflect their differences.
Psychological tests are designed to measure specific attributes of persons, not the
whole person. There is no test that measures whether one is a good person or a worthy
human being. Psychological tests only tell us ways in which individuals are similar or
different.
Statistical Concepts
What is the importance of statistics to psychology? Psychological measurement
leaves us with lots of numbers, and statistics give us a method for answering questions
about the meaning of those numbers. The primary objective of statistical method is to
organize and summarize quantitative data in order to facilitate their understanding.
Statistics can be used to describe test scores. It can be used to make inferences about
the meaning of test scores. It can provide a method for communicating information
about test scores and for determining what conclusions can and cannot be drawn from
those scores.
Statistical methods have found extensive application in the psychological and
educational testing field and in the study of human ability. Since the time of Binet, who
developed the first extensively used and successful test of intelligence, a
comprehensive body of theory and technique has been developed which is primarily
statistical in nature. This body of theory and technique is concerned with the
construction of instruments for measuring human ability, personal characteristics,
attitudes, interests, and many other aspects of behavior; with the logical conditions
which such measurement instruments must satisfy; with the quantitative prediction of
human behavior; and with other related topics.
Scores on psychological tests are interpreted by reference to norms, which
represent the test performance of the standardization sample. The norms are
empirically established by determining what the persons in a representative group
actually do on the test.
Norms
Scores on psychological tests rarely provide absolute, ratio scale measures of
psychological attributes. Thus, it rarely makes sense to ask, in an absolute sense, how
much intelligence, motivation, depth perception, etc. a person has. Scores on
psychological tests do, however, provide useful relative measures. It makes perfect
sense to ask whether Juan is more intelligent, more motivated, or has better depth
perception than Jose. Psychological tests provide a systematic method of answering
such questions.
One of the most useful ways of describing a person’s performance on a test is to
compare his or her test score to the test scores of some other person or group of
people. Many psychological tests base their scores on a comparison between each
examinee and some standard population that has already taken the test.
When a person’s test score is interpreted by comparing that score to the scores of
several other people, this is referred to as a norm-based interpretation. The scores to
which each individual is compared are referred to as norms which provide standards for
interpreting test scores. A norm-based score indicates where an individual stands in
comparison to the particular normative group that defined the set of standards.
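To make norm-based interpretation concrete, the following minimal Python sketch (the
normative scores and the helper percentile_rank are invented for illustration) reports
the percentage of a normative group scoring below a given examinee:

    # Hypothetical scores from a normative group that has already taken the test
    norm_scores = [12, 15, 18, 20, 21, 23, 25, 27, 30, 34]

    def percentile_rank(score, norms):
        # Percent of the normative group scoring below the given score
        below = sum(1 for s in norms if s < score)
        return 100.0 * below / len(norms)

    # An examinee scoring 26 stands above 70% of this normative group
    print(percentile_rank(26, norm_scores))  # prints 70.0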
Characteristics of a Good Test:
There are three characteristics of a good test. These are:
1. Validity:
It is the closeness of agreement between the scores of the test and some other
measures. It is the general worthiness of an examination. It also refers to the
degree to which the test parallels the curriculum and good teaching practice. It is
the degree to which a test measures what it is supposed to measure.
2. Reliability:
It is the test’s self-consistency. A test is considered highly reliable if it yields
approximately the same scores when given a second time or when alternative
forms of tests are administered to the same person. It is the extent to which a
test measures something consistently.
3. Practicability:
Tests must also be usable. Therefore, tests should be selected on the basis of the
extent to which they can be used without unnecessary expenditure of time, effort
and money.
Psychology 8A
Module II, Lesson 2:
RELIABILITY: THE CONSISTENCY OF TEST SCORES
Lesson Objectives:
At the end of this lesson, the student should be able to:
1. define reliability;
2. discuss the four ways of evaluating or determining reliability;
3. explain coefficient of correlation; and
4. compute coefficient of correlation.
Reliability Defined
Test scores are reliable when they are reproducible and consistent. Tests may be
unreliable for a number of reasons. Confusing or ambiguous test items may mean
different things to a test taker at different times. Tests may be too short to sample the
abilities being tested adequately, or scoring may be too subjective. If a test yields
different results when it is administered on different occasions or scored by different
people, it is unreliable. A simple analogy is a rubber yardstick. If we did not know how
much it stretched each time we took a measurement, the results would be unreliable no
matter how carefully we marked the measurement. Tests must be reliable if the results
are to be used with confidence.
Reliability can be evaluated or determined in four ways:
1. Retest Reliability:
This can be done by obtaining two measures of the same individual/group on
the same test. The two sets of scores obtained from the same test given to the
same individual/group at different times are correlated.
2. Equivalent Form Reliability:
This can be done by giving the test in two different but equivalent forms. Scores
obtained on two forms of the same test, both of which are supposed to sample
the same ability, are correlated.
3. Split Reliability:
This can be done by treating each half of the test separately; that is, correlating
scores on one half of a test with scores on the other half (a short sketch of this
procedure follows this list).
If each individual/group tested achieves roughly the same scores on both
measures, then the test is reliable. Of course, even for a reliable test, some
differences are to be expected between the pair of scores due to chance and
errors of measurement. Consequently, a statistical measure of the degree of
relationship between the set of paired scores is needed. This degree of
relationship is provided by the coefficient of correlation. The coefficient of
correlation between paired scores is called a reliability coefficient. Well-
constructed tests usually have a reliability coefficient of r = .90 or greater.
4. Internal Consistency Method:
The essential characteristic of this method is that the criterion is none other than
the total score on the test itself. The performance of the upper criterion group on
each test item is then compared with that of the lower criterion group.
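To illustrate the split method described in item 3 above, the following minimal Python
sketch (the item responses are invented for illustration) correlates each examinee's
score on the odd-numbered items with his or her score on the even-numbered items:

    import math

    def pearson_r(xs, ys):
        # Product-moment correlation between two sets of paired scores
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
        sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
        sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
        return cov / (sd_x * sd_y)

    # Each row holds one examinee's item scores (1 = correct, 0 = incorrect)
    responses = [
        [1, 1, 1, 0, 1, 1, 1, 0],
        [1, 0, 1, 0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 1, 0, 0, 1, 0, 0],
        [1, 1, 0, 1, 1, 0, 1, 1],
    ]

    # Score the two halves separately: odd-numbered items vs. even-numbered items
    odd_half = [sum(row[0::2]) for row in responses]
    even_half = [sum(row[1::2]) for row in responses]

    # The correlation between the two half-scores estimates the test's reliability
    print(pearson_r(odd_half, even_half))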
Coefficient of Correlation
Correlation refers to the concomitant variation of paired measures. Suppose that a
test is designed to predict success in college. If it is a good test, high scores on it will be
related to high performance in college and low scores will be related to poor
performance. The coefficient of correlation gives us a way of stating the degree of
relationship more precisely. In retest reliability, the coefficient of correlation expresses
the degree of relationship between the two sets of scores obtained from the same test
given to the same individual/group at different times. The same logic applies to the
other ways of evaluating reliability.
The most frequently used method of determining the coefficient of correlation is the
product-moment method, which yields the index conventionally designated r. The
product-moment coefficient r varies between perfect positive correlation (r = +1.00)
and perfect negative correlation (r = −1.00). Lack of any relationship yields r = .00.
The formula for the product-moment coefficient of correlation is:

r = Σ(dx · dy) / (N · σx · σy)

One of the paired measures has been labeled the x-score; the other, the y-score. The
dx and dy refer to the deviations of each score from its mean. N is the number of
paired measures; σx and σy are the standard deviations of the x-scores and the
y-scores.

Computation of Standard Deviation:

σ = √(Σd² / N)

In the example below, the mean of the x-scores is 65 and the mean of the y-scores
is 30.

Students   x-score   y-score    dx    dy    dx²    dy²
A            71        39        6     9     36     81
B            67        27        2    -3      4      9
C            65        33        0     3      0      9
D            63        30       -2     0      4      0
E            59        21       -6    -9     36     81
Sum                                          80    180

Therefore:

σx = √(80/5) = √16 = 4        σy = √(180/5) = √36 = 6

The sum of the cross-products is Σ(dx · dy) = (6)(9) + (2)(−3) + (0)(3) + (−2)(0) +
(−6)(−9) = 102, so

r = Σ(dx · dy) / (N · σx · σy) = 102 / (5 × 4 × 6) = .85
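As a check on this arithmetic, the following minimal Python sketch (Python is used
purely for illustration) reproduces the worked example:

    import math

    # Paired scores for students A through E (from the table above)
    x_scores = [71, 67, 65, 63, 59]
    y_scores = [39, 27, 33, 30, 21]
    n = len(x_scores)

    # Deviations of each score from its mean (dx and dy)
    mean_x = sum(x_scores) / n          # 65
    mean_y = sum(y_scores) / n          # 30
    dx = [x - mean_x for x in x_scores]
    dy = [y - mean_y for y in y_scores]

    # Standard deviations: sigma = sqrt(sum(d^2) / N)
    sigma_x = math.sqrt(sum(d * d for d in dx) / n)   # 4.0
    sigma_y = math.sqrt(sum(d * d for d in dy) / n)   # 6.0

    # Product-moment r = sum(dx * dy) / (N * sigma_x * sigma_y)
    r = sum(a * b for a, b in zip(dx, dy)) / (n * sigma_x * sigma_y)
    print(r)  # prints 0.85, agreeing with the hand computation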
Psychology 8A
Module II, Lesson 3: VALIDITY
Lesson Objectives:
At the end of this lesson, the student should be able to:
1. define validity;
2. discuss the different procedures used to gather evidence for validity; and
3. explain what a criterion is.
Validity Defined
Once you have established reliability, validity comes into the picture. Like reliability
of results, valid interpretations contribute to accuracy in evaluation. Somewhat different
procedures are used to gather evidence for the validity of different kinds of
interpretations.
1. Content Validation:
This involves gathering evidence that assessment tasks adequately represent the
“domain” of knowledge or skills to be assessed. Observations should not only be
adequate in number but they should also represent whatever is to be assessed and
do so in proportion to instruction. The results of a paper-and-pencil test of basketball
rules, for example, can represent knowledge of basketball rules, but they cannot
represent basketball skills, which are in the psychomotor domain. If you teach
psychomotor skills, you must test psychomotor skills in proportion to instruction in
those skills for the results to be interpreted as indications of them. Evidence of content
validity is usually gathered by matching assessment tasks to anticipated learning
outcomes.
2. Construct Validation:
This involves gathering evidence that assessment results represent only what is to
be assessed. A “construct” is a meaningful interpretation of observations. An
interpretation has construct validity if it matches expectations based on theory. For
example, suppose you develop a fifth-grade life science test to assess understanding
of the systems of the human body using multiple-choice items. After administering the
test, which you intended to assess comprehension of those systems, you discover that
some students did not know some of the general vocabulary words in some of the
test items. Factors such as limited vocabulary, tension, test anxiety, fatigue or
dishonesty can change the meaning of test results so that they do not signify what
was intended. Evidence of construct validity can be gathered by finding ways to
verify that assessment results are consequences of only what the test was intended
to assess and not of some other factor.
3. Criterion-related Validation:
This involves gathering evidence that assessment results have some value for
estimating a standard of performance (criterion). Validity can be assessed by
correlating the test score with some external criterion. For example, a positive
correlation between scores on a scholastic aptitude test and freshman grades in
college indicates that the test has reasonable validity.
Criteria
A criterion is a measure, which could be used to determine the accuracy of a
decision. In psychological testing, criteria typically represent measures of the outcomes
that specific treatments or decisions are designed to produce. For example, workers are
selected for jobs on the basis of predictions the personnel department makes regarding
their future performance in the job. The job applicants who are actually hired are those
who, on the basis of test scores or other measures, are predicted to perform at the
highest level. Actual measures of performance on the job serve as criteria for
evaluating the personnel department's decisions. If the workers who were hired
actually do perform at a higher level than those who were not hired, the predictions of
the personnel department are confirmed, or validated. In similar ways, measures of grade point
average or years to complete a degree might serve as criteria for evaluating selection
and placement decisions in the schools.
The correlation between test score and a measure of the outcome of a decision
(criterion) provides an overall measure of the accuracy of predictions. Therefore, the
correlation between test scores and criterion scores can be thought of as a measure of
the validity of decisions. The validity coefficient or the correlation between test scores
and criterion scores provides the basic measure of the validity of a test for making
decisions.
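To make this concrete, a validity coefficient is computed exactly like a reliability
coefficient, except that test scores are paired with criterion scores. The following is a
minimal Python sketch (the test scores, performance ratings, and variable names are
invented for illustration; statistics.correlation requires Python 3.10 or later):

    from statistics import correlation  # Pearson product-moment r (Python 3.10+)

    # Hypothetical selection data: aptitude test scores of hired workers
    # and their later job-performance ratings (the criterion)
    test_scores = [52, 47, 65, 58, 70, 44, 61]
    performance = [3.1, 2.8, 4.2, 3.5, 4.0, 2.5, 3.9]

    # The validity coefficient is the correlation between test and criterion scores
    validity_coefficient = correlation(test_scores, performance)
    print(round(validity_coefficient, 2))  # a high positive value for these data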
Psychology 8A
Module II, Lesson 4: ITEM ANALYSIS
Lesson Objectives:
At the end of this lesson, the student should be able to:
1. discuss the purpose of item analysis;
2. identify the critical features of test items;
3. learn how to carry out distracter analysis and measure item difficulty; and
4. recognize the importance of item analysis.
Distracter Analysis
Typically, there is one correct or preferred answer for each multiple-choice item on a
test. A lot can be learned about test items by examining the frequency with which each
of the incorrect responses is chosen by a group of examinees.
Examine the example below:
Paranoid schizophrenia often involves delusions of persecution or grandeur. Which secondary
symptom would be most likely for a paranoid schizophrenic?
a. Auditory hallucinations
b. Motor paralysis
c. Loss of memory
d. Aversion to food
Correct response is (a):

Response   Number choosing   Percent choosing
a                47                 55
b                13                 15
c                25                 29
d                 1                  1
The table shows that most of the students answered the item correctly. A fair
number of students chose either b or c; very few chose d.
A perfect test item would have two characteristics:
1. People who "knew" the answer to that question would always choose the
correct response.
2. People who did not know the answer would choose randomly among the
response alternatives. This means that some people would guess correctly,
and that each of the possible incorrect responses should be equally popular.
For the test item shown in the table, responses b, c, and d served as distracters.
Fifty-five percent (55%) of the students answered this item correctly. If this were a
perfect test item, we might expect the responses of the other forty-five percent (45%) of
the students to be equally divided among the three distracters. In other words, we might
expect about fifteen percent (15%) of the students to choose each of the three incorrect
responses.
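Tallies like those in the table are easy to produce by machine. The following minimal
Python sketch (the response list is constructed to match the counts in the table above)
computes the number and percent choosing each answer:

    from collections import Counter

    # Responses of 86 examinees to the item above (47 + 13 + 25 + 1 = 86)
    responses = ["a"] * 47 + ["b"] * 13 + ["c"] * 25 + ["d"] * 1
    counts = Counter(responses)
    n = len(responses)

    for option in "abcd":
        pct = 100.0 * counts[option] / n
        print(option, counts[option], round(pct))  # e.g., "a 47 55"

    # For a "perfect" item, the 45% who missed it would split evenly,
    # giving each of the three distracters about 15% of the examinees.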
Item Difficulty
Difficulty is a surprisingly slippery concept. Consider the following two items:
a. (6 × 3) + 4 = _____
b. 9π[ln(−3.68) × (1 − ln(−3.68))] = _____
Most people would agree that item b is more difficult than item a. If asked why item b
is more difficult, they might say that it involves more complex and advanced
procedures than the first.
Consider next another set of items:
a. Who was Savonarola?
b. Who was Babe Ruth?
Most people will agree that item a is more difficult than item b. If asked why item a is
more difficult, they might say that answering it requires more specialized knowledge
than item b.
As these examples suggest, there is a strong temptation to define difficulty in
terms of the complexity or obscurity of the test item. Yet if you were to look at any test
you have recently taken, you would likely find yourself hard pressed to explain precisely
why some items were more difficult than others.
The psychologist doing an item analysis is faced with a similar problem. Some test
items are harder than others, but it is difficult to explain or define difficulty in terms of
some intrinsic characteristic of the items. The strategy adopted by psychometricians is
to define difficulty in terms of the number of people who answer each test item correctly.
If everyone chooses the correct answer, the item is defined as an easy item. If only one
person in one hundred answers an item correctly, the item is defined as a difficult item.
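Under this definition, item difficulty (often reported as an item's p value, the proportion
of examinees answering it correctly) can be computed directly from a matrix of scored
responses. The following is a minimal Python sketch (the score matrix is invented for
illustration):

    # Rows = examinees, columns = items; 1 = correct, 0 = incorrect
    score_matrix = [
        [1, 1, 0],
        [1, 0, 0],
        [1, 1, 1],
        [1, 0, 0],
        [1, 1, 0],
    ]

    n_examinees = len(score_matrix)
    n_items = len(score_matrix[0])

    # p value of an item = proportion answering it correctly;
    # p near 1.00 marks an easy item, p near .00 a difficult one
    for item in range(n_items):
        p = sum(row[item] for row in score_matrix) / n_examinees
        print("item", item + 1, "p =", round(p, 2))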
Psychology 8A
Module II, Lesson 5: COMPUTERS IN PSYCHOLOGICAL TESTING
Lesson Objectives:
At the end of this lesson, the student should be able to:
1. determine how tests are generated from computers;
2. discuss how tests are administered automatically by computer;
3. explain adaptive testing and other applications of computer to test design;
and
4. demonstrate how computers are applied to scoring and interpretation of test
results.
Computer-related Aptitudes
The rapid growth in the use of computers for office work has led to the publication of
several tests for computer-related aptitudes. These tests differ from the usual job
sample procedures employed to assess the competence of trained computer operators.
They are designed primarily for such purposes as the counseling or selection of
potential trainees, or the assignment of present employees to newly established
functions within an office.
An example is the Computer Aptitude, Literacy and Interest Profile. Standardized on
a nationally representative sample of about 1,200 persons, this test provides standard
score norms for each of its six subtests and for the total score; the subtests largely
measure reasoning as applied to visual, non-verbal content.
Other tests have been developed for computer programmers, computer operators
and word processors. Particular tests vary in the extent of prior specialized training they
assume. Hence, they are designed for somewhat different populations of test takers. All
these tests clearly represent a timely application of psychometric techniques to
personnel assessment.
LESSON 1
I. Fill in the blanks with the correct word or group of words.
_____ 1. Measurement is the process of assigning (___) to objects/persons in such
a way that specific properties/attributes being measured are faithfully
reflected by some properties of the numbers.
_____ 2. A psychological test does not attempt to measure the (___) but only some
specific set of attributes of the person.
_____ 3. The primary objective of statistical method is to organize and summarize
(___) data in order to facilitate their understanding.
_____ 4. (___) provides a method for communicating information about test scores
and for determining what conclusions can and cannot be drawn from
those scores.
_____ 5. Scores on psychological tests are interpreted by reference to (___), which
represent the test performance of the standardization sample.
_____ 6. When a person's test score is interpreted by comparing that score to the
scores of several other people, this is referred to as a (___) interpretation.
_____ 7. Since the time of (___), who developed the first extensively used and
successful test of intelligence, a comprehensive body of theory and
technique has been developed which is primarily statistical in nature.
_____ 8. (___) is the extent to which a test measures something consistently.
_____ 9. The general worthiness of an examination is referred to as (___).
_____ 10. Tests should be selected on the basis of the extent to which they can be
used without unnecessary expenditure of (___), (___), and (___).
II. True or False.
_____ 1. Individuals differ only in interests and preferences but not in perceptions
and beliefs.
_____ 2. There is a test that measures whether one is a good person or a worthy
human being.
_____ 3. Statistical methods have found extensive application in psychological
testing.
_____ 4. A norm-based score indicates where an individual stands in comparison to
other persons.
_____ 5. A test is considered highly valid if it yields approximately the same scores
when given a second time.
_____ 6. Tests must also be usable.
_____ 7. Reliability is the closeness of agreement between the scores of the test
and some other measures.
_____ 8. Psychological measurement specialists are interested in assigning
numbers to individuals that will reflect their differences.
_____ 9. Statistics can be used to make inferences about the meaning of test
scores.
_____ 10. Scores on psychological tests always provide absolute, ratio scale
measures of psychological attributes.
LESSON 2
I. Fill in the blank space with the correct word or group of words.
_____ 1. Test scores are (___) when they are reproducible and consistent.
_____ 2. (___) reliability can be done by treating half of the test separately.
_____ 3. If a test yields different results when it is administered on different
occasions or scored by different people, it is (___).
_____ 4. The coefficient of (___) between paired scored is called a reliability
coefficient.
_____ 5. Well-constructed tests usually have a reliability coefficient r of (___) or
greater.
_____ 6. In the product-moment method, lack of any relationship between two tests
yields r equal to (___).
_____ 7. (___) reliability can be done by giving the test in two different but
equivalent forms.
_____ 8. (___) refers to the concomitant variation of paired measures.
_____ 9. If it is a good test, high scores on it will be related to high performance in
college and low scores will be related to (___).
_____ 10. In the method of (___) consistency, the essential characteristic is that the
criterion is none other than the total score on the test itself.
II. Compute the coefficient of correlation and analyze.
LESSON 3
I. True or False.
_____ 1. Once you have established reliability, validity is out of the picture.
_____ 2. The result of a paper and pencil test of basketball rules can represent
basketball skills.
_____ 3. Validity can be assessed by correlating the test scores with some external
criterion.
_____ 4. In psychological testing criteria typically represent measures of the
outcomes that specific treatments or decisions are designed to produce.
_____ 5. Actual measures of performance on the job cannot serve as criteria for
evaluating the personnel department's decisions.
II. Fill in the blanks with the correct word or group of words.
_____ 1. A (___) is a meaningful interpretation of observations.
_____ 2. A criterion is a measure, which could be used to determine the (___) of a
decision.
_____ 3. (___) validation involves gathering evidence that assessment results
have some value for estimating a standard of performance.
_____ 4. (___) validation involves gathering evidence that assessment tasks
adequately represent the "domain" of knowledge or skills to be assessed.
_____ 5. Evidence of (___) validity can be gathered by finding ways to verify that
assessment results are consequences of only what the test was intended
to assess and not of some other factor(s).
LESSON 4
LESSON 5
Answer the following questions briefly but clearly.
1. List the features of computers as related to psychological testing.
2. Discuss the significant contributions of computers to scoring and interpretation of
tests.
3. Explain how classroom teachers use computers in preparation of instructional
tests.
Psychology 8A
Module II
ANSWERS TO THE SELF-PROGRESS CHECK TESTS
Lesson 1
1. Numbers 6. Norm-based
2. Total person 7. Binet
3. Quantitative 8. Reliability
4. Statistics 9. Validity
5. Norms 10. Time, effort & money
II.
1. F 2. F 3. T 4. T 5. F 6. T 7. F 8. T 9. T 10. F
Lesson 2
1. Reliable 6. .00
2. Split 7. Equivalent form
3. Unreliable 8. Correlation
4. Correlation 9. Poor Performance
5. .90 10. Internal
II. The computed coefficient of correlation is approximately .99, which is close to 1, so
the correlation is nearly perfect.
Lesson 3
Test I.
1. F 2. F 3. T 4. T 5. F
II.
1. Construct
2. Accuracy
3. Criterion-related
4. Content
5. Construct
Lesson 4
1. a. help increase understanding of the test
b. determine whether the test is reliable or unreliable
c. determine whether tests fail to show expected levels of validity
d. help us understand why a test can be used to predict some criteria
e. suggest ways of improving the measurement of characteristics
2. Number of persons expected to choose each distracter = 8; p value = .71
Lesson 5
1. a. precise adherence to schedules and plans
b. delicate control of stimuli and of judgments about responses
c. choice of successive test items in the light of performance to date
d. immunity to fatigue, boredom, lapses of attention and inadvertent scoring
errors
e. instant and accurate scoring
f. legible records in several forms, with multiple copies and distant
transmission
2. a. speed in data analysis and scoring
b. automated administration provides an easier and better way of administering
tests
3. a. teachers can prepare innumerable forms of instructional tests
b. tests can be saved on cards or tape for future use
c. teachers can stock up on tests covering the week's lessons