Finals Psych Ass Reviewer
TEST CONCEPTUALIZATION
Some Preliminary Questions
Pilot Work
TEST CONSTRUCTION
Scaling
Writing Items
Scoring Items
TEST TRYOUT
What Is a Good Item?
ITEM ANALYSIS
The Item-Difficulty Index
The Item-Reliability Index
The Item-Validity Index
The Item-Discrimination Index
Item-Characteristic Curves
Other Considerations in Item Analysis
Qualitative Item Analysis
TEST REVISION
Test Revision as a Stage in New Test Development
Test Revision in the Life Cycle of an Existing Test
The Use of IRT in Building and Revising Tests
INSTRUCTOR-MADE TESTS FOR IN-CLASS USE
Addressing Concerns About Classroom Tests
CLOSE-UP Creating and Validating a Test of Asexuality
MEET AN ASSESSMENT PROFESSIONAL Meet Dr. Scott Birkeland
EVERYDAY PSYCHOMETRICS Adapting Tools of Assessment for Use with Specific Cultural Groups
SELF-ASSESSMENT
TEST DEVELOPMENT
Test Development – an umbrella term for all that goes into the process of creating a test.
➢ Some tests are conceived of and constructed but neither tried-out, nor item-analyzed,
nor revised.
1. Test Conceptualization
− idea for a test
2. Test Construction
− stage in the process of test development that entails writing test items (or re-writing
or revising existing items)
− formatting items, setting scoring rules, and otherwise designing and building a test
3. Test Tryout
− administered to a representative sample of test takers under conditions that
simulate the conditions that the final version of the test will be administered under
4. Item Analysis
− Statistical procedures employed to assist in making judgments about which items
are good as they are, which items need to be revised, and which items should be
discarded.
− The analysis of the test’s items may include analyses of item reliability, item validity,
and item discrimination. Depending on the type of test, item-difficulty level may be
analyzed as well.
5. Test Revision
− action taken to modify a test’s content or format for the purpose of improving the
test’s effectiveness as a tool of measurement
− This action is usually based on item analyses, as well as related information derived
from the test tryout.
TEST CONCEPTUALIZATION
A measurement interest related to aspects of the LGBT (lesbian, gay, bisexual, and transgender) experience has increased. The present authors propose that, in the interest of comprehensive inclusion, an "A" should be added to the end of "LGBT" so that the term is routinely abbreviated as "LGBTA." The additional "A" would acknowledge the existence of asexuality as a sexual orientation or preference.

Norm-referenced versus Criterion-referenced Tests: Item Development Issues
➢ A good item on a norm-referenced achievement test is an item for which high scorers on the test respond correctly; low scorers tend to respond to that same item incorrectly.
2. Unidimensional/Multidimensional
o Rating scales differ in the number of dimensions underlying the ratings being made.
Cumulative Model
− Rule that the higher the score on the test, the higher the testtaker is on the ability, trait, or other
characteristic that the test purports to measure.
Class Scoring (category scoring)
− Rule that testtaker responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar.
TEST TRYOUT
• Having created a pool of items from which the final version of the test will be developed, the test developer will try out the test. The test should be tried out on people who are similar in critical respects to the people for whom the test was designed.
• Example: if a test is designed to aid in decisions regarding the selection of corporate employees with management potential at a certain level, it would be appropriate to try out the test on corporate employees at the targeted level.

CHARACTERISTICS OF TEST TRYOUT
• The number of people on whom the test should be tried out should be no fewer than 5 subjects, and preferably as many as 10, for each item on the test. The more subjects in the tryout, the better, because a larger sample weakens the role of chance in subsequent data analysis. A definite risk in using too few subjects during test tryout comes during factor analysis of the findings, when what we might call Phantom Factors—factors that actually are just artifacts of the small sample size—may emerge.
• Test tryout should be executed under conditions as identical as possible to the conditions under which the standardized test will be administered; all instructions, and everything from the time limits allotted for completing the test to the atmosphere at the test site, should be as similar as possible. The test developer thereby endeavors to ensure that differences in response to the test's items are due in fact to the items, not to extraneous factors.

ITEM ANALYSIS
What Is a Good Item?
1. A good test item is reliable and valid.
2. A good test item helps to discriminate testtakers.
• A good test item is one that is answered correctly (or in an expected manner) by high scorers on the test as a whole. Conversely, a good test item is one that is answered incorrectly by low scorers on the test as a whole.

How does a test developer identify good items?
• After the first draft of the test has been administered to a representative group of examinees, the test developer analyzes test scores and responses to individual items – Item Analysis.
• The different types of statistical scrutiny that the test data can potentially undergo at this point are referred to collectively as Item Analysis. Although item analysis tends to be regarded as a quantitative endeavor, it may also be qualitative.
• Among the tools test developers might employ to analyze and select items are:
1. An index of the item's difficulty
2. An index of the item's reliability
3. An index of the item's validity
4. An index of item discrimination
1. The Item-Difficulty Index
• If everyone gets the item right, the item is too easy; if everyone gets the item wrong, the item is too difficult.
• An index of an item's difficulty is obtained by calculating the proportion of the total number of testtakers who answered the item correctly.
• A lowercase italic "p" (p) is used to denote item difficulty, and a subscript refers to the item number (so p1 is read "item-difficulty index for item 1"). The value of an item-difficulty index can theoretically range from 0 (if no one got the item right) to 1 (if everyone got the item right).
− If 50 of the 100 examinees answered item 2 correctly, then the item-difficulty index for this item would be equal to 50 divided by 100, or .5 (p2 = .5).
− If 75 of the examinees got item 3 right, then p3 would be equal to .75 and we could say that item 3 was easier than item 2.
• *Note: The larger the item-difficulty index, the easier the item.
• Because p refers to the percent of people passing an item, the higher the p for an item, the easier the item. The statistic referred to as an item-difficulty index in the context of achievement testing may be an item-endorsement index in other contexts, such as personality testing.
• An index of the difficulty of the average test item for a particular test can be calculated by averaging the item-difficulty indices for all the test's items. This is accomplished by summing the item-difficulty indices for all test items and dividing by the total number of items on the test.
• For maximum discrimination among the abilities of the testtakers, the optimal average item difficulty is approximately .5, with individual items on the test ranging in difficulty from about .3 to .8.
• The midpoint representing the optimal item difficulty is obtained by summing the chance success proportion and 1.00 and then dividing the sum by 2: optimal difficulty = (chance success proportion + 1.00) / 2. For example, for four-option multiple-choice items the chance success proportion is .25, so the optimal average item difficulty would be (.25 + 1.00) / 2 ≈ .63.

2. The Item-Reliability Index
• The item-reliability index provides an indication of the internal consistency of a test; the higher this index, the greater the test's internal consistency.
• This index is equal to the product of the item-score standard deviation (s) and the correlation (r) between the item score and the total test score.

Factor Analysis and Inter-Item Consistency
• A statistical tool useful in determining whether items on a test appear to be measuring the same thing(s) is factor analysis. Through the judicious use of factor analysis, items that do not "load on" the factor that they were written to tap (that is, items that do not appear to be measuring what they were designed to measure) can be revised or eliminated. If too many items appear to be tapping a particular area, the weakest of such items can be eliminated.
• Additionally, factor analysis can be useful in the test interpretation process, especially when comparing the constellation of responses to the items from two or more groups.
• Thus, for example, if a particular personality test is administered to two groups of hospitalized psychiatric patients, each group with a different diagnosis, then the same items may be found to load on different factors in the two groups. Such information will compel the responsible test developer to revise or eliminate certain items from the test or to describe the differential findings in the test manual.
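As a rough illustration of the indices above, here is a minimal Python sketch (using made-up response data, not any particular test) that computes the item-difficulty index, the average test difficulty, the optimal average difficulty for a given chance-success proportion, and the item-reliability index (item-score standard deviation times the item-total correlation):

```python
import numpy as np

# Made-up scored responses: rows = testtakers, columns = items (1 = correct, 0 = incorrect)
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
])

# Item-difficulty index p: proportion of testtakers who answered each item correctly
p = responses.mean(axis=0)

# Average difficulty of the test: sum of the item-difficulty indices divided by the number of items
average_difficulty = p.mean()

# Optimal average difficulty = (chance success proportion + 1.00) / 2
chance_success = 0.25                      # e.g., four-option multiple-choice items
optimal_difficulty = (chance_success + 1.00) / 2

# Item-reliability index: item-score standard deviation (s) times the
# correlation (r) between the item score and the total test score
total_scores = responses.sum(axis=1)
s = responses.std(axis=0)
r = np.array([np.corrcoef(responses[:, j], total_scores)[0, 1]
              for j in range(responses.shape[1])])
item_reliability_index = s * r

print(p, average_difficulty, optimal_difficulty, item_reliability_index)
```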
3. The Item-Validity Index
• The item-validity index is a statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure. The higher the item-validity index, the greater the test's criterion-related validity. The item-validity index can be calculated once the following two statistics are known:
– the item-score standard deviation
– the correlation between the item score and the criterion score
• The item-score standard deviation of item 1 (denoted by the symbol s1) can be calculated using the index of the item's difficulty (p1) in the following formula: s1 = √[p1(1 − p1)]. For example, if p1 = .5, then s1 = √(.5 × .5) = .5.
• The correlation between the score on item 1 and a score on the criterion measure (denoted by the symbol r1C) is multiplied by item 1's item-score standard deviation (s1), and the product is equal to an index of the item's validity (s1r1C).
• Calculating the item-validity index will be important when the test developer's goal is to maximize the criterion-related validity of the test. A visual representation of the best items on a test (if the objective is to maximize criterion-related validity) can be achieved by plotting each item's item-validity index and item-reliability index.

4. The Item-Discrimination Index
• Indicates how adequately an item separates, or discriminates, between high scorers and low scorers on an entire test.
− An item on an achievement test is not doing its job if it is answered correctly by respondents who least understand the subject matter.
− An item on a test purporting to measure a particular personality trait is not doing its job if responses indicate that people who score very low on the test as a whole (indicating absence or low levels of the trait in question) tend to score very high on the item (indicating that they are very high on the trait in question—contrary to what the test as a whole indicates).
• The item-discrimination index is a measure of item discrimination, symbolized by a lowercase italic "d" (d).
− This estimate of item discrimination compares performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores. The optimal boundary lines for what we refer to as the "upper" and "lower" areas of a distribution of scores will demarcate the upper and lower 27% of the distribution of scores—provided the distribution is normal (Kelley, 1939).
− As the distribution of test scores becomes more platykurtic (flatter), the optimal boundary line for defining upper and lower increases to near 33% (Cureton, 1957). Allen and Yen (1979, p. 122) assure us that "for most applications, any percentage between 25 and 33 will yield similar estimates."
• The item-discrimination index is a measure of the difference between the proportion of high scorers answering an item correctly and the proportion of low scorers answering the item correctly; the higher the value of d, the greater the number of high scorers answering the item correctly. A negative d-value on a particular item is a red flag because it indicates that low-scoring examinees are more likely to answer the item correctly than high-scoring examinees. This situation calls for some action such as revising or eliminating the item.
• Example: A history teacher gave the American History Test to a total of 119 students who were just weeks away from completing ninth grade. The teacher isolated the upper (U) and lower (L) 27% of the test papers, with a total of 32 papers in each group. Observe that 20 testtakers in the U group answered Item 1 correctly and that 16 testtakers in the L group answered Item 1 correctly. With an item-discrimination index equal to .13, Item 1 is probably a reasonable item because more U-group members than L-group members answered it correctly. The higher the value of d, the more adequately the item discriminates the higher-scoring from the lower-scoring testtakers. For this reason, Item 2 is a better item than Item 1 because Item 2's item-discrimination index is .63.
• Example: The highest possible value of d is +1.00. This value indicates that all members of the U group answered the item correctly whereas all members of the L group answered the item incorrectly. If the same proportion of members of the U and L groups pass the item, then the item is not discriminating between testtakers at all and d, appropriately enough, will be equal to 0.
*The lowest value that an index of item discrimination can take is −1.00. A d equal to −1.00 is a test developer's nightmare: it indicates that all members of the U group failed the item and all members of the L group passed it. On the face of it, such an item is the worst possible type of item and is in dire need of revision or elimination. However, through further investigation of this unanticipated finding, the test developer might learn or discover something new about the construct being measured.
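A short Python sketch of the two indices just described, reusing the figures from the ninth-grade American History Test example above and a hypothetical item-criterion correlation for the item-validity index:

```python
import numpy as np

# --- Item-validity index for a single item: s1 * r1C ---
p1 = 0.5                          # item-difficulty index for item 1
s1 = np.sqrt(p1 * (1 - p1))       # item-score standard deviation derived from the difficulty index
r1C = 0.40                        # hypothetical correlation between the item score and the criterion score
item_validity_index = s1 * r1C    # = .20 with these made-up figures

# --- Item-discrimination index d ---
# d = (correct in upper group - correct in lower group) / number of papers in one group,
# where the groups are the upper and lower 27% of the total-score distribution.
def item_discrimination(correct_upper, correct_lower, group_size):
    return (correct_upper - correct_lower) / group_size

# Figures from the American History Test example: 32 papers per group;
# 20 U-group and 16 L-group testtakers answered Item 1 correctly.
d_item1 = item_discrimination(20, 16, 32)   # = .125, reported as .13

print(item_validity_index, d_item1)
```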
Other Considerations in Item Analysis
• Guessing. In achievement testing, the problem of how to handle testtaker guessing is one that has eluded any universally acceptable solution. Methods designed to detect guessing (S.-R. Chang et al., 2011), minimize the effects of guessing (Kubinger et al., 2010), and statistically correct for guessing (Espinosa & Gardeazabal, 2010) have been proposed, but no such method has achieved universal acceptance.
• To better appreciate the complexity of the issues, consider the following three criteria that any correction for guessing must meet, as well as the other interacting issues that must be addressed:
✓ 1. A correction for guessing must recognize that, when a respondent guesses at an answer on an achievement test, the guess is not typically made on a totally random basis. It is more reasonable to assume that the testtaker's guess is based on some knowledge of the subject matter and the ability to rule out one or more of the distractor alternatives. However, the individual testtaker's amount of knowledge of the subject matter will vary from one item to the next.
✓ 2. A correction for guessing must also deal with the problem of omitted items. Sometimes, instead of guessing, the testtaker will simply omit a response to an item. Should the omitted item be scored "wrong"? Should the omitted item be excluded from the item analysis? Should the omitted item be scored as if the testtaker had made a random guess? Exactly how should the omitted item be handled?
✓ 3. Just as some testtakers may be luckier than others in guessing the choices that are keyed correct, any correction for guessing may seriously underestimate or overestimate the effects of guessing for lucky and unlucky testtakers. To date, no solution to the problem of guessing has been deemed entirely satisfactory. The responsible test developer addresses the problem of guessing by including in the test manual (1) explicit instructions regarding this point for the examiner to convey to the examinees and (2) specific instructions for scoring and interpreting omitted items.

Qualitative Item Analysis
• Test users have had a long-standing interest in understanding test performance from the perspective of testtakers (Fiske, 1967; Mosier, 1947). The calculation of item-validity, item-reliability, and other such quantitative indices represents one approach to understanding testtakers. Another general class of research methods is referred to as qualitative.
• Qualitative methods – techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures. Encouraging testtakers—on a group or individual basis—to discuss aspects of their test-taking experience is, in essence, eliciting or generating "data" (words). These data may then be used by test developers, users, and publishers to improve various aspects of the test.
• Qualitative item analysis – a general term for various nonstatistical procedures designed to explore how individual test items work. The analysis compares individual test items to each other and to the test as a whole. In contrast to statistically based procedures, qualitative methods involve exploration of the issues through verbal means such as interviews and group discussions conducted with testtakers and other relevant parties. Some of the topics researchers may wish to explore qualitatively are summarized in a table.

Expert Panels
• In addition to interviewing testtakers individually or in groups, expert panels may also provide qualitative analyses of test items. A Sensitivity Review is a study of test items, typically conducted during the test development process, in which items are examined for fairness to all prospective testtakers and for the presence of offensive language, stereotypes, or situations.
• Some of the possible forms of content bias that may find their way into any achievement test were identified as follows (Stanford Special Report, 1992, pp. 3–4):
i. Status: Are the members of a particular group shown in situations that do not involve authority or leadership?
ii. Stereotype: Are the members of a particular group portrayed as uniformly having certain (1) aptitudes, (2) interests, (3) occupations, or (4) personality characteristics?
iii. Familiarity: Is there greater opportunity on the part of one group to (1) be acquainted with the vocabulary or (2) experience the situation presented by an item?
iv. Offensive Choice of Words: (1) Has a demeaning label been applied, or (2) has a male term been used where a neutral term could be substituted?
v. Other: Panel members were asked to be specific regarding any other indication of bias they detected.
• On the basis of qualitative information from an expert panel or testtakers themselves, a test user or developer may elect to modify or revise the test. Revision typically involves rewording items, deleting items, or creating new items. Note that there is another meaning of test revision beyond that associated with a stage in the development of a new test.
• After a period of time, many existing tests are scheduled for republication in new versions or editions. The development process that the test undergoes as it is modified and revised is called test revision.
TEST REVISION
5. Test Revision
• We first consider aspects of test revision as a stage in the development of a new test. Later we will consider aspects of test revision in the context of modifying an existing test to create a new edition.

Test Revision as a Stage in New Test Development
• Having conceptualized the new test, constructed it, tried it out, and item-analyzed it both quantitatively and qualitatively, what remains is to act judiciously on all the information and mold the test into its final form.
• On the basis of the information generated at the item-analysis stage, some items from the original item pool will be eliminated and others will be rewritten. How is information about the difficulty, validity, reliability, discrimination, and bias of test items—along with information from the item-characteristic curves—integrated and used to revise the test?

Ways of approaching test revision:
• One approach is to characterize each item according to its strengths and weaknesses. Test developers may find that they must balance various strengths and weaknesses across items. If many otherwise good items tend to be somewhat easy, the test developer may purposefully include some more difficult items even if they have other problems. Those more difficult items may be specifically targeted for rewriting.
• Items demonstrating excellent item discrimination, leading to the best possible test discrimination, will be made a priority.
• Write a large item pool – Poor items can be eliminated in favor of those that were shown on the test tryout to be good items.
• The next step is to administer the revised test under standardized conditions to a second appropriate sample of examinees.
• On the basis of an item analysis of data derived from this administration of the second draft of the test, the test developer may deem the test to be in its finished form. Once the test is in finished form, the test's norms may be developed from the data, and the test will be said to have been "standardized" on this (second) sample.
• When the item analysis of data derived from a test administration indicates that the test is not yet in finished form, the steps of revision, tryout, and item analysis are repeated until the test is satisfactory and standardization can occur. Once the test items have been finalized, professional test development procedures dictate that conclusions about the test's validity await a cross-validation of findings.

Test Revision in the Life Cycle of an Existing Test
• Time waits for no person. We all get old, and tests get old, too. Just like people, some tests seem to age more gracefully than others.
Examples: The Rorschach Inkblot Test seems to have held up quite well over the years. The stimulus materials for another projective technique, the Thematic Apperception Test (TAT), are showing their age. There comes a time in the life of most tests when the test will be revised in some way or its publication will be discontinued.
When is that time?
• No hard-and-fast rules exist for when to revise a test.
• The American Psychological Association (APA, 1996b, Standard 3.18) said that an existing test be kept in its present form as long as it remains "useful" but that it should be revised "when significant changes in the domain represented, or new conditions of test use and interpretation, make the test inappropriate for its intended use."
• Practically speaking, many tests are deemed to be due for revision when any of the following conditions exist:
(1) The stimulus materials look dated and current testtakers cannot relate to them.
(2) The verbal content of the test, including the administration instructions and the test items, contains dated vocabulary that is not readily understood by current testtakers.
(3) As popular culture changes and words take on new meanings, certain words or expressions in the test items or directions may be perceived as inappropriate or even offensive to a particular group and must therefore be changed.
(4) The test norms are no longer adequate as a result of group membership changes in the population of potential testtakers.
(5) The test norms are no longer adequate as a result of age-related shifts in the abilities measured over time, and so an age extension of the norms (upward, downward, or in both directions) is necessary.
(6) The reliability or the validity of the test, as well as the effectiveness of individual test items, can be significantly improved by a revision.
(7) The theory on which the test was originally based has been improved significantly, and these changes should be reflected in the design and content of the test.
The steps to revise an existing test parallel those to create a brand-new one.
a) In the test conceptualization phase, the test developer must think through the objectives of
the revision and how they can best be met.
b) In the test construction phase, the proposed changes are made.
c) Test tryout, item analysis, and test revision (in the sense of making final refinements) follow.
• Formal Item-Analysis methods must be employed to evaluate the stability of items between
revisions of the same test (Knowles & Condon, 2000).
• Ultimately, scores on a test and on its updated version may not be directly comparable.
• A key step in the development of all tests—brand-new or revised editions—is cross-validation.
Cross-validation
• refers to the revalidation of a test on a sample of testtakers other than those on whom test
performance was originally found to be a valid predictor of some criterion.
Co-validation
• a test validation process conducted on two or more tests using the same sample of testtakers. When used in conjunction with the creation of norms or the revision of existing norms, this process may also be referred to as co-norming.

Item Formats: Advantages and Disadvantages

Multiple-choice
• Advantages: Can sample a great deal of content in a relatively short time. Allows for precise interpretation and little ambiguity (more than other formats). Minimizes "bluffing" or guessing. May be machine- or computer-scored.
• Disadvantages: Not useful for expression of original or creative thought. Not all subject matter lends itself to reduction to one and only one best answer. Time-consuming to construct. May test trivial knowledge. Guessing may distort results.

Binary-choice items (e.g., true/false)
• Advantages: Can sample a great deal of content in a relatively short time. Easy to construct and score. May be machine- or computer-scored.
• Disadvantages: Susceptible to guessing, especially for "test-wise" students. Difficult to detect use of test-taking strategies. Ambiguous statements may lead to misinterpretation. Can be misused without careful validation.

Matching
• Advantages: Efficient for evaluating recall of related facts. Good for large amounts of content. Easy to score, especially via machine. Can be part of paper-based or computer-based tests.
• Disadvantages: Similar to other selected-response formats, does not test the ability to create a correct answer. Clues in choices may aid guessing. May overemphasize trivial knowledge.

Completion or short-answer (fill-in-the-blank)
• Advantages: Effective for partial knowledge and low-level objectives. Relatively easy to construct. Useful in online testing. Easy to guess with limited clues.
• Disadvantages: May test only surface-level knowledge. Limited response format (usually one word or a few words). Scoring may be inconsistent. Typically hand-scored.

Essay
• Advantages: Good for measuring complex, creative, and original thought. Effective when well constructed. Encourages deep learning and integration. Can assess writing and organization skills.
• Disadvantages: May not cover a wide content area. Scoring is subjective and time-consuming. A testtaker with limited knowledge may write off-topic. Grading can be biased or unreliable. Typically hand-scored.

Developing Item Banks
1. Developing an item bank is not simply a matter of collecting a large number of items. Many item-banking efforts begin with the collection of appropriate items from existing instruments (Instruments A, B, and C) or new items.
2. All items available for use, as well as new items created especially for the item bank, constitute the item pool.
3. The item pool is then evaluated by content experts, potential respondents, and survey experts using a variety of qualitative and quantitative methods. The items that "make the cut" after such scrutiny constitute the preliminary item bank.
4. Administration of all of the questionnaire items to a large and representative sample of the target population.
5. After administration of the preliminary item bank to the entire sample of respondents, responses to the items are evaluated with regard to several variables such as validity, reliability, domain coverage, and differential item functioning. The final item bank will consist of a large set of items all measuring a single domain (or a single trait or ability).
6. A test developer may then use the banked items to create one or more tests with a fixed number of items. For example, a teacher may create two different versions of a math test in order to minimize efforts by testtakers to cheat. The item bank can also be used for purposes of computerized-adaptive testing.
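As a minimal illustration of step 6 above (hypothetical item IDs; not any particular banking system), a short Python sketch of drawing two different fixed-length forms from a banked pool, as a teacher might do to create two versions of a test:

```python
import random

# Hypothetical banked items, already vetted for the same domain
item_bank = [f"ITEM_{i:03d}" for i in range(1, 41)]   # 40 banked items

def build_form(bank, n_items, seed):
    """Draw a fixed-length test form from the item bank (no repeated items within a form)."""
    rng = random.Random(seed)
    return rng.sample(bank, n_items)

form_a = build_form(item_bank, n_items=20, seed=1)
form_b = build_form(item_bank, n_items=20, seed=2)
print(form_a[:5])
print(form_b[:5])
```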
CHAPTER 9 INTELLIGENCE AND ITS MEASUREMENT
ISSUES IN THE ASSESSMENT OF INTELLIGENCE
Culture and Measured Intelligence
The Construct Validity of Tests of Intelligence
THE ROLE OF TESTING AND ASSESSMENT IN EDUCATION IN THE SCHOOLS
Response to Intervention
Dynamic Assessment
The Secondary-School Level
The Kaufman Assessment Battery for Children (K-ABC) and the Kaufman Assessment Battery for Children, Second Edition (KABC-II)
OTHER TOOLS OF ASSESSMENT IN EDUCATIONAL SETTINGS
Measuring Study Habits, Interests, and Attitudes
Measuring Intelligence
− entails sampling an examinee's performance on different types of tests and tasks as a function of developmental level
− Intelligence measurement involves sampling an individual's performance on various tasks suited to their developmental stage.
− Intelligence testing is more than just getting a score—it's also about understanding how a person thinks and solves problems.

Some Tasks Used to Measure Intelligence

Infants (Birth–18 Months):
• Focus: Sensorimotor development.
• Examples of Tasks:
o Turning over.
o Lifting the head.
o Imitating gestures.
o Tracking moving objects with the eyes.
• Challenges:
o Infants cannot understand instructions like "cooperate" or "be patient."
o Assessment often relies on structured interviews with parents or caregivers.
o Requires examiners skilled in building rapport with preverbal children.

Children:
• Focus: Verbal and performance abilities.
• Examples of Tasks:
o Vocabulary and general knowledge.
o Social judgment and reasoning.
o Memory (auditory and visual).
o Attention and spatial skills.
• Instruction: Often includes practice or teaching items before actual test items to help the child understand the task.

Adults:
• Focus (per Wechsler):
o General knowledge retention.
o Quantitative reasoning.
o Expressive language and memory.
o Social judgment.
• Usage of Intelligence Tests:
o Less often for educational purposes.
o More often for:
▪ Clinical evaluations (e.g., dementia, brain injury).
▪ Legal competency (e.g., ability to make a will).
▪ Insurance assessments (e.g., disability claims).
▪ Career guidance and vocational planning.
• In addition to clinical or vocational uses, adult intelligence data might also be used in:
o Neuropsychological research (e.g., effects of aging on cognition).
o Pre-employment screening (in roles requiring high cognitive demand, though this use is controversial).
o Forensic psychology (e.g., assessing criminal responsibility or risk of recidivism).

Some Tests Used to Measure Intelligence
• From the test user's standpoint, several considerations figure into a test's appeal:
1. The theory (if any) on which the test is based
2. The ease with which the test can be administered
3. The ease with which the test can be scored
4. The ease with which results can be interpreted for a particular purpose
5. The adequacy and appropriateness of the norms
6. The acceptability of the published reliability and validity indices
7. The test's utility in terms of costs versus benefits

Subtest | What It Measures
Information | General knowledge and memory
Comprehension | Common sense, social judgment
Similarities | Abstract thinking and verbal reasoning
Arithmetic | Mental math, concentration, short-term memory
Vocabulary | Word knowledge and verbal ability
Picture Naming | Expressive language
Digit Span | Short-term memory and attention
Letter-Number Sequencing | Working memory, processing speed, sequencing
Picture Completion | Visual detail recognition and nonverbal intelligence
Picture Arrangement | Understanding of social situations and cause-effect logic
Block Design | Visual-spatial reasoning, problem-solving
Object Assembly | Visual-motor coordination, persistence
Coding | Processing speed and learning ability
Symbol Search | Visual processing speed
Matrix Reasoning | Nonverbal reasoning
Word Reasoning | Verbal abstraction
Picture Concepts | Categorical reasoning
Cancellation | Selective visual attention
Historical Background
• Binet-Simon Scale (1905): created by Alfred Binet and Theodore Simon to identify children with developmental disabilities in France.
• Brought to the U.S. by Goddard in 1908 and 1910.
• Kuhlmann (1912) extended the scale to assess infants as young as 3 months.
• Lewis Terman at Stanford revised and standardized the test in the U.S., leading to the Stanford-Binet Intelligence Scale, a foundational instrument still in use today.

The Stanford-Binet Intelligence Scales: Fifth Edition (SB5)

First Edition: Stanford-Binet (1916)
− the first published intelligence test to provide organized and detailed administration and scoring instructions
− It was also the first American test to employ the concept of IQ.
− first test to introduce the concept of an Alternate Item, an item to be substituted for a regular item under specified conditions (such as the situation in which the examiner failed to properly administer the regular item)
− Earlier versions of the Stanford-Binet had employed the ratio IQ, which was based on the concept of mental age (the age level at which an individual appears to be functioning intellectually as indicated by the level of items responded to correctly). The ratio IQ is the ratio of the testtaker's mental age divided by his or her chronological age, multiplied by 100 to eliminate decimals.

1937 Scale (Lewis Terman and Maud Merrill; 11 years to complete)
− Innovations in the 1937 scale included the development of two equivalent forms, labeled L (for Lewis) and M (for Maud).
− new types of tasks for use with preschool-level and adult-level testtakers
− A serious criticism of the test remained: lack of representation of minority groups during the test's development.

1960 Revision (after Terman's death in 1956)
− consisted of only a single form (labeled L-M) and included the items considered to be the best from the two forms of the 1937 test, with no new items added to the test
− use of the deviation IQ tables in place of the ratio IQ tables
− Ratio IQ = mental age / chronological age × 100

Third Edition (1972)
− deviation IQ was used in place of the ratio IQ.
− Deviation IQ reflects a comparison of the performance of the individual with the performance of others of the same age in the standardization sample.
− Test performance is converted into a standard score with a mean of 100 and a standard deviation of 16. If an individual performs at the same level as the average person of the same age, the deviation IQ is 100. If performance is a standard deviation above the mean for the examinee's age group, the deviation IQ is 116.

Fourth Edition (Thorndike)
− Previously, different items were grouped by age and the test was referred to as an age scale.
− In contrast to an age scale, a point scale is a test organized into subtests by category of item, not by age at which most testtakers are presumed capable of responding in the way that is keyed as correct.
− The model was one based on the Cattell-Horn (Horn & Cattell, 1966) model of intelligence. A test composite—formerly described as a deviation IQ score—could also be obtained.
− Test Composite may be defined as a test score or index derived from the combination of, and/or a mathematical transformation of, one or more subtest scores.

Fifth Edition
− was designed for administration to assessees as young as 2 and as old as 85 (or older)
− The test yields a number of composite scores, including a Full Scale IQ derived from the administration of ten subtests.
− Subtest scores all have a mean of 10 and a standard deviation of 3.
− All composite scores have a mean set at 100 and a standard deviation of 15.
− The test yields five Factor Index scores corresponding to each of the five factors that the test is presumed to measure.
− was based on the Cattell-Horn-Carroll (CHC) theory of intellectual abilities
− uses nominal categories designated by certain cutoff boundaries for quick reference:

IQ Range | Label
145–160 | Very gifted / Highly advanced
130–144 | Gifted / Very advanced
120–129 | Superior
110–119 | High average
90–109 | Average
80–89 | Low average
70–79 | Borderline impaired or delayed
55–69 | Mildly impaired or delayed
40–54 | Moderately impaired or delayed

Binet-Simon Scale (1908) – by Alfred Binet & Theodore Simon
Stanford-Binet Intelligence Scale (1916) – by Terman
- first American test to employ the concept of IQ, with detailed administration and scoring instructions
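A brief worked sketch (Python, with hypothetical score figures) of the two IQ metrics defined above: the ratio IQ, and the deviation IQ as a standard score with a mean of 100 and, for the edition described above, a standard deviation of 16:

```python
# Ratio IQ = (mental age / chronological age) x 100
def ratio_iq(mental_age, chronological_age):
    return (mental_age / chronological_age) * 100

# Deviation IQ: standard score comparing the examinee with same-age peers
# (mean 100 and SD 16, as in the Third Edition described above)
def deviation_iq(raw_score, age_group_mean, age_group_sd, mean=100.0, sd=16.0):
    z = (raw_score - age_group_mean) / age_group_sd
    return mean + sd * z

print(ratio_iq(12, 10))          # mental age 12, chronological age 10 -> 120.0
print(deviation_iq(65, 50, 15))  # one SD above the age-group mean -> 116.0
```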
B. WECHSLER TESTS
1. 1939: the Wechsler-Bellevue, an instrument for evaluating the intellectual capacity of Bellevue Hospital's multilingual, multinational, and multicultural clients.
− a point scale
− six verbal subtests and five performance subtests
2. 1942: an equivalent alternate form
3. 1955: scale for adults (WAIS)
a. WAIS-R: revised version
b. WAIS-III: more user-friendly, and the norms were expanded
c. WAIS-IV: subtests are designated core or supplemental; index scores include
i. Verbal Comprehension
ii. Working Memory
iii. Perceptual Reasoning
iv. Processing Speed
v. General Ability Index
• World War I – Army Alpha Test (for literate recruits) and Army Beta Test (for illiterate recruits)
Convergent Thinking – deductive; narrowing down possible solutions to eventually arrive at a single solution.
Divergent Thinking – several solutions are possible.
ISSUES IN THE ASSESSMENT OF INTELLIGENCE
Measured intelligence is influenced by many factors beyond innate ability:
• The definition of intelligence used by the test developer.
• Examiner variables: their diligence and how much feedback they provide.
• Test-taker factors: prior practice/coaching, motivation, and test familiarity.
• Interpretation errors by those analyzing the test results.
Intelligence scores can vary significantly due to these influences, making the assessment less reliable or valid.

Culture and Measured Intelligence
• Culture shapes what is considered intelligent behavior.
• Different cultural and subcultural groups value and promote different abilities, leading to varied performance on standardized tests.
Example: Zambian vs. English children performed differently depending on the material used (wire vs. pencil/paper) due to familiarity, not intelligence.
• Intelligence tests often reflect the dominant culture (e.g., White, Western, middle-class) and may disadvantage those from other cultural backgrounds.
Blacks, Hispanics, and Native Americans often score lower on intelligence tests than Whites or Asians, but these findings are controversial due to:
− Sampling biases
− Difficulty separating genetic from environmental effects
− Diverse subgroups being lumped together
• Intelligence definitions and expressions are culturally bound.

Efforts Toward Culture-Free and Culture-Fair Testing
• Alfred Binet aimed to measure "natural intelligence" without the influence of education or wealth.
• Attempts to create culture-free tests (often nonverbal) failed to be valid predictors of real-world success.
o They lack predictive validity and don't engage the same processes as traditional tests.
o Minority group members often still scored lower on them.
• Result: true culture-free testing is impossible.
• Shift toward culture-fair tests, which:
o Minimize cultural influences in instructions, content, and responses.
o Use nonverbal tasks (e.g., figure classification, mazes).
• These too have limited success:
o Still don't fully equalize outcomes.
o Often less predictive of real-world performance.

Culture-Fair
• Definition: A test designed to minimize the influence of culture in assessing intelligence. It aims to be equally applicable across cultural groups.
• Features:
o Uses nonverbal, abstract tasks (e.g., matrices, mazes, classifications).
o Instructions are often given orally or through pantomime, minimizing language demands.
o Avoids references to specific knowledge, traditions, or values of any one culture.
• Goal: Provide a neutral testing environment so that no cultural group is unfairly advantaged.
• Key Issue: Despite efforts, culture-fair tests have lower predictive validity (they don't predict real-world success as well), and minority group members still often score lower.

Culture Bias
• Definition: Occurs when test items favor the dominant culture, unintentionally disadvantaging individuals from other cultural backgrounds.
• Examples:
o Test content assumes knowledge, experiences, or values common in White, middle-class American culture (e.g., specific vocabulary, customs).
o Subcultural values like group identity, present-time orientation, or modesty may lead to lower scores despite equal cognitive ability.
• Consequences:
o Can misrepresent true intelligence.
o Reinforces inequality and underestimates ability in Black, Hispanic, Native American, and other minority populations.
• Findings: Cultural groups may value and express intelligence differently (e.g., verbal debate in the West vs. modesty and restraint in the East).

Culture-Specific
• Definition: A test designed specifically for one cultural group, reflecting its language, values, and shared experiences.
• Purpose: To measure intelligence more validly within that cultural context, rather than comparing across groups.
• Example:
o BITCH (Black Intelligence Test of Cultural Homogeneity): Designed for African-Americans using culturally familiar content (e.g., slang, brands, customs).
o Tailored for Black Americans, including culturally relevant content.
o Demonstrated that test performance can depend on cultural familiarity, not cognitive ability.
• Criticism:
o May appear more like a satirical or sociocultural awareness tool than a traditional IQ test.
o Raises questions about what defines "intelligence."

Culture Loading
• defined as the extent to which a test incorporates the vocabulary, concepts, traditions, knowledge, and feelings associated with a particular culture.
The Flynn Effect
• Discovered by James R. Flynn, who noted that IQ scores have been rising over generations—this is now known as the Flynn Effect.
• Flynn Effect: the progressive rise in intelligence test scores that is expected to occur on a normed test of intelligence from the date when the test was first normed.
• The gains are especially evident from the date a test is normed (standardized), suggesting newer generations score higher on older tests.
• However, the gains in IQ do not reflect actual increases in true intelligence, as they are not accompanied by academic or practical improvements.
• Flynn suggested psychologists could manipulate test versions to either increase or decrease a child's chances of receiving special services—a controversial and ethically complex recommendation.

THE MOST COMMONLY USED INTELLIGENCE TESTS
SBIT – Stanford-Binet Intelligence Tests
The Wechsler Tests – WAIS, WISC, WPPSI, WASI, WIAT
CFIT – Culture Fair Intelligence Test
OLMAT – Otis-Lennon Mental Abilities Test
OLSAT – Otis-Lennon School Ability Test
RPM – Raven's Progressive Matrices
DAT – Differential Aptitude Tests for Personnel and Career Assessment
PKP – Panukat ng Katalinuhang Pilipino
Practical Consequences
• The Flynn Effect affects school placements, social service eligibility, and even legal
decisions (e.g., whether a person with an intellectual disability can be executed).
• Defense attorneys may exploit outdated tests to make defendants appear more
intelligent than they are—raising ethical concerns.
Theoretical Implications
• The Flynn Effect raises questions about fluid vs. crystallized intelligence:
o Cattell’s theory: Crystallized intelligence should show more gain
(environmental learning).
o Flynn’s findings: Gains are mainly in fluid intelligence (problem-solving,
abstract thinking), which contradicts some expectations.
• There is still debate over the definition of intelligence and how best to measure it.
• Group differences in IQ scores exist, but individual differences are much greater.
• Intelligence tests can predict life outcomes (education, job performance, income),
but we should focus more on environmental factors to improve results.
CHAPTER 11: Personality Assessment: An Overview

PERSONALITY AND PERSONALITY ASSESSMENT
For laypeople, personality refers to components of an individual's makeup that can elicit positive or negative reactions from others. Someone who consistently tends to elicit positive reactions from others is thought to have a "good personality." Someone who consistently tends to elicit not-so-good reactions from others is thought to have a "bad personality" or, perhaps worse yet, "no personality."

Personality has been defined in many different ways in the psychological literature:
Broad Definitions:
o McClelland (1951) defined personality as a full conceptualization of a person's behavior.
o Menninger (1953) offered a holistic definition, including everything about an individual—physical, emotional, and psychological aspects.
Focused or Contextual Definitions:
o Some definitions are narrow, focusing on specific traits (e.g., Goldstein, 1963).
o Others emphasize the social context of personality (e.g., Sullivan, 1953).
Critical Views:
o Byrne (1974) criticized personality psychology as a vague field, calling it "psychology's garbage bin" for research that doesn't fit elsewhere.
Theoretical Relativism:
o Hall and Lindzey (1970) argued that there is no universally applicable definition of personality. They claimed definitions depend on the theoretical perspective used and encouraged readers to choose the one they find most useful.

Working Definition
For practical purposes, the passage adopts a concise and inclusive definition:
Personality – an individual's unique constellation of psychological traits that is relatively stable over time.

PERSONALITY ASSESSMENT
Personality assessment may be defined as the measurement and evaluation of psychological traits, states, values, interests, attitudes, worldview, acculturation, sense of humor, cognitive and behavioral styles, and/or related individual characteristics. In this chapter we overview the process of personality assessment, including different approaches to the construction of personality tests.

Traits, Types, and States
Personality Traits – relatively enduring dispositions; tendencies to act, think, or feel in a certain manner in any given circumstance and that distinguish one person from another.
Personality Types – a general description of people; a constellation of traits that is similar in pattern to one identified category of personality within a taxonomy of personalities.
Personality State – an emotional reaction that varies from one situation to another.
Self-concept – a person's self-definition; an organized and relatively consistent set of assumptions that a person has about himself or herself.
Background:
• Inspired by Allport and Odbert's catalog of 18,000 personality traits (1936).
• Cattell reduced this to 171, then 36 surface traits, and ultimately to 16 source traits through factor analysis.
• Resulted in the 16 Personality Factor Questionnaire (16 PF).

5. Criterion Group Method (Empirical Criterion Keying)
• Uses known groups (criterion group vs. control group) to find which test items differentiate between them.
• Steps:
5. Values in Personality
• Values represent what an individual deems important.
o Instrumental values: means to an end (e.g., honesty, ambition).
o Terminal values: end goals (e.g., self-respect, a comfortable life).
• Cultural background heavily influences values and, in turn, motivation and personality.
7. Worldview
• Worldview: How individuals interpret the world around them, shaped by culture and experience.
• It influences personality expression, decision-making, and interpersonal interactions.
8. Practical Implications
• Cultural background affects:
o Personality assessment results
o Interpretation of those results
o The relevance and validity of certain tools
• A culturally competent assessor should:
o Integrate cultural data into assessments
CHAPTER ASSESSMENT FOR EDUCATION
– How well have students learned and mastered the subject matter they were
taught?
– To what extent are students able to apply what they have learned to novel
circumstances and situations?
– What are the challenges or obstacles that are preventing an individual student
from meeting educational objectives, and how can those obstacles best be
overcome?
– Do failing test scores on a curriculum-specific test really reflect the fact that
the test takers have not mastered the content of the curriculum?
1. Learning Disability
➢ severe discrepancy between achievement and intellectual ability
➢ Is diagnosed if a significant discrepancy exists between the child's measured intellectual ability (usually on an intelligence test) and the level of achievement that could reasonably be expected from the child in one or more areas (including oral expression, listening comprehension, written expression, basic reading skills, reading comprehension, mathematics calculation, and mathematics reasoning).
2. Specific Learning Disability
➢ As defined in 2007 by Public Law 108-147, it is a disorder in one or more of the basic psychological processes involved in understanding or in using language, spoken or written, which disorder may manifest itself in the imperfect ability to listen, think, speak, read, write, spell, or do mathematical calculations.
3. Dynamic Assessment
➢ It is an approach to assessment that departs from reliance on, and can be contrasted to, fixed (so-called "static") tests. Dynamic assessment encompasses an approach to exploring learning potential that is based on a test-intervention-retest model.

Other Tools of Assessment in Educational Settings
8. Performance Assessment
➢ More than choosing the correct response
➢ Essay questions and the development of an art project are examples of performance tasks. By contrast, true–false questions and multiple-choice test items would not be considered performance tasks.
➢ performance task – a work sample designed to elicit representative knowledge, skills, and values from a particular domain of study.
➢ evaluation of performance tasks according to criteria developed by experts
9. Portfolio Assessment (under Performance Assessment) – a work sample.
➢ evaluation of one's work samples.