Finals Psych Ass Reviewer
TEST CONCEPTUALIZATION
Some Preliminary Questions
Pilot Work
TEST CONSTRUCTION
Scaling
Writing Items
Scoring Items
TEST TRYOUT
What Is a Good Item?
ITEM ANALYSIS
The Item-Difficulty Index
The Item-Reliability Index
The Item-Validity Index
The Item-Discrimination Index
Item-Characteristic Curves
Other Considerations in Item Analysis
Qualitative Item Analysis
TEST REVISION
Test Revision as a Stage in New Test Development
Test Revision in the Life Cycle of an Existing Test
The Use of IRT in Building and Revising Tests
INSTRUCTOR-MADE TESTS FOR IN-CLASS USE
Addressing Concerns About Classroom Tests
CLOSE-UP Creating and Validating a Test of Asexuality
MEET AN ASSESSMENT PROFESSIONAL Meet Dr. Scott Birkeland
EVERYDAY PSYCHOMETRICS Adapting Tools of Assessment for Use with Specific Cultural Groups
SELF-ASSESSMENT
TEST DEVELOPMENT
Test Development – an umbrella term for all that goes into the process of creating a test.
➢ Some tests are conceived of and constructed but neither tried-out, nor item-analyzed,
nor revised.
1. Test Conceptualization
− idea for a test
2. Test Construction
− stage in the process of test development that entails writing test items (or re-writing
or revising existing items)
− formatting items, setting scoring rules, and otherwise designing and building a test
3. Test Tryout
− administered to a representative sample of test takers under conditions that
simulate the conditions that the final version of the test will be administered under
4. Item Analysis
− Statistical procedures employed to assist in making judgments about which items
are good as they are, which items need to be revised, and which items should be
discarded.
− The analysis of the test’s items may include analyses of item reliability, item validity,
and item discrimination. Depending on the type of test, item-difficulty level may be
analyzed as well.
5. Test Revision
− action taken to modify a test’s content or format for the purpose of improving the
test’s effectiveness as a tool of measurement
− This action is usually based on item analyses, as well as related information derived
from the test tryout.
TEST CONCEPTUALIZATION
A measurement interest related to aspects of the LGBT (lesbian, gay, bisexual, and transgender) experience has increased. The present authors propose that, in the interest of comprehensive inclusion, an "A" should be added to the end of "LGBT" so that the term is routinely abbreviated as "LGBTA." The additional "A" would acknowledge the existence of asexuality as a sexual orientation or preference.

Norm-referenced versus Criterion-referenced Tests: Item Development Issues
➢ A good item on a norm-referenced achievement test is an item for which high scorers on the test respond correctly; low scorers tend to respond to that same item incorrectly.
2. Unidimensional/Multidimensional
o Rating scales differ in the number of dimensions underlying the ratings being made.
Cumulative Model
− Rule that the higher the score on the test, the higher the testtaker is on the ability, trait, or other
characteristic that the test purports to measure.
Class Scoring (category scoring)
− Rule that testtaker responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar.
TEST TRYOUT
• Having created a pool of items from which the final version of the test will be developed, the test developer will try out the test. The test should be tried out on people who are similar in critical respects to the people for whom the test was designed.
• Example: if a test is designed to aid in decisions regarding the selection of corporate employees with management potential at a certain level, it would be appropriate to try out the test on corporate employees at the targeted level.

CHARACTERISTICS OF TEST TRYOUT
• The number of people on whom the test should be tried out should be no fewer than 5 subjects, and preferably as many as 10, for each item on the test. The more subjects in the tryout, the better, because a larger sample weakens the role of chance in subsequent data analysis. A definite risk in using too few subjects during test tryout comes during factor analysis of the findings, when what we might call Phantom Factors—factors that actually are just artifacts of the small sample size—may emerge.
• Test tryout should be executed under conditions as identical as possible to the conditions under which the standardized test will be administered; all instructions, and everything from the time limits allotted for completing the test to the atmosphere at the test site, should be as similar as possible. The test developer thereby endeavors to ensure that differences in response to the test's items are due in fact to the items, not to extraneous factors.

ITEM ANALYSIS
What Is a Good Item?
1. A good test item is reliable and valid.
2. A good test item helps to discriminate testtakers.
• A good test item is one that is answered correctly (or in an expected manner) by high scorers on the test as a whole. Conversely, a good test item is one that is answered incorrectly by low scorers on the test as a whole.

How does a test developer identify good items?
• After the first draft of the test has been administered to a representative group of examinees, the test developer analyzes test scores and responses to individual items – Item Analysis.
• The different types of statistical scrutiny that the test data can potentially undergo at this point are referred to collectively as Item Analysis. Although item analysis tends to be regarded as a quantitative endeavor, it may also be qualitative.
• Among the tools test developers might employ to analyze and select items are:
1. An index of the item's difficulty
2. An index of the item's reliability
3. An index of the item's validity
4. An index of item discrimination
1. The Item-Difficulty Index
• If everyone gets the item right, the item is too easy; if everyone gets the item wrong, the item is too difficult.
• An index of an item's difficulty is obtained by calculating the proportion of the total number of testtakers who answered the item correctly.
• A lowercase italic "p" (p) is used to denote item difficulty, and a subscript refers to the item number (so p1 is read "item-difficulty index for item 1"). The value of an item-difficulty index can theoretically range from 0 (if no one got the item right) to 1 (if everyone got the item right).
− If 50 of the 100 examinees answered item 2 correctly, then the item-difficulty index for this item would be equal to 50 divided by 100, or .5 (p2 = .5).
− If 75 of the examinees got item 3 right, then p3 would be equal to .75 and we could say that item 3 was easier than item 2.
• *Note: The larger the item-difficulty index, the easier the item.
• Because p refers to the percent of people passing an item, the higher the p for an item, the easier the item. The statistic referred to as an item-difficulty index in the context of achievement testing may be an item-endorsement index in other contexts, such as personality testing.
• An index of the difficulty of the average test item for a particular test can be calculated by averaging the item-difficulty indices for all the test's items. This is accomplished by summing the item-difficulty indices for all test items and dividing by the total number of items on the test.
• For maximum discrimination among the abilities of the testtakers, the optimal average item difficulty is approximately .5, with individual items on the test ranging in difficulty from about .3 to .8.
• The midpoint representing the optimal item difficulty is obtained by summing the chance success proportion and 1.00 and then dividing the sum by 2: optimal difficulty = (chance success proportion + 1.00) / 2. For example, for four-option multiple-choice items the chance success proportion is .25, so the optimal average item difficulty would be (.25 + 1.00) / 2 ≈ .63.

2. The Item-Reliability Index
• The item-reliability index provides an indication of the internal consistency of a test; the higher this index, the greater the test's internal consistency.
• This index is equal to the product of the item-score standard deviation (s) and the correlation (r) between the item score and the total test score.

Factor Analysis and Inter-Item Consistency
• A statistical tool useful in determining whether items on a test appear to be measuring the same thing(s) is factor analysis. Through the judicious use of factor analysis, items that do not "load on" the factor that they were written to tap (that is, items that do not appear to be measuring what they were designed to measure) can be revised or eliminated. If too many items appear to be tapping a particular area, the weakest of such items can be eliminated.
• Additionally, factor analysis can be useful in the test interpretation process, especially when comparing the constellation of responses to the items from two or more groups.
• Thus, for example, if a particular personality test is administered to two groups of hospitalized psychiatric patients, each group with a different diagnosis, then the same items may be found to load on different factors in the two groups. Such information will compel the responsible test developer to revise or eliminate certain items from the test or to describe the differential findings in the test manual.
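As a rough illustration of the indices above, here is a minimal Python sketch (using made-up response data, not any particular test) that computes the item-difficulty index, the average test difficulty, the optimal average difficulty for a given chance-success proportion, and the item-reliability index (item-score standard deviation times the item-total correlation):

```python
import numpy as np

# Made-up scored responses: rows = testtakers, columns = items (1 = correct, 0 = incorrect)
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
])

# Item-difficulty index p: proportion of testtakers who answered each item correctly
p = responses.mean(axis=0)

# Average difficulty of the test: sum of the item-difficulty indices divided by the number of items
average_difficulty = p.mean()

# Optimal average difficulty = (chance success proportion + 1.00) / 2
chance_success = 0.25                      # e.g., four-option multiple-choice items
optimal_difficulty = (chance_success + 1.00) / 2

# Item-reliability index: item-score standard deviation (s) times the
# correlation (r) between the item score and the total test score
total_scores = responses.sum(axis=1)
s = responses.std(axis=0)
r = np.array([np.corrcoef(responses[:, j], total_scores)[0, 1]
              for j in range(responses.shape[1])])
item_reliability_index = s * r

print(p, average_difficulty, optimal_difficulty, item_reliability_index)
```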
3. The Item-Validity Index
• The item-validity index is a statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure. The higher the item-validity index, the greater the test's criterion-related validity. The item-validity index can be calculated once the following two statistics are known:
– the item-score standard deviation
– the correlation between the item score and the criterion score
• The item-score standard deviation of item 1 (denoted by the symbol s1) can be calculated using the index of the item's difficulty (p1) in the following formula: s1 = √[p1(1 − p1)]. For example, if p1 = .5, then s1 = √(.5 × .5) = .5.
• The correlation between the score on item 1 and a score on the criterion measure (denoted by the symbol r1C) is multiplied by item 1's item-score standard deviation (s1), and the product is equal to an index of the item's validity (s1r1C).
• Calculating the item-validity index will be important when the test developer's goal is to maximize the criterion-related validity of the test. A visual representation of the best items on a test (if the objective is to maximize criterion-related validity) can be achieved by plotting each item's item-validity index and item-reliability index.

4. The Item-Discrimination Index
• Indicates how adequately an item separates, or discriminates, between high scorers and low scorers on an entire test.
− An item on an achievement test is not doing its job if it is answered correctly by respondents who least understand the subject matter.
− An item on a test purporting to measure a particular personality trait is not doing its job if responses indicate that people who score very low on the test as a whole (indicating absence or low levels of the trait in question) tend to score very high on the item (indicating that they are very high on the trait in question—contrary to what the test as a whole indicates).
• The item-discrimination index is a measure of item discrimination, symbolized by a lowercase italic "d" (d).
− This estimate of item discrimination compares performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores. The optimal boundary lines for what we refer to as the "upper" and "lower" areas of a distribution of scores will demarcate the upper and lower 27% of the distribution of scores—provided the distribution is normal (Kelley, 1939).
− As the distribution of test scores becomes more platykurtic (flatter), the optimal boundary line for defining upper and lower increases to near 33% (Cureton, 1957). Allen and Yen (1979, p. 122) assure us that "for most applications, any percentage between 25 and 33 will yield similar estimates."
• The item-discrimination index is a measure of the difference between the proportion of high scorers answering an item correctly and the proportion of low scorers answering the item correctly; the higher the value of d, the greater the number of high scorers answering the item correctly. A negative d-value on a particular item is a red flag because it indicates that low-scoring examinees are more likely to answer the item correctly than high-scoring examinees. This situation calls for some action such as revising or eliminating the item.
• Example: A history teacher gave the American History Test to a total of 119 students who were just weeks away from completing ninth grade. The teacher isolated the upper (U) and lower (L) 27% of the test papers, with a total of 32 papers in each group. Observe that 20 testtakers in the U group answered Item 1 correctly and that 16 testtakers in the L group answered Item 1 correctly. With an item-discrimination index equal to .13, Item 1 is probably a reasonable item because more U-group members than L-group members answered it correctly. The higher the value of d, the more adequately the item discriminates the higher-scoring from the lower-scoring testtakers. For this reason, Item 2 is a better item than Item 1 because Item 2's item-discrimination index is .63.
• Example: The highest possible value of d is +1.00. This value indicates that all members of the U group answered the item correctly whereas all members of the L group answered the item incorrectly. If the same proportion of members of the U and L groups pass the item, then the item is not discriminating between testtakers at all and d, appropriately enough, will be equal to 0.
*The lowest value that an index of item discrimination can take is −1.00. A d equal to −1.00 is a test developer's nightmare: it indicates that all members of the U group failed the item and all members of the L group passed it. On the face of it, such an item is the worst possible type of item and is in dire need of revision or elimination. However, through further investigation of this unanticipated finding, the test developer might learn or discover something new about the construct being measured.
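A short Python sketch of the two indices just described, reusing the figures from the ninth-grade American History Test example above and a hypothetical item-criterion correlation for the item-validity index:

```python
import numpy as np

# --- Item-validity index for a single item: s1 * r1C ---
p1 = 0.5                          # item-difficulty index for item 1
s1 = np.sqrt(p1 * (1 - p1))       # item-score standard deviation derived from the difficulty index
r1C = 0.40                        # hypothetical correlation between the item score and the criterion score
item_validity_index = s1 * r1C    # = .20 with these made-up figures

# --- Item-discrimination index d ---
# d = (correct in upper group - correct in lower group) / number of papers in one group,
# where the groups are the upper and lower 27% of the total-score distribution.
def item_discrimination(correct_upper, correct_lower, group_size):
    return (correct_upper - correct_lower) / group_size

# Figures from the American History Test example: 32 papers per group;
# 20 U-group and 16 L-group testtakers answered Item 1 correctly.
d_item1 = item_discrimination(20, 16, 32)   # = .125, reported as .13

print(item_validity_index, d_item1)
```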
Other Considerations in Item Analysis
• Guessing. In achievement testing, the problem of how to handle testtaker guessing is one that has eluded any universally acceptable solution. Methods designed to detect guessing (S.-R. Chang et al., 2011), minimize the effects of guessing (Kubinger et al., 2010), and statistically correct for guessing (Espinosa & Gardeazabal, 2010) have been proposed, but no such method has achieved universal acceptance.
• To better appreciate the complexity of the issues, consider the following three criteria that any correction for guessing must meet, as well as the other interacting issues that must be addressed:
✓ 1. A correction for guessing must recognize that, when a respondent guesses at an answer on an achievement test, the guess is not typically made on a totally random basis. It is more reasonable to assume that the testtaker's guess is based on some knowledge of the subject matter and the ability to rule out one or more of the distractor alternatives. However, the individual testtaker's amount of knowledge of the subject matter will vary from one item to the next.
✓ 2. A correction for guessing must also deal with the problem of omitted items. Sometimes, instead of guessing, the testtaker will simply omit a response to an item. Should the omitted item be scored "wrong"? Should the omitted item be excluded from the item analysis? Should the omitted item be scored as if the testtaker had made a random guess? Exactly how should the omitted item be handled?
✓ 3. Just as some testtakers may be luckier than others in guessing the choices that are keyed correct, any correction for guessing may seriously underestimate or overestimate the effects of guessing for lucky and unlucky testtakers. To date, no solution to the problem of guessing has been deemed entirely satisfactory. The responsible test developer addresses the problem of guessing by including in the test manual (1) explicit instructions regarding this point for the examiner to convey to the examinees and (2) specific instructions for scoring and interpreting omitted items.

Qualitative Item Analysis
• Test users have had a long-standing interest in understanding test performance from the perspective of testtakers (Fiske, 1967; Mosier, 1947). The calculation of item-validity, item-reliability, and other such quantitative indices represents one approach to understanding testtakers. Another general class of research methods is referred to as qualitative.
• Qualitative methods – techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures. Encouraging testtakers—on a group or individual basis—to discuss aspects of their test-taking experience is, in essence, eliciting or generating "data" (words). These data may then be used by test developers, users, and publishers to improve various aspects of the test.
• Qualitative item analysis – a general term for various nonstatistical procedures designed to explore how individual test items work. The analysis compares individual test items to each other and to the test as a whole. In contrast to statistically based procedures, qualitative methods involve exploration of the issues through verbal means such as interviews and group discussions conducted with testtakers and other relevant parties. Some of the topics researchers may wish to explore qualitatively are summarized in a table.

Expert Panels
• In addition to interviewing testtakers individually or in groups, expert panels may also provide qualitative analyses of test items. A Sensitivity Review is a study of test items, typically conducted during the test development process, in which items are examined for fairness to all prospective testtakers and for the presence of offensive language, stereotypes, or situations.
• Some of the possible forms of content bias that may find their way into any achievement test were identified as follows (Stanford Special Report, 1992, pp. 3–4):
i. Status: Are the members of a particular group shown in situations that do not involve authority or leadership?
ii. Stereotype: Are the members of a particular group portrayed as uniformly having certain (1) aptitudes, (2) interests, (3) occupations, or (4) personality characteristics?
iii. Familiarity: Is there greater opportunity on the part of one group to (1) be acquainted with the vocabulary or (2) experience the situation presented by an item?
iv. Offensive Choice of Words: (1) Has a demeaning label been applied, or (2) has a male term been used where a neutral term could be substituted?
v. Other: Panel members were asked to be specific regarding any other indication of bias they detected.
• On the basis of qualitative information from an expert panel or testtakers themselves, a test user or developer may elect to modify or revise the test. Revision typically involves rewording items, deleting items, or creating new items. Note that there is another meaning of test revision beyond that associated with a stage in the development of a new test.
• After a period of time, many existing tests are scheduled for republication in new versions or editions. The development process that the test undergoes as it is modified and revised is called test revision.
TEST REVISION
5. Test Revision
• We first consider aspects of test revision as a stage in the development of a new test. Later we will consider aspects of test revision in the context of modifying an existing test to create a new edition.

Test Revision as a Stage in New Test Development
• Having conceptualized the new test, constructed it, tried it out, and item-analyzed it both quantitatively and qualitatively, what remains is to act judiciously on all the information and mold the test into its final form.
• On the basis of the information generated at the item-analysis stage, some items from the original item pool will be eliminated and others will be rewritten. How is information about the difficulty, validity, reliability, discrimination, and bias of test items—along with information from the item-characteristic curves—integrated and used to revise the test?

Ways of approaching test revision:
• One approach is to characterize each item according to its strengths and weaknesses. Test developers may find that they must balance various strengths and weaknesses across items. If many otherwise good items tend to be somewhat easy, the test developer may purposefully include some more difficult items even if they have other problems. Those more difficult items may be specifically targeted for rewriting.
• Items demonstrating excellent item discrimination, leading to the best possible test discrimination, will be made a priority.
• Write a large item pool – Poor items can be eliminated in favor of those that were shown on the test tryout to be good items.
• The next step is to administer the revised test under standardized conditions to a second appropriate sample of examinees.
• On the basis of an item analysis of data derived from this administration of the second draft of the test, the test developer may deem the test to be in its finished form. Once the test is in finished form, the test's norms may be developed from the data, and the test will be said to have been "standardized" on this (second) sample.
• When the item analysis of data derived from a test administration indicates that the test is not yet in finished form, the steps of revision, tryout, and item analysis are repeated until the test is satisfactory and standardization can occur. Once the test items have been finalized, professional test development procedures dictate that conclusions about the test's validity await a cross-validation of findings.

Test Revision in the Life Cycle of an Existing Test
• Time waits for no person. We all get old, and tests get old, too. Just like people, some tests seem to age more gracefully than others.
Examples: The Rorschach Inkblot Test seems to have held up quite well over the years. The stimulus materials for another projective technique, the Thematic Apperception Test (TAT), are showing their age. There comes a time in the life of most tests when the test will be revised in some way or its publication will be discontinued.
When is that time?
• No hard-and-fast rules exist for when to revise a test.
• The American Psychological Association (APA, 1996b, Standard 3.18) said that an existing test be kept in its present form as long as it remains "useful" but that it should be revised "when significant changes in the domain represented, or new conditions of test use and interpretation, make the test inappropriate for its intended use."
• Practically speaking, many tests are deemed to be due for revision when any of the following conditions exist:
(1) The stimulus materials look dated and current testtakers cannot relate to them.
(2) The verbal content of the test, including the administration instructions and the test items, contains dated vocabulary that is not readily understood by current testtakers.
(3) As popular culture changes and words take on new meanings, certain words or expressions in the test items or directions may be perceived as inappropriate or even offensive to a particular group and must therefore be changed.
(4) The test norms are no longer adequate as a result of group membership changes in the population of potential testtakers.
(5) The test norms are no longer adequate as a result of age-related shifts in the abilities measured over time, and so an age extension of the norms (upward, downward, or in both directions) is necessary.
(6) The reliability or the validity of the test, as well as the effectiveness of individual test items, can be significantly improved by a revision.
(7) The theory on which the test was originally based has been improved significantly, and these changes should be reflected in the design and content of the test.
The steps to revise an existing test parallel those to create a brand-new one.
a) In the test conceptualization phase, the test developer must think through the objectives of
the revision and how they can best be met.
b) In the test construction phase, the proposed changes are made.
c) Test tryout, item analysis, and test revision (in the sense of making final refinements) follow.
• Formal Item-Analysis methods must be employed to evaluate the stability of items between
revisions of the same test (Knowles & Condon, 2000).
• Ultimately, scores on a test and on its updated version may not be directly comparable.
• A key step in the development of all tests—brand-new or revised editions—is cross-validation.
Cross-validation
• refers to the revalidation of a test on a sample of testtakers other than those on whom test
performance was originally found to be a valid predictor of some criterion.
Co-validation
• a test validation process conducted on two or more tests using the same sample of testtakers. When used in conjunction with the creation of norms or the revision of existing norms, this process may also be referred to as co-norming.

Item Formats: Advantages and Disadvantages

Multiple-choice
• Advantages: Can sample a great deal of content in a relatively short time. Allows for precise interpretation and little ambiguity (more than other formats). Minimizes "bluffing" or guessing. May be machine- or computer-scored.
• Disadvantages: Not useful for expression of original or creative thought. Not all subject matter lends itself to reduction to one and only one best answer. Time-consuming to construct. May test trivial knowledge. Guessing may distort results.

Binary-choice items (e.g., true/false)
• Advantages: Can sample a great deal of content in a relatively short time. Easy to construct and score. May be machine- or computer-scored.
• Disadvantages: Susceptible to guessing, especially for "test-wise" students. Difficult to detect use of test-taking strategies. Ambiguous statements may lead to misinterpretation. Can be misused without careful validation.

Matching
• Advantages: Efficient for evaluating recall of related facts. Good for large amounts of content. Easy to score, especially via machine. Can be part of paper-based or computer-based tests.
• Disadvantages: Similar to other selected-response formats, does not test the ability to create a correct answer. Clues in choices may aid guessing. May overemphasize trivial knowledge.

Completion or short-answer (fill-in-the-blank)
• Advantages: Effective for partial knowledge and low-level objectives. Relatively easy to construct. Useful in online testing. Easy to guess with limited clues.
• Disadvantages: May test only surface-level knowledge. Limited response format (usually one word or a few words). Scoring may be inconsistent. Typically hand-scored.

Essay
• Advantages: Good for measuring complex, creative, and original thought. Effective when well constructed. Encourages deep learning and integration. Can assess writing and organization skills.
• Disadvantages: May not cover a wide content area. Scoring is subjective and time-consuming. A testtaker with limited knowledge may write off-topic. Grading can be biased or unreliable. Typically hand-scored.

Developing Item Banks
1. Developing an item bank is not simply a matter of collecting a large number of items. Many item-banking efforts begin with the collection of appropriate items from existing instruments (Instruments A, B, and C) or new items.
2. All items available for use, as well as new items created especially for the item bank, constitute the item pool.
3. The item pool is then evaluated by content experts, potential respondents, and survey experts using a variety of qualitative and quantitative methods. The items that "make the cut" after such scrutiny constitute the preliminary item bank.
4. Administration of all of the questionnaire items to a large and representative sample of the target population.
5. After administration of the preliminary item bank to the entire sample of respondents, responses to the items are evaluated with regard to several variables such as validity, reliability, domain coverage, and differential item functioning. The final item bank will consist of a large set of items all measuring a single domain (or a single trait or ability).
6. A test developer may then use the banked items to create one or more tests with a fixed number of items. For example, a teacher may create two different versions of a math test in order to minimize efforts by testtakers to cheat. The item bank can also be used for purposes of computerized-adaptive testing.
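As a minimal illustration of step 6 above (hypothetical item IDs; not any particular banking system), a short Python sketch of drawing two different fixed-length forms from a banked pool, as a teacher might do to create two versions of a test:

```python
import random

# Hypothetical banked items, already vetted for the same domain
item_bank = [f"ITEM_{i:03d}" for i in range(1, 41)]   # 40 banked items

def build_form(bank, n_items, seed):
    """Draw a fixed-length test form from the item bank (no repeated items within a form)."""
    rng = random.Random(seed)
    return rng.sample(bank, n_items)

form_a = build_form(item_bank, n_items=20, seed=1)
form_b = build_form(item_bank, n_items=20, seed=2)
print(form_a[:5])
print(form_b[:5])
```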
CHAPTER 9 INTELLIGENCE AND ITS MEASUREMENT
ISSUES IN THE ASSESSMENT OF INTELLIGENCE
Culture and Measured Intelligence
The Construct Validity of Tests of Intelligence
THE ROLE OF TESTING AND ASSESSMENT IN EDUCATION IN THE SCHOOLS
Response to Intervention
Dynamic Assessment
The Secondary-School Level
The Kaufman Assessment Battery for Children (K-ABC) and the Kaufman Assessment Battery for Children, Second Edition (KABC-II)
OTHER TOOLS OF ASSESSMENT IN EDUCATIONAL SETTINGS
Measuring Study Habits, Interests, and Attitudes
Measuring Intelligence
− entails sampling an examinee's performance on different types of tests and tasks as a function of developmental level
− Intelligence measurement involves sampling an individual's performance on various tasks suited to their developmental stage.
− Intelligence testing is more than just getting a score—it's also about understanding how a person thinks and solves problems.

Some Tasks Used to Measure Intelligence

Infants (Birth–18 Months):
• Focus: Sensorimotor development.
• Examples of Tasks:
o Turning over.
o Lifting the head.
o Imitating gestures.
o Tracking moving objects with the eyes.
• Challenges:
o Infants cannot understand instructions like "cooperate" or "be patient."
o Assessment often relies on structured interviews with parents or caregivers.
o Requires examiners skilled in building rapport with preverbal children.

Children:
• Focus: Verbal and performance abilities.
• Examples of Tasks:
o Vocabulary and general knowledge.
o Social judgment and reasoning.
o Memory (auditory and visual).
o Attention and spatial skills.
• Instruction: Often includes practice or teaching items before actual test items to help the child understand the task.

Adults:
• Focus (per Wechsler):
o General knowledge retention.
o Quantitative reasoning.
o Expressive language and memory.
o Social judgment.
• Usage of Intelligence Tests:
o Less often for educational purposes.
o More often for:
▪ Clinical evaluations (e.g., dementia, brain injury).
▪ Legal competency (e.g., ability to make a will).
▪ Insurance assessments (e.g., disability claims).
▪ Career guidance and vocational planning.
• In addition to clinical or vocational uses, adult intelligence data might also be used in:
o Neuropsychological research (e.g., effects of aging on cognition).
o Pre-employment screening (in roles requiring high cognitive demand, though this use is controversial).
o Forensic psychology (e.g., assessing criminal responsibility or risk of recidivism).

Some Tests Used to Measure Intelligence
• From the test user's standpoint, several considerations figure into a test's appeal:
1. The theory (if any) on which the test is based
2. The ease with which the test can be administered
3. The ease with which the test can be scored
4. The ease with which results can be interpreted for a particular purpose
5. The adequacy and appropriateness of the norms
6. The acceptability of the published reliability and validity indices
7. The test's utility in terms of costs versus benefits

Subtest | What It Measures
Information | General knowledge and memory
Comprehension | Common sense, social judgment
Similarities | Abstract thinking and verbal reasoning
Arithmetic | Mental math, concentration, short-term memory
Vocabulary | Word knowledge and verbal ability
Picture Naming | Expressive language
Digit Span | Short-term memory and attention
Letter-Number Sequencing | Working memory, processing speed, sequencing
Picture Completion | Visual detail recognition and nonverbal intelligence
Picture Arrangement | Understanding of social situations and cause-effect logic
Block Design | Visual-spatial reasoning, problem-solving
Object Assembly | Visual-motor coordination, persistence
Coding | Processing speed and learning ability
Symbol Search | Visual processing speed
Matrix Reasoning | Nonverbal reasoning
Word Reasoning | Verbal abstraction
Picture Concepts | Categorical reasoning
Cancellation | Selective visual attention
Historical Background
• Binet-Simon Scale (1905): created by Alfred Binet and Theodore Simon to identify children with developmental disabilities in France.
• Brought to the U.S. by Goddard in 1908 and 1910.
• Kuhlmann (1912) extended the scale to assess infants as young as 3 months.
• Lewis Terman at Stanford revised and standardized the test in the U.S., leading to the Stanford-Binet Intelligence Scale, a foundational instrument still in use today.

The Stanford-Binet Intelligence Scales: Fifth Edition (SB5)

First Edition: Stanford-Binet (1916)
− the first published intelligence test to provide organized and detailed administration and scoring instructions
− It was also the first American test to employ the concept of IQ.
− first test to introduce the concept of an Alternate Item, an item to be substituted for a regular item under specified conditions (such as the situation in which the examiner failed to properly administer the regular item)
− Earlier versions of the Stanford-Binet had employed the ratio IQ, which was based on the concept of mental age (the age level at which an individual appears to be functioning intellectually as indicated by the level of items responded to correctly). The ratio IQ is the ratio of the testtaker's mental age divided by his or her chronological age, multiplied by 100 to eliminate decimals.

1937 Scale (Lewis Terman and Maud Merrill; 11 years to complete)
− Innovations in the 1937 scale included the development of two equivalent forms, labeled L (for Lewis) and M (for Maud).
− new types of tasks for use with preschool-level and adult-level testtakers
− A serious criticism of the test remained: lack of representation of minority groups during the test's development.

1960 Revision (after Terman's death in 1956)
− consisted of only a single form (labeled L-M) and included the items considered to be the best from the two forms of the 1937 test, with no new items added to the test
− use of the deviation IQ tables in place of the ratio IQ tables
− Ratio IQ = mental age / chronological age × 100

Third Edition (1972)
− deviation IQ was used in place of the ratio IQ.
− Deviation IQ reflects a comparison of the performance of the individual with the performance of others of the same age in the standardization sample.
− Test performance is converted into a standard score with a mean of 100 and a standard deviation of 16. If an individual performs at the same level as the average person of the same age, the deviation IQ is 100. If performance is a standard deviation above the mean for the examinee's age group, the deviation IQ is 116.

Fourth Edition (Thorndike)
− Previously, different items were grouped by age and the test was referred to as an age scale.
− In contrast to an age scale, a point scale is a test organized into subtests by category of item, not by age at which most testtakers are presumed capable of responding in the way that is keyed as correct.
− The model was one based on the Cattell-Horn (Horn & Cattell, 1966) model of intelligence. A test composite—formerly described as a deviation IQ score—could also be obtained.
− Test Composite may be defined as a test score or index derived from the combination of, and/or a mathematical transformation of, one or more subtest scores.

Fifth Edition
− was designed for administration to assessees as young as 2 and as old as 85 (or older)
− The test yields a number of composite scores, including a Full Scale IQ derived from the administration of ten subtests.
− Subtest scores all have a mean of 10 and a standard deviation of 3.
− All composite scores have a mean set at 100 and a standard deviation of 15.
− The test yields five Factor Index scores corresponding to each of the five factors that the test is presumed to measure.
− was based on the Cattell-Horn-Carroll (CHC) theory of intellectual abilities
− uses nominal categories designated by certain cutoff boundaries for quick reference:

IQ Range | Label
145–160 | Very gifted / Highly advanced
130–144 | Gifted / Very advanced
120–129 | Superior
110–119 | High average
90–109 | Average
80–89 | Low average
70–79 | Borderline impaired or delayed
55–69 | Mildly impaired or delayed
40–54 | Moderately impaired or delayed

Binet-Simon Scale (1908) – by Alfred Binet & Theodore Simon
Stanford-Binet Intelligence Scale (1916) – by Terman
- first American test to employ the concept of IQ, with detailed administration and scoring instructions
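A brief worked sketch (Python, with hypothetical score figures) of the two IQ metrics defined above: the ratio IQ, and the deviation IQ as a standard score with a mean of 100 and, for the edition described above, a standard deviation of 16:

```python
# Ratio IQ = (mental age / chronological age) x 100
def ratio_iq(mental_age, chronological_age):
    return (mental_age / chronological_age) * 100

# Deviation IQ: standard score comparing the examinee with same-age peers
# (mean 100 and SD 16, as in the Third Edition described above)
def deviation_iq(raw_score, age_group_mean, age_group_sd, mean=100.0, sd=16.0):
    z = (raw_score - age_group_mean) / age_group_sd
    return mean + sd * z

print(ratio_iq(12, 10))          # mental age 12, chronological age 10 -> 120.0
print(deviation_iq(65, 50, 15))  # one SD above the age-group mean -> 116.0
```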
B. WECHSLER TESTS
1. 1939: the Wechsler-Bellevue, an instrument for evaluating the intellectual capacity of Bellevue Hospital's multilingual, multinational, and multicultural clients.
− a point scale
− six verbal subtests and five performance subtests
2. 1942: an equivalent alternate form
3. 1955: scale for adults (WAIS)
a. WAIS-R: revised version
b. WAIS-III: more user-friendly, and the norms were expanded
c. WAIS-IV: subtests are designated core or supplemental; index scores include
i. Verbal Comprehension
ii. Working Memory
iii. Perceptual Reasoning
iv. Processing Speed
v. General Ability Index
• World War I – Army Alpha Test (for literate recruits) and Army Beta Test (for illiterate recruits)
Convergent Thinking – deductive; narrowing down possible solutions to eventually arrive at a single solution.
Divergent Thinking – several solutions are possible.
ISSUES IN THE ASSESSMENT OF INTELLIGENCE
Measured intelligence is influenced by many factors beyond innate ability:
• The definition of intelligence used by the test developer.
• Examiner variables: their diligence and how much feedback they provide.
• Test-taker factors: prior practice/coaching, motivation, and test familiarity.
• Interpretation errors by those analyzing the test results.
Intelligence scores can vary significantly due to these influences, making the assessment less reliable or valid.

Culture and Measured Intelligence
• Culture shapes what is considered intelligent behavior.
• Different cultural and subcultural groups value and promote different abilities, leading to varied performance on standardized tests.
Example: Zambian vs. English children performed differently depending on the material used (wire vs. pencil/paper) due to familiarity, not intelligence.
• Intelligence tests often reflect the dominant culture (e.g., White, Western, middle-class) and may disadvantage those from other cultural backgrounds.
Blacks, Hispanics, and Native Americans often score lower on intelligence tests than Whites or Asians, but these findings are controversial due to:
− Sampling biases
− Difficulty separating genetic from environmental effects
− Diverse subgroups being lumped together
• Intelligence definitions and expressions are culturally bound.

Efforts Toward Culture-Free and Culture-Fair Testing
• Alfred Binet aimed to measure "natural intelligence" without the influence of education or wealth.
• Attempts to create culture-free tests (often nonverbal) failed to be valid predictors of real-world success.
o They lack predictive validity and don't engage the same processes as traditional tests.
o Minority group members often still scored lower on them.
• Result: true culture-free testing is impossible.
• Shift toward culture-fair tests, which:
o Minimize cultural influences in instructions, content, and responses.
o Use nonverbal tasks (e.g., figure classification, mazes).
• These too have limited success:
o Still don't fully equalize outcomes.
o Often less predictive of real-world performance.

Culture-Fair
• Definition: A test designed to minimize the influence of culture in assessing intelligence. It aims to be equally applicable across cultural groups.
• Features:
o Uses nonverbal, abstract tasks (e.g., matrices, mazes, classifications).
o Instructions are often given orally or through pantomime, minimizing language demands.
o Avoids references to specific knowledge, traditions, or values of any one culture.
• Goal: Provide a neutral testing environment so that no cultural group is unfairly advantaged.
• Key Issue: Despite efforts, culture-fair tests have lower predictive validity (they don't predict real-world success as well), and minority group members still often score lower.

Culture Bias
• Definition: Occurs when test items favor the dominant culture, unintentionally disadvantaging individuals from other cultural backgrounds.
• Examples:
o Test content assumes knowledge, experiences, or values common in White, middle-class American culture (e.g., specific vocabulary, customs).
o Subcultural values like group identity, present-time orientation, or modesty may lead to lower scores despite equal cognitive ability.
• Consequences:
o Can misrepresent true intelligence.
o Reinforces inequality and underestimates ability in Black, Hispanic, Native American, and other minority populations.
• Findings: Cultural groups may value and express intelligence differently (e.g., verbal debate in the West vs. modesty and restraint in the East).

Culture-Specific
• Definition: A test designed specifically for one cultural group, reflecting its language, values, and shared experiences.
• Purpose: To measure intelligence more validly within that cultural context, rather than comparing across groups.
• Example:
o BITCH (Black Intelligence Test of Cultural Homogeneity): Designed for African-Americans using culturally familiar content (e.g., slang, brands, customs).
o Tailored for Black Americans, including culturally relevant content.
o Demonstrated that test performance can depend on cultural familiarity, not cognitive ability.
• Criticism:
o May appear more like a satirical or sociocultural awareness tool than a traditional IQ test.
o Raises questions about what defines "intelligence."

Culture Loading
• defined as the extent to which a test incorporates the vocabulary, concepts, traditions, knowledge, and feelings associated with a particular culture.
The Flynn Effect
• Discovered by James R. Flynn, who noted that IQ scores have been rising over generations—this is now known as the Flynn Effect.
• Flynn Effect: the progressive rise in intelligence test scores that is expected to occur on a normed test of intelligence from the date when the test was first normed.
• The gains are especially evident from the date a test is normed (standardized), suggesting newer generations score higher on older tests.
• However, the gains in IQ do not reflect actual increases in true intelligence, as they are not accompanied by academic or practical improvements.
• Flynn suggested psychologists could manipulate test versions to either increase or decrease a child's chances of receiving special services—a controversial and ethically complex recommendation.

THE MOST COMMONLY USED INTELLIGENCE TESTS
SBIT – Stanford-Binet Intelligence Tests
The Wechsler Tests – WAIS, WISC, WPPSI, WASI, WIAT
CFIT – Culture Fair Intelligence Test
OLMAT – Otis-Lennon Mental Abilities Test
OLSAT – Otis-Lennon School Ability Test
RPM – Raven's Progressive Matrices
DAT – Differential Aptitude Tests for Personnel and Career Assessment
PKP – Panukat ng Katalinuhang Pilipino
Practical Consequences
• The Flynn Effect affects school placements, social service eligibility, and even legal
decisions (e.g., whether a person with an intellectual disability can be executed).
• Defense attorneys may exploit outdated tests to make defendants appear more
intelligent than they are—raising ethical concerns.
Theoretical Implications
• The Flynn Effect raises questions about fluid vs. crystallized intelligence:
o Cattell’s theory: Crystallized intelligence should show more gain
(environmental learning).
o Flynn’s findings: Gains are mainly in fluid intelligence (problem-solving,
abstract thinking), which contradicts some expectations.
• There is still debate over the definition of intelligence and how best to measure it.
• Group differences in IQ scores exist, but individual differences are much greater.
• Intelligence tests can predict life outcomes (education, job performance, income),
but we should focus more on environmental factors to improve results.
CHAPTER 11: Personality Assessment: An Overview

PERSONALITY AND PERSONALITY ASSESSMENT
For laypeople, personality refers to components of an individual's makeup that can elicit positive or negative reactions from others. Someone who consistently tends to elicit positive reactions from others is thought to have a "good personality." Someone who consistently tends to elicit not-so-good reactions from others is thought to have a "bad personality" or, perhaps worse yet, "no personality."

Personality has been defined in many different ways in the psychological literature:
Broad Definitions:
o McClelland (1951) defined personality as a full conceptualization of a person's behavior.
o Menninger (1953) offered a holistic definition, including everything about an individual—physical, emotional, and psychological aspects.
Focused or Contextual Definitions:
o Some definitions are narrow, focusing on specific traits (e.g., Goldstein, 1963).
o Others emphasize the social context of personality (e.g., Sullivan, 1953).
Critical Views:
o Byrne (1974) criticized personality psychology as a vague field, calling it "psychology's garbage bin" for research that doesn't fit elsewhere.
Theoretical Relativism:
o Hall and Lindzey (1970) argued that there is no universally applicable definition of personality. They claimed definitions depend on the theoretical perspective used and encouraged readers to choose the one they find most useful.

Working Definition
For practical purposes, the passage adopts a concise and inclusive definition:
Personality – an individual's unique constellation of psychological traits that is relatively stable over time.

PERSONALITY ASSESSMENT
Personality assessment may be defined as the measurement and evaluation of psychological traits, states, values, interests, attitudes, worldview, acculturation, sense of humor, cognitive and behavioral styles, and/or related individual characteristics. In this chapter we overview the process of personality assessment, including different approaches to the construction of personality tests.

Traits, Types, and States
Personality Traits – relatively enduring dispositions; tendencies to act, think, or feel in a certain manner in any given circumstance and that distinguish one person from another.
Personality Types – a general description of people; a constellation of traits that is similar in pattern to one identified category of personality within a taxonomy of personalities.
Personality State – an emotional reaction that varies from one situation to another.
Self-concept – a person's self-definition; an organized and relatively consistent set of assumptions that a person has about himself or herself.
Background:
• Inspired by Allport and Odbert's catalog of 18,000 personality traits (1936).
• Cattell reduced this to 171, then 36 surface traits, and ultimately to 16 source traits through factor analysis.
• Resulted in the 16 Personality Factor Questionnaire (16 PF).

5. Criterion Group Method (Empirical Criterion Keying)
• Uses known groups (criterion group vs. control group) to find which test items differentiate between them.
• Steps:
5. Values in Personality
• Values represent what an individual deems important.
o Instrumental values: means to an end (e.g., honesty, ambition).
o Terminal values: end goals (e.g., self-respect, a comfortable life).
• Cultural background heavily influences values and, in turn, motivation and personality.
7. Worldview
• Worldview: How individuals interpret the world around them, shaped by culture and experience.
• It influences personality expression, decision-making, and interpersonal interactions.
8. Practical Implications
• Cultural background affects:
o Personality assessment results
o Interpretation of those results
o The relevance and validity of certain tools
• A culturally competent assessor should:
o Integrate cultural data into assessments
CHAPTER ASSESSMENT FOR EDUCATION
– How well have students learned and mastered the subject matter they were
taught?
– To what extent are students able to apply what they have learned to novel
circumstances and situations?
– What are the challenges or obstacles that are preventing an individual student
from meeting educational objectives, and how can those obstacles best be
overcome?
– Do failing test scores on a curriculum-specific test really reflect the fact that
the test takers have not mastered the content of the curriculum?
1. Learning Disability
➢ severe discrepancy between achievement and intellectual ability
➢ Is diagnosed if a significant discrepancy exists between the child's measured intellectual ability (usually on an intelligence test) and the level of achievement that could reasonably be expected from the child in one or more areas (including oral expression, listening comprehension, written expression, basic reading skills, reading comprehension, mathematics calculation, and mathematics reasoning).
2. Specific Learning Disability
➢ As defined in 2007 by Public Law 108-147, it is a disorder in one or more of the basic psychological processes involved in understanding or in using language, spoken or written, which disorder may manifest itself in the imperfect ability to listen, think, speak, read, write, spell, or do mathematical calculations.
3. Dynamic Assessment
➢ It is an approach to assessment that departs from reliance on, and can be contrasted to, fixed (so-called "static") tests. Dynamic assessment encompasses an approach to exploring learning potential that is based on a test-intervention-retest model.

Other Tools of Assessment in Educational Settings
8. Performance Assessment
➢ More than choosing the correct response
➢ Essay questions and the development of an art project are examples of performance tasks. By contrast, true–false questions and multiple-choice test items would not be considered performance tasks.
➢ performance task – a work sample designed to elicit representative knowledge, skills, and values from a particular domain of study.
➢ evaluation of performance tasks according to criteria developed by experts
9. Portfolio Assessment (under Performance Assessment) – a work sample.
➢ evaluation of one's work samples.