Psychological Assessment (Finals)
All fields of human endeavour use measurement in some form, and each field has its own set of measuring tools and measuring units. For example, if you're recently engaged or thinking about becoming engaged, you may have learned about a unit of measure called the carat. If you've been shopping for a computer, you may have learned something about a unit of measurement called a byte.

As a student of psychological measurement, you need a working familiarity with some of the commonly used units of measure in psychology as well as knowledge of some of the many measuring tools employed. In the pages that follow, you will gain that knowledge as well as an acquaintance with the history of measurement in psychology and an understanding of its theoretical basis.

Testing and Assessment
The roots of contemporary psychological testing and assessment can be found in early twentieth-century France. In 1905, Alfred Binet and a colleague published a test designed to help place Paris schoolchildren in appropriate classes. During World War II, the military would depend even more on psychological tests to screen recruits for service.

Following the war, more and more tests purporting to measure an ever-widening array of psychological variables were developed and used. There were tests to measure not only intelligence but also personality, brain functioning, performance at work, and many other aspects of psychological and social functioning.

Psychological Testing and Assessment Defined
The world's receptivity to Binet's test in the early twentieth century spawned not only more tests but more test developers, more test publishers, more test users, and the emergence of what, logically enough, has become known as a testing enterprise. "Testing" was the term used to refer to everything from the administration of a test (as in "Testing in progress") to the interpretation of a test score ("The testing indicated that . . .").

The OSS model—using an innovative variety of evaluative tools along with data from the evaluations of highly trained assessors—would later inspire what is now referred to as the assessment center approach to personnel evaluation (Bray, 1982). Society at large is best served by a clear definition of and differentiation between psychological testing and assessment, as well as related terms such as psychological test user and psychological assessor.

We define psychological assessment as the gathering and integration of psychology-related data for the purpose of making a psychological evaluation that is accomplished through the use of tools such as tests, interviews, case studies, behavioural observation, and specially designed apparatuses and measurement procedures. We define psychological testing as the process of measuring psychology-related variables by means of devices or procedures designed to obtain a sample of behavior.

Varieties of assessment
The term assessment may be modified in a seemingly endless number of ways, each such modification referring to a particular variety or area of assessment. The term educational assessment, for example, refers to, broadly speaking, the use of tests and other tools to evaluate abilities and skills relevant to success or failure in a school or pre-school context.

For the record, the term retrospective assessment may be defined as the use of evaluative tools to draw conclusions about psychological aspects of a person as they existed at some point in time prior to the assessment. Psychological assessment by means of smartphones also serves as an example of an approach to assessment called ecological momentary assessment (EMA). EMA refers to the "in the moment" evaluation of specific problems and related cognitive and behavioral variables at the very time and place that they occur.

The process of assessment
In general, the process of assessment begins with a referral for assessment from a source such as a teacher, school psychologist, counselor, or judge. Other assessors view the process of assessment as more of a collaboration between the assessor and the assessed. In that approach, therapeutic self-discovery and new understandings are encouraged throughout the assessment process.

Another approach to assessment that seems to have picked up momentum in recent years, most notably in educational settings, is referred to as dynamic assessment (Poehner & van Compernolle, 2011). The term dynamic may suggest that a psychodynamic or psychoanalytic approach to assessment is being applied.
However, that is not the case. As used in the present context, dynamic is used to describe the interactive, changing, or varying nature of the assessment. In general, dynamic assessment refers to an interactive approach to psychological assessment that usually follows a model of (1) evaluation, (2) intervention of some sort, and (3) evaluation.

The Tools of Psychological Assessment

The Test
A test may be defined simply as a measuring device or procedure. When the word test is prefaced with a modifier, it refers to a device or procedure designed to measure a variable related to that modifier. In a like manner, the term psychological test refers to a device or procedure designed to measure variables related to psychology (such as intelligence, personality, aptitude, interests, attitudes, or values).

The term format pertains to the form, plan, structure, arrangement, and layout of test items as well as to related considerations such as time limits. Format is also used to refer to the form in which a test is administered: computerized, pencil-and-paper, or some other form.

In testing and assessment, we may formally define score as a code or summary statement, usually but not necessarily numerical in nature, that reflects an evaluation of performance on a test, task, interview, or some other sample of behavior. Scoring is the process of assigning such evaluative codes or statements to performance on tests, tasks, interviews, or other behaviour samples. In the world of psychological assessment, many different types of scores exist.

Scores themselves can be described and categorized in many different ways. For example, one type of score is the cut score. A cut score (also referred to as a cutoff score or simply a cutoff) is a reference point, usually numerical, derived by judgment and used to divide a set of data into two or more classifications.

The Interview
In everyday conversation, the word interview conjures images of face-to-face talk. But the interview as a tool of psychological assessment typically involves more than talk. If the interview is conducted face-to-face, then the interviewer is probably taking note of not only the content of what is said but also the way it is being said.

More specifically, the interviewer is taking note of both verbal and nonverbal behavior. Nonverbal behavior may include the interviewee's "body language," movements, and facial expressions in response to the interviewer, the extent of eye contact, apparent willingness to cooperate, and general reaction to the demands of the interview.

In its broadest sense, then, we can define an interview as a method of gathering information through direct communication involving reciprocal exchange. In some instances, what is called a panel interview (also referred to as a board interview) is employed. Here, more than one interviewer participates in the assessment. Motivational interviewing may be defined as a therapeutic dialogue that combines person-centered listening skills, such as openness and empathy, with the use of cognition-altering techniques designed to positively affect motivation and effect therapeutic change.

The Portfolio
Students and professionals in many different fields of endeavor, ranging from art to architecture, keep files of their work products. These work products—whether retained on paper, canvas, film, video, audio, or some other medium—constitute what is called a portfolio. As samples of one's ability and accomplishment, a portfolio may be used as a tool of evaluation.

Case history data refers to records, transcripts, and other accounts in written, pictorial, or other form that preserve archival information, official and informal accounts, and other data and items relevant to an assessee. Case history data may include files or excerpts from files maintained at institutions and agencies such as schools, hospitals, employers, religious institutions, and criminal justice agencies.

Behavioral Observation
If you want to know how someone behaves in a particular situation, observe his or her behaviour in that situation. Such "down-home" wisdom underlies at least one approach to evaluation. Behavioral observation, as it is employed by assessment professionals, may be defined as monitoring the actions of others or oneself by visual or electronic means while recording quantitative and/or qualitative information regarding those actions. This variety of behavioral observation is referred to as naturalistic observation.

Role play may be defined as acting an improvised or partially improvised part in a simulated situation. A role-play test is a tool of assessment wherein assessees are directed to act as if they were in a particular situation. Assessees may then be evaluated with regard to their expressed thoughts, behaviors, abilities, and other variables. (Note that role play is hyphenated when used as an adjective or a verb but not as a noun.)
Role play is useful in evaluating various skills.

Computers as Tools
We have already made reference to the role computers play in contemporary assessment in the context of generating simulations. They may also help in the measurement of variables that in the past were quite difficult to quantify. As test administrators, computers do much more than replace the "equipment" that was so widely used in the past (a number 2 pencil). Computers can serve as test administrators (online or off) and as highly efficient test scorers. Within seconds they can derive not only test scores but patterns of test scores. Scoring may be done on-site (local processing) or conducted at some central location (central processing).

The acronym CAPA refers to the term computer-assisted psychological assessment. By the way, here the word assisted typically refers to the assistance computers provide to the test user, not the testtaker. Another acronym you may come across is CAT, this for computer adaptive testing. The adaptive in this term is a reference to the computer's ability to tailor the test to the testtaker's ability or test-taking pattern.

The test user
Psychological tests and assessment methodologies are used by a wide range of professionals, including clinicians, counselors, school psychologists, human resources personnel, consumer psychologists, experimental psychologists, and social psychologists.

The test taker
We have all been test takers. However, we have not all approached tests in the same way.

Society at large
The societal need for "organizing" and "systematizing" has historically manifested itself in such varied questions as "Who is a witch?," "Who is schizophrenic?," and "Who is qualified?" The specific questions asked have shifted with societal concerns.

Other parties
Beyond the four primary parties we have focused on here, let's briefly make note of others who may participate in varied ways in the testing and assessment enterprise. Organizations, companies, and governmental agencies sponsor the development of tests for various reasons, such as to certify personnel.
1. T scores – Mean of 50, SD of 10 (Formula: z-score × 10 + 50)
2. Stanines – Mean of 5, SD of 2 (Formula: z-score × 2 + 5)

c. The Normal Curve and Standard Scores
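As a quick illustration of the standard-score formulas above, here is a minimal sketch (Python; the raw score, group mean, and SD are made-up values) that converts a raw score to a z-score, a T score, and a stanine:

    import numpy as np

    def to_standard_scores(raw, mean, sd):
        # z-score: distance from the group mean in SD units
        z = (raw - mean) / sd
        # T score: mean of 50, SD of 10
        t = z * 10 + 50
        # stanine: mean of 5, SD of 2, rounded and kept inside the 1-9 band
        stanine = int(np.clip(round(z * 2 + 5), 1, 9))
        return z, t, stanine

    # example: raw score of 65 in a group with mean 50 and SD 10
    z, t, stanine = to_standard_scores(65, mean=50, sd=10)
    print(z, t, stanine)   # 1.5, 65.0, 8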
2. Measures of Correlation
a. Pearson's Product Moment Correlation – parametric test for interval data
b. Spearman's Rho Correlation – non-parametric test for ordinal data
c. Kendall's Coefficient of Concordance – non-parametric test for ordinal data
d. Phi Coefficient – non-parametric test for dichotomous nominal data
e. Lambda – non-parametric test for 2 groups (dependent and independent variable) of nominal data

***Correlation Ranges:
1.00 : Perfect relationship
0.75 – 0.99 : Very strong relationship
0.50 – 0.74 : Strong relationship
0.25 – 0.49 : Weak relationship
0.01 – 0.24 : Very weak relationship
0.00 : No relationship
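To connect these coefficients to actual numbers, here is a minimal sketch (Python with scipy; the paired data are invented) that computes Pearson and Spearman coefficients and labels their strength using the ranges above:

    from scipy.stats import pearsonr, spearmanr

    iq     = [95, 100, 108, 112, 120, 126]   # interval data
    grades = [82, 85, 84, 90, 92, 95]        # could also be treated as ranks

    r, p_r = pearsonr(iq, grades)            # parametric, interval data
    rho, p_rho = spearmanr(iq, grades)       # non-parametric, ordinal data

    def describe(r):
        # strength labels follow the correlation ranges listed above
        r = abs(r)
        if r == 1.0:  return "perfect"
        if r >= 0.75: return "very strong"
        if r >= 0.50: return "strong"
        if r >= 0.25: return "weak"
        if r > 0.00:  return "very weak"
        return "none"

    print(round(r, 2), round(rho, 2), describe(r))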
3. Measures of Prediction
a. Biserial Correlation – predictive test for artificially dichotomized and categorical data as criterion with continuous data as predictor
b. Point-Biserial Correlation – predictive test for genuinely dichotomized and categorical data as criterion with continuous data as predictors
c. Tetrachoric Correlation – predictive test for dichotomous data, with categorical data as criterion and categorical data as predictors
d. Simple Linear Regression – a predictive test which involves one criterion that is continuous in nature with only one predictor that is continuous
e. Multiple Linear Regression – a predictive test which involves one criterion that is continuous in nature with more than one continuous predictor
f. Ordinal Regression – a predictive test which involves a criterion that is ordinal in nature with more than one predictor that is continuous in nature
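Below is a hedged sketch (Python with scipy; the data are invented for illustration) of two of the prediction statistics named above: a point-biserial correlation with a genuinely dichotomous criterion, and a simple linear regression with one continuous predictor:

    from scipy.stats import pointbiserialr, linregress

    study_hours = [2, 4, 5, 7, 8, 10, 12, 15]        # continuous predictor
    passed      = [0, 0, 0, 1, 0, 1, 1, 1]           # genuinely dichotomous criterion
    exam_score  = [60, 65, 68, 75, 72, 80, 86, 93]   # continuous criterion

    r_pb, p = pointbiserialr(passed, study_hours)    # point-biserial correlation
    fit = linregress(study_hours, exam_score)        # simple linear regression

    # predicted exam score for a student who studies 9 hours
    predicted = fit.slope * 9 + fit.intercept
    print(round(r_pb, 2), round(predicted, 1))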
4. Chi-Square Test
a. Goodness of Fit – used to measure differences; involves nominal data and only one variable with 2 or more categories
b. Test of Independence – used to measure correlation; involves nominal data and two variables with two or more categories
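A minimal sketch of the two chi-square uses above (Python with scipy; the frequency counts are hypothetical):

    from scipy.stats import chisquare, chi2_contingency

    # goodness of fit: one nominal variable, observed vs. expected frequencies
    observed = [18, 22, 20]            # e.g., preferred counseling modality
    expected = [20, 20, 20]
    chi2_gof, p_gof = chisquare(observed, f_exp=expected)

    # test of independence: two nominal variables in a contingency table
    table = [[30, 10],                  # e.g., passed/failed by review-class attendance
             [20, 25]]
    chi2_ind, p_ind, dof, expected_freq = chi2_contingency(table)
    print(round(chi2_gof, 2), round(chi2_ind, 2))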
5. Comparison of Two Groups
a. Paired t-test – a parametric test for paired groups with normal distribution
b. Unpaired t-test – a parametric test for unpaired groups with normal distribution
c. Wilcoxon Signed-Rank Test – a non-parametric test for paired groups with non-normal distribution
d. Mann-Whitney U test – a non-parametric test for unpaired groups with non-normal distribution

6. Comparison of Three or More Groups
a. Repeated measures ANOVA – a parametric test for matched groups with normal distribution
b. One-Way/Two-Way ANOVA – a parametric test for unmatched groups with normal distribution
c. Friedman F test – a non-parametric test for matched groups with non-normal distribution
d. Kruskal-Wallis H test – a non-parametric test for unmatched groups with non-normal distribution
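A short sketch matching each design above to its test (Python with scipy; all scores are fabricated):

    from scipy.stats import ttest_rel, ttest_ind, mannwhitneyu, f_oneway, kruskal

    pretest  = [10, 12, 14, 15, 18]
    posttest = [13, 15, 15, 18, 21]
    group_a  = [10, 12, 14, 15, 18]
    group_b  = [16, 17, 19, 20, 23]
    group_c  = [11, 13, 13, 16, 17]

    t_paired, p1   = ttest_rel(pretest, posttest)     # paired t-test
    t_unpaired, p2 = ttest_ind(group_a, group_b)      # unpaired t-test
    u, p3          = mannwhitneyu(group_a, group_b)   # Mann-Whitney U
    f, p4          = f_oneway(group_a, group_b, group_c)   # one-way ANOVA
    h, p5          = kruskal(group_a, group_b, group_c)    # Kruskal-Wallis H
    print(round(t_paired, 2), round(f, 2))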
7. Factor Analysis

chapter v: psychometric properties of a good test

Reliability
- the stability or consistency of the measurement

1. Goals of Reliability
a. estimate errors in psychological measurement
b. devise techniques to improve testing so errors are reduced

2. Sources of Measurement Error
(source of error : type of test prone to this error : appropriate measure used to estimate error)
- Inter-scorer differences : tests scored with a degree of subjectivity : scorer reliability
- Time sampling error : tests of relatively stable traits or behavior : test-retest reliability (rtt), a.k.a. stability coefficient
- Content sampling error : tests for which consistency of results, as a whole, is required : alternate-form reliability (a.k.a. coefficient of equivalence) or split-half reliability (a.k.a. coefficient of internal consistency)
- Inter-item inconsistency : tests that require inter-item consistency : split-half reliability or more stringent internal consistency measures, such as KR-20 or Cronbach Alpha
- Inter-item inconsistency and content heterogeneity combined : tests that require inter-item consistency and homogeneity : internal consistency measures and additional evidence of homogeneity
- Time and content sampling error combined : tests that require stability and consistency of results as a whole : delayed alternate-form reliability

3. Types of Reliability
A. Test-Retest Reliability
- compare the scores of individuals who have been measured twice by the instrument
- this is not applicable for tests involving reasoning and ingenuity
- a longer interval will result in a lower correlation coefficient, while a shorter interval will result in a higher correlation
- the ideal time interval for test-retest reliability is 2-4 weeks
- source of error variance is time sampling
- utilizes Pearson r or Spearman rho

B. Parallel-Forms/Alternate-Forms Reliability
- the same persons are tested with one form on the first occasion and with another, equivalent form on the second
- the administration of the second, equivalent form either takes place immediately or fairly soon
- the two forms should be truly parallel: independently constructed tests designed to meet the same specifications, which contain the same number of items, have items which are expressed in the same form, have items that cover the same type of content, have items with the same range of difficulty, and have the same instructions, time limits, illustrative examples, format, and all other aspects of the test
- has the most universal applicability
- for immediate alternate forms, the source of error variance is content sampling
- for delayed alternate forms, the source of error variance is time sampling and content sampling
- utilizes Pearson r or Spearman rho

C. Split-Half Reliability
- two scores are obtained for each person by dividing the test into equivalent halves (odd-even split or top-bottom split)
- the reliability of the test is directly related to the length of the test
- the source of error variance is content sampling
- utilizes the Spearman-Brown Formula
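As an illustration of split-half reliability and the Spearman-Brown correction named above, a minimal sketch (Python with numpy and scipy; the 0/1 item matrix is fabricated, rows = examinees, columns = items):

    import numpy as np
    from scipy.stats import pearsonr

    # rows = examinees, columns = dichotomously scored items (1 = correct)
    scores = np.array([
        [1, 1, 0, 1, 1, 0, 1, 1],
        [1, 0, 0, 1, 0, 0, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 1, 0, 0, 0, 1],
        [1, 1, 0, 0, 1, 1, 1, 0],
    ])

    odd_half  = scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7
    even_half = scores[:, 1::2].sum(axis=1)   # items 2, 4, 6, 8

    r_half, _ = pearsonr(odd_half, even_half)   # correlation between the two halves

    # Spearman-Brown correction: estimated reliability of the full-length test
    r_full = (2 * r_half) / (1 + r_half)
    print(round(r_half, 2), round(r_full, 2))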
D. Other Measures of Internal Consistency/Inter-Item Reliability
- source of error variance is content sampling and content heterogeneity
> KR-20 – for dichotomous items with varying levels of difficulty
> KR-21 – for dichotomous items with a uniform level of difficulty
> Cronbach Alpha/Coefficient Alpha – for non-dichotomous items (Likert or other multiple choice)
> Average Proportional Distance – focuses on the degree of difference that exists between item scores
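A minimal sketch of coefficient alpha (Python with numpy; the Likert-type responses are fabricated; for dichotomous 0/1 items this same formula reduces to KR-20):

    import numpy as np

    def cronbach_alpha(items):
        # items: 2-D array, rows = examinees, columns = items
        items = np.asarray(items, dtype=float)
        k = items.shape[1]                          # number of items
        item_vars = items.var(axis=0, ddof=1)       # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)   # variance of the total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    responses = np.array([
        [4, 5, 4, 4],
        [2, 3, 3, 2],
        [5, 5, 4, 5],
        [3, 3, 2, 3],
        [1, 2, 2, 1],
    ])
    print(round(cronbach_alpha(responses), 2))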
E. Inter-Rater/Inter-Observer Reliability
- degree of agreement between raters on a measure
- source of error variance is inter-scorer differences
- often utilizes Cohen's Kappa statistic
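For inter-rater agreement, a short sketch of Cohen's Kappa (Python with scikit-learn; the two rating vectors are hypothetical):

    from sklearn.metrics import cohen_kappa_score

    # categorical ratings of the same 10 responses by two independent scorers
    rater_1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    rater_2 = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

    kappa = cohen_kappa_score(rater_1, rater_2)   # chance-corrected agreement
    print(round(kappa, 2))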
4. Reliability Ranges
1 : perfect reliability (may indicate redundancy and homogeneity)
≥ 0.9 : excellent reliability (minimum acceptability for tests used for clinical diagnoses)
≥ 0.8 < 0.9 : good reliability
≥ 0.7 < 0.8 : acceptable reliability (minimum acceptability for psychometric tests)
≥ 0.6 < 0.7 : questionable reliability (but still acceptable for research purposes)
≥ 0.5 < 0.6 : poor reliability
< 0.5 : unacceptable reliability
0 : no reliability
5. Standard Error of Measurement
- an index of the amount of inconsistency or the amount of expected error in an individual's score
- the higher the reliability of the test, the lower the SEM
> Error – long-standing assumption that factors other than what a test attempts to measure will influence performance on the test
> Trait Error – sources of error that reside within the individual taking the test (e.g., didn't study enough, felt bad about a missed blind date, forgot to set the alarm)
> Method Error – sources of error that reside in the testing situation (e.g., lousy test instructions, a too-warm room, or missing pages)
> Confidence Interval – a range or band of test scores that is likely to contain the true score
> Standard Error of the Difference – a statistical measure that can aid a test user in determining how large a difference should be before it is considered statistically significant

6. Factors Affecting Test Reliability
a. Test Format
b. Test Difficulty
c. Test Objectivity
d. Test Administration
e. Test Scoring
f. Test Economy
g. Test Adequacy

7. What to do about low reliability?
- increase the number of items
- use factor analysis and item analysis
- use the correction for attenuation formula – a formula used to determine the exact correlation between two variables if the test is deemed affected by error
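To make the SEM, confidence interval, and correction-for-attenuation ideas above concrete, a hedged sketch (Python; the SD, reliability values, and observed score are invented):

    import math

    sd, reliability = 15, 0.91          # e.g., an IQ-style scale
    observed = 110

    sem = sd * math.sqrt(1 - reliability)                     # standard error of measurement
    ci_95 = (observed - 1.96 * sem, observed + 1.96 * sem)    # 95% confidence band for the true score

    # correction for attenuation: estimated correlation between two variables
    # once each test's unreliability is taken into account
    r_xy, r_xx, r_yy = 0.40, 0.70, 0.80
    r_corrected = r_xy / math.sqrt(r_xx * r_yy)

    print(round(sem, 2), [round(b, 1) for b in ci_95], round(r_corrected, 2))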
Validity
- a judgment or estimate of how well a test measures what it purports to measure in a particular context

1. Types of Validity

a. Face Validity
- the least stringent type of validity; whether a test looks valid to test users, examiners, and examinees
Examples:
✓ An IQ test containing items which measure memory, mathematical ability, verbal reasoning, and abstract reasoning has good face validity.
✓ An IQ test containing items which measure depression and anxiety has bad face validity.
✓ Inkblot tests have low face validity because test takers question whether the test really measures personality.
✓ A self-esteem rating scale which has items like "I know I can do what other people can do." and "I usually feel that I would fail on a task." has good face validity.

b. Content Validity
Definitions and concepts
✓ whether the test covers the behavior domain to be measured, which is built through the choice of appropriate content areas, questions, tasks, and items
✓ It is concerned with the extent to which the test is representative of a defined body of content consisting of topics and processes.
✓ Content validation is not done by statistical analysis but by the inspection of items. A panel of experts can review the test items and rate them in terms of how closely they match the objective or domain specification.
✓ This considers the adequacy of representation of the conceptual domain the test is designed to cover.
✓ If the test items adequately represent the domain of possible items for a variable, then the test has adequate content validity.
✓ Determination of content validity is often made by expert judgment.

c. Criterion-Related Validity
What is a criterion?
✓ a standard against which a test or a test score is evaluated
✓ A criterion can be a test score, psychiatric diagnosis, training cost, index of absenteeism, or amount of time.
✓ Characteristics of a criterion:
• Relevant
• Valid and Reliable
• Uncontaminated: criterion contamination occurs if the criterion is based on predictor measures, so that the predictor becomes part of what is supposed to be the criterion
Criterion-Related Validity Defined:
✓ indicates the test's effectiveness in estimating an individual's behavior in a particular situation
✓ Tells how well a test corresponds with a particular criterion.
✓ A judgment of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest
Types of Criterion-Related Validity:
✓ Concurrent Validity – the extent to which test scores may be used to estimate an individual's present standing on a criterion
✓ Predictive Validity – the scores on a test can predict future behavior or scores on another test taken in the future
✓ Incremental Validity – related to predictive validity; the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use

d. Construct Validity
What is a construct?
✓ An informed scientific idea developed or hypothesized to describe or explain a behavior; something built by mental synthesis.
✓ Unobservable, presupposed traits; something that the researcher thinks will have either a high or low correlation with other variables
Construct Validity defined
✓ A test designed to measure a construct must estimate the existence of an inferred, underlying characteristic based on a limited sample of behavior.
✓ Established through a series of activities in which a researcher simultaneously defines some construct and develops instrumentation to measure it.
✓ A judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct.
✓ Required when no criterion or universe of content is accepted as entirely adequate to define the quality being measured.
✓ Assembling evidence about what a test means.
✓ A series of statistical analyses showing that one variable is distinct from other variables.
✓ A test has good construct validity if there is an existing psychological theory which can support what the test items are measuring.
✓ Establishing construct validity involves both logical analysis and empirical data. (Example: In measuring aggression, you have to check all past research and theories to see how researchers measured that variable/construct.)
✓ Construct validity is like proving a theory through evidence and statistical analysis.
Evidences of Construct Validity
✓ Test is homogeneous, measuring a single construct.
• Subtest scores are correlated to the total test score.
• Coefficient alpha may be used as homogeneity evidence.
• Spearman Rho can be used to correlate an item to another item.
• Pearson or point-biserial can be used to correlate an item to the total test score (item-total correlation).
✓ Test scores increase or decrease as a function of age, the passage of time, or experimental manipulation.
• Some variables/constructs are expected to change with age.
✓ Pretest-posttest differences
• Differences in scores from pretest to posttest of a defined construct after careful manipulation would provide validity evidence.
✓ Test scores differ between groups.
• Also called the method of contrasted groups
• A t-test can be used to test the difference between groups.
✓ Test scores correlate with scores on other tests in accordance with what is predicted.
• Discriminant Validation
> Convergent Validity – a test correlates highly with other variables with which it should correlate (example: Extraversion, which is highly correlated with sociability)
> Divergent Validity – a test does not correlate significantly with variables from which it should differ (example: Optimism, which is negatively correlated with Pessimism)
• Factor Analysis (see the sketch below)
– a statistical technique for analyzing the interrelationships of behavior data
> Principal Components Analysis – a method of data reduction
> Common Factor Analysis – items do not make a factor; the factor should predict scores on the items; classified into two (Exploratory Factor Analysis for summarizing data and Confirmatory Factor Analysis for generalization of factors)
• Cross-Validation
- revalidation of the test against a criterion based on another group, different from the original group on which the test was validated
> Validity Shrinkage – decrease in validity after cross-validation
> Co-validation – validation of more than one test on the same group
> Co-norming – norming more than one test on the same group
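Since factor analysis (with its principal-components and exploratory/confirmatory variants) is cited above as construct-validity evidence, here is a small exploratory sketch (Python with scikit-learn; the item responses are synthetic and the two-factor model is only an assumption for illustration):

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(0)
    n = 200
    sociability = rng.normal(size=n)
    anxiety = rng.normal(size=n)
    # six items: three driven by a sociability factor, three by an anxiety factor, plus noise
    items = np.column_stack([
        sociability + rng.normal(scale=0.5, size=n),
        sociability + rng.normal(scale=0.5, size=n),
        sociability + rng.normal(scale=0.5, size=n),
        anxiety + rng.normal(scale=0.5, size=n),
        anxiety + rng.normal(scale=0.5, size=n),
        anxiety + rng.normal(scale=0.5, size=n),
    ])

    fa = FactorAnalysis(n_components=2).fit(items)
    loadings = fa.components_.T          # rows = items, columns = factors
    print(np.round(loadings, 2))         # each item should load mainly on its own factor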
2. Test Bias
- a factor inherent in a test that systematically prevents accurate, impartial measurement
✓ Rating Error
– a judgment resulting from the intentional or unintentional misuse of rating scales
• Severity Error/Strictness Error – a less than accurate rating or error in evaluation due to the rater's tendency to be overly critical
• Leniency Error/Generosity Error – a rating error that occurs as a result of a rater's tendency to be too forgiving and insufficiently critical
• Central Tendency Error – a type of rating error wherein the rater exhibits a general reluctance to issue ratings at either a positive or negative extreme, so all or most ratings cluster in the middle of the rating continuum
✓ Proximity Error – a rating error committed due to the proximity/similarity of the traits being rated
✓ Primacy Effect – a "first impression" affects the rating
✓ Contrast Effect – the prior subject of assessment affects the latter subject of assessment
✓ Recency Effect – the tendency to rate a person based on recent recollections about that person
✓ Halo Effect – a type of rating error wherein the rater views the object of the rating with extreme favour and tends to bestow ratings inflated in a positive direction
✓ Impression Management
✓ Acquiescence
✓ Non-acquiescence
✓ Faking-Good
✓ Faking-Bad

3. Test Fairness
- the extent to which a test is used in an impartial, just, and equitable way

4. Factors Influencing Test Validity
a. Appropriateness of the test
b. Directions/Instructions
c. Reading Comprehension Level
d. Item Difficulty
c. Norms
– designed as a reference for evaluating or interpreting individual test scores

1. Basic Concepts
a. Norm – behavior that is usual or typical for members of a group
b. Norms – reference scores against which an individual's scores are compared
c. Norming – the process of establishing test norms
d. Norman – the test developer who will use the norms

2. Establishing Norms
a. Target Population
b. Normative Sample
c. Norm Group
- Size
- Geographical Location
- Socioeconomic Level

3. Types of Norms
a. Developmental Norms
– Mental Age
* Basal Age
* Ceiling Age
* Partial Credits
– Intelligence Quotient
– Grade Equivalent Norms
– Ordinal Scales

test development
A. Standardization

1. When to decide to standardize a test?
a. No test exists for a particular purpose.
b. The existing tests for a certain purpose are not adequate for one reason or another.

2. Basic Premises of Standardization
- the independent variable is the individual being tested
- the dependent variable is his or her behavior
- behavior = person x situation
- in psychological testing, we make sure that it is the person factor that will "stand out" and the situation factor is controlled
- control of extraneous variables = standardization

3. What should be standardized?
a. Test Conditions
– there should be uniformity in the testing conditions
– physical condition
– motivational condition
b. Test Administration Procedure
– there should be uniformity in the instructions and administration proper. Test administration includes carefully following standard procedures so that the test is used in the manner specified by the test developers. The test administrator should ensure that test takers work within conditions that maximize the opportunity for optimum performance. As appropriate, test takers, parents, and organizations should be involved in the various aspects of the testing process.
– Sensitivity to Disabilities: try to help the disabled examinee overcome his or her disadvantage, such as by increasing voice volume, or refer to other available tests
– Desirable Procedures of Group Testing: be careful about time, clarity, physical conditions (illumination, temperature, humidity, writing surface, and noise), and guessing
c. Scoring
– there should be a consistent mechanism and procedure in scoring. Accurate measurement necessitates adequate procedures for scoring the responses of test takers. Scoring procedures should be audited as necessary to ensure consistency and accuracy of application.
d. Interpretation
– There should be common interpretations among similar results. Many factors can impact the valid and useful interpretation of test scores. These can be grouped into several categories including psychometric, test taker, and contextual, as well as others.
a. Psychometric Factors:
- factors such as the reliability, norms, standard error of measurement, and validity of the instrument are important when interpreting test results. Responsible test use considers these basic concepts and how each impacts the scores and hence the interpretation of the test results.
b. Test Taker Factors:
- factors such as the test taker's group membership, and how that membership may impact the results of the test, are critical in the interpretation of test results. Specifically, the test user should evaluate how the test taker's gender, age, ethnicity, race, socioeconomic status, marital status, and so forth, impact the individual's results.
c. Contextual Factors:
- the relationship of the test to the instructional program, opportunity to learn, quality of the educational program, work and home environment, and other factors that would assist in understanding the test results are useful in interpreting test results. For example, if the test does not align with curriculum standards and how those standards are taught in the classroom, the test results may not provide useful information.

4. Tasks of test developers to ensure uniformity of procedures in test administration:
– prepare a test manual containing the following:
i. Materials needed (test booklets & answer sheets)
ii. Time limits
iii. Oral instructions
iv. Demonstrations/examples
v. Ways of handling queries of examinees

5. Tasks of examiners/test users/psychometricians
– ensure that test user qualifications are strictly met (training in the selection, administration, scoring, and interpretation of tests as well as the required license)
– advance preparations
i. Familiarity with the test/s
ii. Familiarity with the testing procedure
iii. Familiarity with the instructions
iv. Preparation of test materials
v. Orient proctors (for group testing)
6. Standardization sample
– a random sample of test takers used to evaluate the performance of others
– considered a representative sample if the sample consists of individuals who are similar to the group to be tested

> Objectivity
1. Time-Limit Tasks
– every examinee gets the same amount of time for a given task
2. Work-Limit Tasks
– every examinee has to perform the same amount of work
3. Issue of Guessing

> Stages in Test Development
1. Test Conceptualization
– in creating a test plan, specify the following:
– Objective of the Test
– Clear definition of variables/constructs to be measured
– Target Population/Clientele
– Test Constraints and Conditions
– Content Specifications (Topics, Skills, Abilities)
– Scaling Method
✓ Comparative scaling
✓ Non-comparative scaling
– Test Format
✓ Stimulus (Interrogative, Declarative, Blanks, etc.)
✓ Mechanism of Response (Structured vs. Free)
✓ Multiple Choice
- more answer options (4-5) reduce the chance of guessing that an item is correct
- many items can aid in student comparison, reduce ambiguity, and increase reliability
- easy to score
- measures narrow facets of performance
- reading time is increased with more options
- transparent clues (e.g., verb tenses or letter use such as "a" or "an") may encourage guessing
- difficult to write four or five reasonable choices
- takes more time to write questions
- test takers can get some correct answers by guessing
✓ True or False
- ideally, a true/false question should be constructed so that an incorrect response indicates something about the student's misunderstanding of the learning objective
- this may be a difficult task, especially when constructing a true statement
2. Test Construction
– be mindful of the following test construction guidelines:
– Deal with only one central thought in each item
– Be precise
– Be brief
– Avoid awkward wordings or dangling constructs
– Avoid irrelevant information
– Present items in positive language
– Avoid double negatives
– Avoid terms like "all" and "none"
3. Test Tryout
4. Item Analysis (Factor Analysis for Typical-Performance Tests)
5. Test Revision

> Item Analysis
– measures and evaluates the quality and appropriateness of test questions
– how well the items could measure the ability/trait

1. Classical Test Theory
– these analyses are the easiest and the most widely used form of analyses
– often called the "true-score model," which involves the true-score formula: X = T + e
– assumes that a person's test score is comprised of their "true score" plus some measurement error (X = T + e)
– employs the following statistics:

a. Item difficulty
– the proportion of examinees who got the item correct
– the higher the item mean, the easier the item is for the group; the lower the item mean, the more difficult the item is for the group
– Formula: p = (number of examinees who answered the item correctly) / (total number of examinees)
– 0.00-0.20 : Very Difficult : Unacceptable
– 0.21-0.40 : Difficult : Acceptable
– 0.41-0.60 : Moderate : Highly Acceptable
– 0.61-0.80 : Easy : Acceptable
– 0.81-1.00 : Very Easy : Unacceptable
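A small sketch of the item-difficulty index just described (Python with numpy; the 0/1 response matrix is fabricated), interpreted against the ranges above:

    import numpy as np

    # rows = examinees, columns = items (1 = correct, 0 = incorrect)
    responses = np.array([
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 0, 0],
        [1, 1, 1, 1],
        [0, 1, 0, 1],
    ])

    # item difficulty p = proportion of examinees who answered each item correctly
    p = responses.mean(axis=0)
    print(np.round(p, 2))   # e.g., [0.8 0.8 0.2 0.8] -> item 3 falls in the "very difficult" band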
b. Item discrimination
– a measure of how well an item is able to distinguish between examinees who are knowledgeable and those who are not
– how well each item is related to the trait
– the discrimination index ranges between -1.00 and +1.00
– the closer the index is to +1, the more effectively the item distinguishes between the two groups of examinees
– the acceptable index is 0.30 and above
– Formula: D = (number of correct responses in the upper group - number of correct responses in the lower group) / number of examinees in each group; equivalently, D = p(upper) - p(lower), the difference between the proportions of the upper- and lower-scoring groups answering the item correctly
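A hedged sketch of the upper-lower discrimination index as reconstructed above (Python with numpy; the response matrix is fabricated, and using top/bottom halves rather than 27% extreme groups is an illustrative simplification):

    import numpy as np

    # rows = examinees, columns = items (1 = correct); total scores define the groups
    responses = np.array([
        [1, 1, 1, 1],
        [1, 1, 0, 1],
        [1, 0, 1, 1],
        [0, 1, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 0, 0],
    ])

    totals = responses.sum(axis=1)
    order = np.argsort(totals)
    half = len(order) // 2
    lower = responses[order[:half]]      # lowest-scoring examinees
    upper = responses[order[-half:]]     # highest-scoring examinees

    # discrimination index D = p(upper) - p(lower) for each item
    d = upper.mean(axis=0) - lower.mean(axis=0)
    print(np.round(d, 2))                # values of 0.30 and above are conventionally acceptable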