
Psychological testing

History of psychological measurement


1. 206 BC to 220 AD: Chinese officials were examined every third year to determine their fitness for public office. During the Han Dynasty, the use of test batteries became common for civil law, military affairs and other domains.
2. 1368 AD to 1644 AD: National multistage tests were used for finalizing candidates for civil offices.
3. 1809: Gauss develops the theory of measurement error (the difference between the true value and the measured value).
4. 1855: The British government adopted the Chinese testing system for selecting candidates for the civil services.
5. 1883: The USA established the American Civil Service Commission to conduct exams for government jobs.
6. 1850s-1900s: psychological measurement was influenced by 2 major developments:
7. 1860: Psychophysics was founded by Fechner. It studies the relationship between physical stimuli, such as the intensity of light or sound, and the sensations produced by these stimuli.
8. 1859: Darwin's theory established the grounds for the argument that members of the same species are not identical; rather, individual differences exist between them.
9. Sir Francis Galton: a younger cousin of Darwin, he picked up on this idea and started studying human ability. He is often referred to as the father of mental testing.
10. James McKeen Cattell: a student of Galton who carried on his tradition of testing individual differences. He was the first titled professor of psychology and coined the term "mental test".
11. 1904: Charles Spearman started working on the two-factor theory of intelligence; in 1896 Karl Pearson had developed the product-moment correlation.
12. 1904: Alfred Binet was appointed to the French educational commission to devise a diagnostic system for identifying intellectually deficient children for school placement purposes.
13. 1905: Binet-Simon intelligence test.
14. 1912: William Stern revised the Binet-Simon test and introduced the IQ.
15. 1916: Lewis Terman revised the Binet-Simon intelligence test. The extension of the Binet-Simon scale was published; it included detailed instructions for administration and scoring, and over one-third of the items were new.
16. 1917: Yerkes, of the APA, developed the Army Alpha (a verbal test for literate personnel) and the Army Beta (a performance test for illiterate draftees).
17. 1918: Woodworth developed the Personal Data Sheet, the first objective measure of personality, to identify emotionally unstable military personnel.
18. 1921: Inkblot technique published by Rorschach.
19. 1926: Goodenough published the Draw-a-Man test.
20. 1929: L. L. Thurstone proposed methods of scaling for measuring attitudes and values.
21. 1935: Murray and Morgan developed the TAT.
22. 1939: Wechsler introduced the Wechsler-Bellevue Intelligence Scale, which was designed to measure adult intelligence.
23. 1940: Hathaway and McKinley published the MMPI to measure adult personality and psychopathology.
24. 1949: Wechsler published the WISC.

Properties of scales of measurement


1. Three important properties make scales different from one another: magnitude, equal intervals and absolute zero.
2. Magnitude is defined as the property of "moreness". A scale is said to have the property of magnitude if it can be said that a particular instance of the attribute represents more, less or equal amounts of the given quantity than does another instance.
3. Equal intervals: a scale has this property if the difference between any 2 points at any place on the scale has the same meaning as the difference between 2 other points that differ by the same number of scale points. In psychological tests, however, the property of equal intervals rarely holds: 10 points at one level of the scale do not mean the same thing as 10 points at another level.
4. Absolute zero: for psychological measurements it is very difficult, if not impossible, to define an absolute zero, so an arbitrary zero is used instead.

Functions of measurement
1. Measurement has varied functions.
2. In selection: selection of personnel in industry or other institutions may be carried out with the
help of psychologists. The function is to predict the ability of an individual.
3. In classification: measurement helps the teacher to classify the children as low or high
achievers; retarded, average and gifted.
4. In comparison: the pioneering work of Galton and Darwin has revealed that no 2 individuals are
alike. Whenever 2 persons are to be compared on any of the factors, measurement comes into
use.
5. In guidance and counselling: measurement helps an individual know his strengths and weaknesses, and may provide insight and understanding into the relationship between the counsellor and the client. It can predict problems of adjustment likely to come up in the future, and also diagnose mental disabilities, aberrations and deficiencies.
6. In research: measurement helps in research activities, undertaken to discover new facts about a
problem. The effect of one variable or a set of variables is studied while the effects of all other
variables are controlled.
7. In improving classroom instruction: by measuring the outcomes of instruction, it becomes clear how many students are benefiting from instruction and how many are not, or are benefiting only minimally. After such an analysis, some essential suggestions for modification of the instruction can be made.

Distinction between psychological measurement and physical measurement


1. Measurement has 2 broad dimensions, psychological or qualitative and physical or quantitative.
2. Psychological comprises the measurement of mental processes, traits, habits, tendencies and
likes of an individual whereas physical measurement comprises the measurement of objects,
things etc, which are often physically present in the world.
Physical vs psychological measurement:

1. Physical: The unit of measurement is fixed and constant throughout the measurement. Psychological: There is no fixed unit of measurement; for example, some may measure intelligence in terms of verbal questions or items answered in a specified time.

2. Physical: There is a true zero point, a point which represents the underlying absence of the trait being measured. Psychological: There is an arbitrary zero point, which does not represent the underlying absence of the trait being measured.

3. Physical: Is more accurate and predictable. Psychological: The zero point is itself not known, so no prediction can be made with definite accuracy.

4. Physical: Direct. Psychological: Indirect; a trait such as extraversion cannot be measured directly.

5. Physical: The entire quantity can be measured. Psychological: The entire quantity cannot be measured; only a sample representing that quantity or trait can be measured.

Problems related to the measurement process


1. The researcher encounters problems in selecting the attributes of interest and in defining them clearly and unequivocally. In defining concepts like intelligence, anxiety, adjustment, cooperativeness and so on, we expect diversity in definitions.
2. The second problem relates to devising procedures for eliciting the relevant attributes.
3. The third problem relates to the equality of units. In intelligence tests, for instance, items of analogy such as "hot is to cold as wet is to ...?" are not equal to items of arithmetic series such as "2, 8, 13, 17, ...?". Such inequality creates a problem in the measurement process: because of it, addition, subtraction and comparison of scores remain suspect.
4. Indirectness of measurement: most psychological and educational measurements are indirect.
This is because most psychological and educational variables cannot be observed and studied
directly.
5. Incompleteness of measurement: measures are incomplete and therefore the measurement of
any psychological or educational variable is also incomplete.
6. Relativity of measurement: psychological and educational measurements are relative.
7. Errors in measurement: measurement in the physical sciences as well as in the behavioral sciences is rarely pure. It contains some uncontrolled factors which produce gross errors. This is also true of sociological measurement.

Difference between testing, assessment and measurement


1. Assessment is the general term that includes any of a variety of procedures used to obtain information about performance. It includes paper-and-pencil tests as well as extended responses and performance of authentic tasks. The basic goal is to evaluate a person in terms of current and future functioning. Behaviors are classified into different categories and measured against normative standards.
2. Testing is done through a systematic procedure for measuring a sample of behavior by putting a set of questions in a uniform manner. Such a systematic procedure is called a test. Not all assessment techniques are tests; an assessment technique is called a test only when its procedures for administration, scoring and interpretation are standardised.
3. Measurement is only a process of obtaining a numerical description of the degree to which an
individual possesses a particular characteristic. This numerical description is done according to
some rules.

Classification on the basis of purpose or objective


1. Tests are usually classified as:
2. Intelligence tests: assess the intelligence of examinees.
3. Aptitude tests: assess the potentials or aptitudes of the person.
4. Personality tests: assess traits, adjustment, values etc. of the person.
5. Neuropsychological tests: tests used in the assessment of persons with known or suspected brain dysfunction.
6. Achievement tests: assess what persons have acquired in a given area as a function of some training or learning.

Classification on the criterion of standardization


1. Tests are classified into standardized tests and teacher-made tests.
2. Standardized tests have been subjected to the procedure of standardization:
3. There must be a standard manner of giving instructions.
4. Uniformity in scoring and an index of fairness of correct answers, obtained through the procedure of item analysis, should be available.
5. Reliability and validity must be established.
6. The test should have norms.
7. Teacher-made tests are constructed by teachers for use largely within their classrooms. The effectiveness of such tests depends upon the skill of the teacher and his knowledge of test construction. Items may come from any area of the curriculum, and may be modified according to the will of the teacher.

Methods of test construction


Step 1: defining the test universe, audience and purpose
1. The testing universe refers to the defining behavior that the test aims to measure, or the body of knowledge that the test represents.
2. Defining the audience means specifying the characteristics of the people who will take the test. The purpose of the test explains how test users can use the test scores.

Step 2: Planning of the test


1. Very careful planning is required. All planning begins with a purpose for the test. It is important to specify the broad and specific objectives of the test.
2. The developer also develops an operational definition of the constructs that the test will measure, which mainly involves a rigorous review of literature.

Step 3: writing items of the test


● Item writing is the preparation of the test itself.
● An item is a single question or task that is usually not broken down into any smaller units.
● According to Asthana B (1991), suggestions for item writing are:
● The number of items in the preliminary draft should be larger than in the final draft.
● Items should be clearly phrased so that their content, and not their form, determines the purpose.
● Items should be comprehensive enough to test the content area.
● No item should be such that it could be answered by referring to any other item or group of items.
● Each item carries equal marks.
● The wording should be such that the whole content determines the answer and not just a part of it.
● There are 2 types of items:
● Essay items require the examinees to rely upon their memory and past associations to answer the questions. They include both short-answer and long-answer types.
● Objective items are those that allow only 1 fixed correct answer.
● Objective items are further divided into 2 types:
● Supply type: an item that requires the examinee to write down the correct answer on his own.
● Selection type: an item that requires the examinee to mark the correct answer from a given set.
● In MCQs, the item consists of a problem, known as the stem, and a list of suggested solutions, known as alternatives.
● The alternatives consist of one correct or best alternative, which is the answer, and incorrect or inferior alternatives known as distracters.
● Items shouldn't be too easy or too difficult.
● The more options an item has, the lower the chance of guessing the right answer.
● In dichotomous items, like true/false, the effect of guessing is higher.
● In the forced-choice question format, usually used in scales, the chance of guessing the right answer is lower.
● To deal with social desirability influencing the examinee, one approach is ipsative measurement, also known as forced-choice testing. The item is forced-choice in such a way that it compares the person with himself, by offering options of equal desirability.
● Acquiescence, the tendency to agree with the idea presented, can be controlled by using reverse scoring (see the sketch after this list).
● While composing the test, it is preferred that about two and a half times as many items as needed are developed.
● Item evaluation: the process of judging the adequacy of test items to fulfill the designated purpose of the test.
● It takes 2 forms:
● Subjective judgement: may be made by a test specialist, by a subject-matter specialist, or by both. The test specialists look for ambiguities of wording, special clues that lead to the answer, or item characteristics unrelated to the skill, ability or other attribute that the item is supposed to measure, which may influence the way an examinee answers the question.
● Statistical judgement: in statistical evaluation, the difficulty level and discriminating power of the items are calculated; this is done through the process of item analysis.
● The draft is then given to participants for further evaluation.
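A minimal sketch of reverse scoring on a 5-point Likert scale, assuming (hypothetically) that items 2 and 4 are negatively worded:

```python
# Reverse-scoring sketch for a 5-point Likert scale (1-5).
# Items 2 and 4 are assumed negatively worded (hypothetical layout).
NEGATIVE_ITEMS = {2, 4}
SCALE_MAX = 5

def score_respondent(raw_answers):
    """Reverse negatively worded items so a high total always means
    more of the trait, countering acquiescence bias."""
    total = 0
    for item_number, answer in enumerate(raw_answers, start=1):
        if item_number in NEGATIVE_ITEMS:
            answer = SCALE_MAX + 1 - answer  # 1 <-> 5, 2 <-> 4, 3 stays 3
        total += answer
    return total

print(score_respondent([4, 2, 5, 1, 3]))  # items 2 and 4 are flipped before summing
```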

Step 4: preliminary administration of the test and pilot testing


● The first draft should be administered to a representative sample.
● Preliminary administration of the test should be conducted at least 3 times (triangulation).
● The main purposes are:
● Finding out major weaknesses, omissions, ambiguities and inadequacies of the items.
● Determining the difficulty value of each item for final selection.
● Determining the validity of each individual item.
● Determining the apt length of the test.
● Scoring keys should then be prepared.
● Then item analysis is undertaken.
● Pilot testing is done in 2 phases: the number of participants depends upon the size and complexity of the target audience. The setting of the pilot testing should mirror the planned test setting.

Step 5: Item analysis


● Item difficulty is computed only in test construction, not in scale construction, because scale items do not measure success or failure on an item but the degree to which a trait is present.
● D = R/N, where D is the difficulty index, R is the number of examinees answering the item correctly, and N is the total number of examinees (see the sketch after this list).
● Item discrimination is the ability of an item to discriminate between people of high and low ability in tests, and between high-trait and low-trait people in scales.
● V = D_U - D_L, where D_U and D_L are the difficulty values of the upper group and the lower group.
● Item reliability is based on inter-item correlation. In scales the dichotomy is forced (artificial), while in tests it is a true dichotomy; in scales, the correlation is found by tetrachoric, rank-order or Pearson product-moment methods.
● Item validity is the correlation of an item with the test scores. It is found by the biserial, point-biserial or Pearson product-moment formulas.
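A minimal sketch of these indices on a toy 0/1-scored response matrix (rows are examinees, columns are items; all data hypothetical):

```python
import numpy as np

def item_analysis(responses, item):
    """Difficulty, discrimination and item-total correlation for one item
    of a 0/1 response matrix (rows = examinees, columns = items)."""
    scores = responses.sum(axis=1)           # total test scores
    correct = responses[:, item]

    # Difficulty index D = R / N
    difficulty = correct.sum() / len(correct)

    # Discrimination V = D_U - D_L, using Kelley's 27% extreme groups
    order = np.argsort(scores)
    k = max(1, int(round(0.27 * len(scores))))
    lower, upper = order[:k], order[-k:]
    discrimination = correct[upper].mean() - correct[lower].mean()

    # Point-biserial item-total correlation (item vs. remainder of test)
    remainder = scores - correct
    point_biserial = np.corrcoef(correct, remainder)[0, 1]
    return difficulty, discrimination, point_biserial

# toy example: 6 examinees, 4 items
data = np.array([[1, 1, 1, 0],
                 [1, 1, 0, 0],
                 [1, 0, 1, 1],
                 [0, 1, 0, 0],
                 [1, 0, 0, 0],
                 [0, 0, 0, 0]])
print(item_analysis(data, item=0))
```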

Step 6: Final drafting of the test


● After item analysis, items with good discriminating value may be taken into the final draft, and the other items may be eliminated.
● The time limit of the test is decided on the basis of the preliminary administrations.

Step 7: Standardization
● Establishing the validity, reliability and norms of the test.
● Also establishing cross-validation and co-validation.
● Cross-validation: revalidation of a test on a sample of test takers other than those on whom the test was originally found to be a valid predictor of some criterion.
● Co-validation and co-norming: a test validation process conducted on 2 or more tests using the same sample of test takers. It is a highly beneficial process for the publisher, the test user and the test takers.
Step 8: test manual and publication of the test
● Preparation of a manual for the test, which includes the psychometric properties of the test, norms and references.
● The manual should offer clear details regarding the procedures of the test, administration and scoring methods, and time limits.
● The manual also includes the standardization details of the test.

Item analysis
● Item analysis is carried out after the items are written, reviewed and carefully edited.
● It uses statistics and expert judgement to evaluate tests based on the quality of individual items, item sets and the entire set of items, as well as the relationship of each item to the other items.
● It investigates the performance of items considered individually, either in relation to some external criterion or in relation to the remaining items on the test.
● Validity is tested on a group of examinees whose performance on an individual item is compared to their performance on the whole test.
● Item analysis provides an estimate of the validity of each item.
● The concepts are similar for norm-referenced and criterion-referenced tests but differ in specific and significant ways.
● It gives 2 kinds of information: the difficulty index of the item, and the index of validity or discriminative power of the statement.

Item difficulty
● The percentage of people who answer an item correctly.
● The relative frequency with which the examinees choose the correct response.
● It ranges from 0 to as high as +1.
● Higher difficulty indexes indicate easier items; an item answered correctly by 75% of examinees has an item difficulty of 0.75.
Item discrimination
● Indicates how adequately an item separates or discriminates between high scorers and low scorers on an entire test.
● Also known as the validity index.
● Marshall and Hales (1972) said that discriminatory power, or validity, indicates the extent to which success and failure on that item indicate the possession of the trait or achievement being measured.

Items can be divided into 3 types:
● Positively discriminating item: an item in which the proportion or percentage of correct answers is higher in the upper group.
● Negatively discriminating item: an item in which the proportion or percentage of correct answers is lower in the upper group.
● Non-discriminating item: an item in which the proportion or percentage of correct answers is equal or approximately equal in both the upper and lower groups.
● The discrimination index is based on the number of high scorers and low scorers who answer an item correctly.
● The higher the discrimination index, the better the item, because high values indicate that the item discriminates in favour of the upper group, which should answer more items correctly. If more low scorers answer the item correctly, it will have a negative value and is probably flawed.
● Good items have a discrimination index of 0.40 or higher, reasonably good items 0.30-0.39, marginal items 0.20-0.29, and poor items less than 0.20.
● There are 2 ways of determining the index of discrimination:
● By applying a test of significance of the difference between 2 proportions or percentages: arrange the total scores from highest to lowest, then select the top 27% and bottom 27% as the upper and lower groups. This is Kelley's 27% extreme-group method.
● By applying correlational techniques: the correlation of each item with the combination (sum or average) of all the remaining items, not counting that one. The larger the item-remainder coefficient, the more the item in question relates to the remaining items.
● Kingston and Kramer recommended an item-remainder coefficient greater than 0.30 as the selection/retention criterion for an item-total correlation.
● The item-total correlation is the correlation between the question score and the overall assessment score.
● Among the statistics used to determine whether a test item is valid and reliable are the point-biserial correlation and the p-value.
● Point-biserial correlation: a statistical measure used to assess the relationship between a continuous variable (typically a test score) and a dichotomous variable (usually representing correct or incorrect responses on a test item).
● P-value: provides the proportion of students who got the item correct, and is a proxy for item difficulty or, more precisely, item easiness.
● The statistic ranges from 0 to 1.
● Problematic items may still show high p-values, so a high p-value should not be taken as a sign of item quality.
● The item characteristic curve (ICC) is a graphical representation of the probability of giving the correct answer to an item as a function of the level of the underlying characteristic (ability) of the examinee assessed by the test (see the sketch after this list).
● The ICC is the basic building block of item response theory; it is bounded between 0 and 1.
● It is monotonically increasing.
● It is commonly assumed to take the shape of a logistic function.
● Each item has its own ICC.
● ICCs illustrate both discriminating power and item difficulty:
● The steepness or slope conveys information about the discriminatory power of the item.
● The position of the curve gives an indication of the difficulty of the item.
● For difficult items, the ICC starts to rise on the right-hand side of the plot.
● For easier items, the ICC starts to rise on the left-hand side of the plot.
● Analyzing the distractors (incorrect alternatives) is useful in determining their relative usefulness in each item. Items should be modified if students consistently fail to select certain multiple-choice alternatives.
● Alternatives that are totally implausible are of little use as decoys in multiple-choice items.
● A discrimination index or discrimination coefficient should be obtained for each option in order to determine each distractor's usefulness: whereas the discrimination value of the correct answer should be positive, the discrimination values of the distractors should be lower, and preferably negative.
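A minimal sketch of a logistic ICC under the common two-parameter model; the discrimination (a) and difficulty (b) values are hypothetical:

```python
import numpy as np

def icc(theta, a, b):
    """Two-parameter logistic ICC: probability of a correct response
    at ability level theta, with discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)                 # ability levels to evaluate
print(icc(theta, a=2.0, b=0.0).round(3))      # steep slope: strong discrimination
print(icc(theta, a=2.0, b=1.5).round(3))      # harder item: curve rises on the right
print(icc(theta, a=0.5, b=0.0).round(3))      # flat slope: weak discrimination
```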
Test standardization
Reliability
● Reliability describes how replicable or repeatable a measure is, while validity describes how accurately it measures what it intends to measure.
● A reliability coefficient is an index of reliability: a proportion that indicates the ratio between the true score variance on a test and the total variance (see the worked example below).
● Reliability is a property of scores.
● Based on classical test theory (CTT), any obtained score is divided between true score and error.
● The total variance of the test is likewise divided into components: true variance and error variance.
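A worked illustration of that ratio with made-up variance figures:

```python
# Reliability as the ratio of true-score variance to total variance.
true_variance = 40.0    # hypothetical true-score variance
error_variance = 10.0   # hypothetical error variance
total_variance = true_variance + error_variance

reliability = true_variance / total_variance  # r_xx = 40 / 50 = 0.80
print(reliability)
```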

Types of reliability

1. External: Reliability over time and consistency across raters is known as external reliability. It refers to the extent to which a measure varies from one use to another.

Test-retest: An estimate of reliability obtained by correlating pairs of scores from the same people on 2 different administrations of the same test. It is an apt measure when we wish to measure something that is stable over time; if the characteristic fluctuates, the estimate is not meaningful. As the time interval between the two testings increases, the correlation between the 2 scores decreases. When the time interval between testings is greater than 6 months, the estimate of test-retest reliability is often referred to as the coefficient of stability. Disadvantage: results take a long time to obtain. (See the sketch after this table.)

Parallel forms: Two different tests designed in the same way are administered, and the outcomes are correlated through statistical measures to determine the reliability. The sources of error variance are content sampling, time sampling and content heterogeneity. When the second form is administered immediately, this is called parallel-form (immediate) reliability; when administered after a gap, parallel-form (delayed) reliability. The correlation between the 2 parallel forms (Pearson's r) is the estimate of reliability, and the degree of relationship is called the coefficient of equivalence. Because the same test is not repeated, memory, practice, carryover and recall effects are minimized and do not affect the scores. It is, however, difficult to construct 2 parallel forms of a test, especially in situations like the Rorschach, and scores on the 2nd form of the test are generally high.

Inter-rater: Inter-rater reliability is the extent to which the observations made by different observers are consistent.
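A minimal sketch, with made-up scores, of the test-retest coefficient as a Pearson correlation:

```python
import numpy as np

# Hypothetical scores for the same five examinees on two administrations:
first = np.array([12, 18, 25, 9, 30])
second = np.array([14, 17, 27, 11, 29])

# Test-retest reliability is the Pearson correlation between the two sets.
r_test_retest = np.corrcoef(first, second)[0, 1]
print(round(r_test_retest, 3))
```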

2. Internal: Consistency across items; assesses the consistency of results across items within a test.

Split-half reliability: An improvement over the earlier 2 methods, involving both the characteristics of stability and equivalence. This method provides the internal consistency of test scores. All the items of the test are generally arranged in increasing order of difficulty and administered once to the sample. After administering the test, it is divided into 2 comparable, similar or equal halves or parts, and the Spearman-Brown prophecy formula is used (see the sketch after this table). When calculated using the Rulon-Guttman formula, the variance of the differences between each person's scores on the 2 half-tests and the variance of total scores are considered. The Flanagan formula is very close to Rulon's formula; in this formula, the variances of the 2 halves are added instead of taking the difference between the 2 halves. Chance errors may affect the scores on the 2 halves in the same way and thus tend to make the reliability coefficient too high. The method cannot be used for estimating the reliability of speed tests or heterogeneous tests.
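A minimal sketch of an odd-even split with the Spearman-Brown correction, on a hypothetical 0/1 response matrix:

```python
import numpy as np

def split_half_reliability(responses):
    """Odd-even split of a 0/1 response matrix (rows = examinees,
    columns = items), corrected by the Spearman-Brown prophecy formula."""
    odd = responses[:, 0::2].sum(axis=1)     # scores on odd-numbered items
    even = responses[:, 1::2].sum(axis=1)    # scores on even-numbered items
    r_half = np.corrcoef(odd, even)[0, 1]    # half-test correlation
    # Spearman-Brown correction for the full-length test:
    return 2 * r_half / (1 + r_half)

data = np.array([[1, 1, 1, 0, 1, 0],
                 [1, 0, 1, 1, 1, 1],
                 [0, 1, 0, 0, 1, 0],
                 [1, 1, 1, 1, 1, 1],
                 [0, 0, 1, 0, 0, 0]])
print(round(split_half_reliability(data), 3))
```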

Stratified coefficient alpha


1. A reliability estimate that takes account of stratification. Tests may contain different types of items that can be categorized. When components (items or subsets) fall within categories or strata, we might view the composite as the result of stratified random sampling of subsets or items.
2. With such stratification, we would expect items or subsets within a stratum to correlate more highly with each other than with items or subsets in other strata. When the stratification holds up and correlations within strata are higher than those between strata, coefficient alpha will be smaller than stratified alpha.

Cronbach’s alpha
1. Commonly used as a measure of internal consistency or reliability.
2. Developed by Lee Cronbach in 1951.
3. An extension of the Kuder-Richardson formula (KR-20).
4. The method uses the variances of the individual items and of the total scores to work out the reliability (see the sketch after this list).
5. It is the average of all possible split-half coefficients. It varies from 0 to 1, and a value of 0.6 or less generally indicates unsatisfactory internal consistency reliability.
6. Its value tends to increase with an increase in the number of scale items.
7. Coefficient alpha may therefore be artificially and inaptly inflated by including more items.
8. Another coefficient is coefficient beta.
9. Coefficient beta assists in determining whether the averaging process used in calculating coefficient alpha is masking any inconsistent items.
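A minimal sketch of coefficient alpha computed from hypothetical Likert-type data (the matrix layout and numbers are made up):

```python
import numpy as np

def cronbach_alpha(responses):
    """Cronbach's alpha for an examinee x item score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / total-score variance)."""
    k = responses.shape[1]                         # number of items
    item_vars = responses.var(axis=0, ddof=1)      # variance of each item
    total_var = responses.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# toy Likert-type data: 5 respondents, 4 items
data = np.array([[4, 5, 4, 5],
                 [3, 3, 4, 3],
                 [5, 5, 5, 4],
                 [2, 2, 3, 2],
                 [4, 4, 4, 5]])
print(round(cronbach_alpha(data), 3))
```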

Factors influencing reliability of test scores

Intrinsic factors (those that lie within the test):

Length of the test: The more items the test contains, the greater its reliability, and vice versa. The more samples of items we take of a given area of knowledge or skill, the more reliable the test will be.

Homogeneity of items: Has 2 aspects, item reliability and the homogeneity of the traits measured from one item to another. If the items measure different functions and the inter-correlations of items are zero or near zero, then reliability is zero or very low, and vice versa.

Difficulty value of items: The difficulty level and the clarity of expression of a test item also affect the reliability of test scores. If the test items are too easy or too difficult for the group members, the test will tend to produce scores of low reliability.

Discriminative value: When items discriminate well between superior and inferior examinees, the item-total correlation is high and the reliability is also likely to be high, and vice versa.

Test instructions: Clear and concise instructions increase reliability.

Item selection: If there are too many interdependent items in a test, the reliability is found to be low.

Reliability of the scorer: If the scorer is moody or fluctuating, the scores will vary from one situation to another.

Extrinsic factors (those that remain outside the test itself):

Group variability: When the group of pupils being tested is homogeneous in ability, the reliability of test scores is likely to be low, and vice versa.

Guessing and chance errors: Guessing gives rise to increased error variance and as such reduces reliability.

Environmental conditions: The testing environment should be uniform, otherwise reliability will be affected.

Momentary fluctuations: May raise or lower reliability. A broken pencil, anxiety regarding non-completion of homework, and similar momentary factors may affect the reliability.

Poor instructions: Test takers may not be able to provide accurate answers if they don't understand how to take the test or if they don't understand some of the specific test questions.

Test difficulty: If the test is too difficult, test takers may not be able to provide all the answers.

Objective scoring: The test has to be scored accurately and without bias.

Errors in reliability
There is always a chance of 5% error in reliability, which is considered acceptable.

Types of errors
1. Random error: exists in every measurement and is often a major source of uncertainty. Random errors have no particular assignable cause and can never be totally eliminated or corrected. They are caused by uncontrollable variables and are an inevitable part of every analysis made by human beings. Even if we identify some of them, they cannot be measured, because most often they are very small.
2. Systematic error: caused by instruments, machines and measuring tools. It is not due to individuals.

Validity
1. Validity is the extent to which the test measures what it is supposed to measure.
2. There are 4 categories.
3. In 1966, the American Psychological Association combined predictive and concurrent validity into a single grouping called criterion validity.

Types of validity

Internal validity: Looks to see if there are any methodology issues and if the study is structurally sound; it looks at the structure of the study or test. If the test is internally valid, it has a clear cause-and-effect relationship and there is little chance of any alternative explanation. The researchers must remove any error from the trials they conduct and anything that could contaminate the findings. Techniques used to ensure internal validity include blinding, random selection, experimental manipulation and a study protocol.

External validity: The extent to which a test reflects the truth of a population, which means a valid test will apply to the real world. It relies on a test's ability to be replicated across different settings, people and even time periods. If a test only works in one setting or with specific groups of people, it cannot be deemed externally valid. To ensure external validity, calibrate the test as needed, replicate the test in various environments with various participants, and ensure participants act as closely as possible to their normal behaviors during testing.

1. Face validity: The extent to which the test seems relevant, important and interesting. It is the least rigorous measure of validity and pertains only to whether the test looks valid. Methods to measure it include polling participants, follow-up questionnaires, etc.

2. Content validity: The degree to which a test matches a curriculum and accurately measures the specific training objectives on which a test program is based. It uses the judgement of qualified experts to determine if a test is accurate, appropriate and fair.

3. Criterion validity: Measures how well a test compares with an external criterion. It reflects whether a scale performs as expected in relation to other selected variables (criterion variables) chosen as meaningful criteria.

Based on the time period involved, criterion validity can take 2 forms:

Predictive validity: the correlation between a predictor and a criterion obtained at a later time. It is concerned with how well a scale can forecast a future criterion. The researcher collects data on the scale at one point in time and data on the criterion variables at a future time.

Concurrent validity: the correlation between a predictor and a criterion at the same point in time. It is assessed when the data on the scale being evaluated and on the criterion variables are collected at the same time.
4. Construct validity: Looks for accuracy and defines how well a test or tool measures the thing it aims to measure. It addresses the question of what construct or characteristic the scale is measuring. Construct validity requires a sound theory of the nature of the construct being measured and how it relates to other constructs. It is the most sophisticated and difficult type of validity.

It includes convergent, discriminant and nomological validity:

Convergent validity: the extent to which the scale correlates positively with other measurements of the same construct. It is not necessary to obtain all these measurements using conventional scaling techniques. If there is high correlation between the scores, convergent validity is established.

Discriminant validity: also known as divergent validity, the extent to which a measure does not correlate with other constructs from which it is supposed to differ. It involves a lack of correlation among differing constructs. If low correlation is found, the new test has discriminant validity.

Nomological validity: the extent to which the scale correlates in theoretically predicted ways with measures of different but related constructs. For example, a researcher might seek to provide such evidence of construct validity for a multi-item scale designed to measure the concept of self-image.

Construct validity can be estimated by the following 2 methods:

Internal consistency: tests falling under the personality domain are validated by this method. The essential criterion in this method is the total score on the test itself. It verifies that a particular item or section individually measures the same characteristic that the test as a whole measures. This is done, first, by comparing the performance of the upper criterion group with that of the lower criterion group, and second, by correlating subtest scores with the total score; any subtest having a low correlation is eliminated.

Factorial validity: a refined statistical technique for analyzing the interrelationships of behavioral data. The factorial validity of a test is the correlation between the test and the factor common to a group of tests.

1. Validity is a relative term, since a test is valid only for a particular purpose; used elsewhere, it becomes invalid. Validity is not a fixed property of the test, because validation is an unending process that needs to be revised with the discovery of new concepts and new meanings.
2. Validity is a matter of degree and not an all-or-none property.
3. Validity is a unitary concept: it does not come in various types but in various aspects, such as content-related, criterion-related or construct-related evidence.

Content or curricular validity.


1. It requires both item validity and sampling validity.
2. Item validity is concerned with whether the test items represent measurement in the intended
content area.
3. Sampling validity is concerned with the extent to which the test samples the total content area.
4. Content validity is examined in 2 ways:
5. Expert judgement and statistical analysis

Factors that affect internal validity:


1. Internal validity is affected by flaws within the study itself, such as not controlling some of the major variables (a design problem) or problems with the research instrument (a data collection problem).
2. Subject variability
3. Size of subject population
4. Time given
5. History
6. Attrition
7. Maturation
8. instrument/task sensitivity

Factors that affect external validity


1. Extent to which you can generalize your findings to a larger group or other contexts
2. Population characteristics (subjects)
3. Interaction of subject selection and research
4. Descriptive explicitness of the independent variables
5. Effect of research env
6. Researcher or experimenter effects
7. Data collection methodology
8. Effect of time

Content V/s construct


Construct validity focuses on whether a test measures what it claims to measure, while content validity assesses how well the test covers what it claims to cover. Both are necessary.

Reliability V/s validity


1. A heterogeneous test tends to have low reliability but can have high validity.
2. Maximum validity requires items differing in difficulty and low inter-correlations among items.
3. The validity of a test may not be higher than the reliability index.
4. A valid test is always reliable: if a test truthfully measures what it purports to measure, it is both valid and reliable.
5. A reliable test may not be valid: a test may be reliable but poor on validity.

Norms
1. A norm is an average or typical score on a particular test obtained by a defined set/group of individuals.
2. Norms are based on the distribution of scores obtained by the people in the standardization group.
3. The sample should be large enough to provide stable values.
4. The sample must be representative of the population under consideration.
5. Scores on psychological tests are most commonly interpreted by reference to norms, which represent the test performance of the standardization sample.
6. Norms indicate an individual's relative position in the normative sample.
7. They provide comparable measures which permit direct comparison of performance on different tests.
8. Norm referencing is when the raw score is compared with the scores of a specific group of examinees on the same test; each examinee is thus compared with a norm.
9. In order to compare raw scores with the performance of the standardization sample, they are converted into derived scores.
10. Percentile norms are also reported through an ogive, a graph that shows the cumulative percentage of scores falling below the upper limit of each class interval.
11. The ogive is used for determining how many of a set of observations are less than or equal to a specific value.

Types and methods

Quantitative norms: Allow us to evaluate an individual's performance in terms of the performance of the most nearly comparable standardization sample. They are uniform and have a clearly defined quantitative meaning, e.g. percentiles, deciles, standard scores, T and z scores, and stanines.

Percentile norms: The percentile rank is a type of converted score that expresses an individual's score relative to their group in percentile points. For the percentile norm to be meaningful, the sample should be made homogeneous with respect to gender, age and other factors. Percentiles are easy to calculate, understand and interpret, and make no assumption about the characteristics of the population.

Deciles: Points which divide the scale of measurement into 10 equal parts. Deciles range from decile 1 to decile 9; a decile score of 1 indicates that 10% of cases lie below, i.e. the lowest 10% of the group.

Standard scores: A set of scores with the same mean and standard deviation. Converting raw scores into standard scores allows the scores to be compared, showing how many people scored above or below a given score. Standard scores can also be used to illustrate how well someone did in comparison to others (the relative standing of the person's standard score) and to determine how many people scored between 2 scores. They have an equal unit of measure. Raw scores can be converted into standard scores by 2 methods: linear and normative transformation. In a linear transformation, all characteristics of the original distribution of raw scores are retained without any change in the shape of the distribution. In a normative transformation, skewed distributions of raw scores are adjusted to produce a normal frequency distribution and converted to a standard base. (See the sketch after this list.)

Stanines: Or "standard nine", consists of 9 statistical units on a scale of 1 to 9, used to show a student's performance on a psychological or educational exam. Stanines are easy to interpret because there are only 9 units: 1, 2 and 3 are below average, 4, 5 and 6 are average, and 7, 8 and 9 are above average. They are less precise than other standard scores.

Qualitative norms: Developmental norms are developed for psychological constructs which develop over time. They are supplemented by percentiles or standard scores. They have appeal for descriptive purposes and for certain research purposes, e.g. age, grade and gender norms.

Age norms: Relate the level of performance to the age of the people taking the test; the median score on a test obtained by persons of a given chronological age.

Grade norms: The median score on a test obtained by students of a given grade level. They are most popular when reporting achievement levels of school children, and are useful for teachers to understand how well students are progressing at a grade level.
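A minimal sketch of converting made-up raw scores into z-scores, T-scores, percentile ranks and (approximate) stanines:

```python
import numpy as np
from scipy import stats

raw = np.array([55, 62, 48, 71, 66, 59, 75, 52, 68, 60])  # hypothetical raw scores

# Linear transformation: z-scores keep the shape of the raw distribution.
z = (raw - raw.mean()) / raw.std(ddof=1)
t_scores = 50 + 10 * z                    # T-scores: mean 50, SD 10

# Simple percentile ranks: percentage of scores at or below each score.
percentile_ranks = stats.rankdata(raw, method="average") / len(raw) * 100

# Rough stanines: mean 5, SD about 2, clipped to the 1-9 scale.
# (Exact stanines use fixed percentage bands; this rounding is a common shortcut.)
stanines = np.clip(np.round(5 + 2 * z), 1, 9).astype(int)

print(t_scores.round(1))
print(percentile_ranks)
print(stanines)
```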

Scaling and measurement


1. Levels of measurement are also known as scales of measurement.
2. They are important factors in determining how data are analyzed by researchers.
3. Measurement means assigning numbers or other symbols to characteristics of objects according to certain pre-specified rules.
4. We measure not the object itself but some characteristic of it.
5. Scaling can be considered an extension of measurement.
6. It involves creating a continuum upon which measured objects are located.
7. For example, the process of placing respondents on a continuum with respect to their attitude towards banks.
8. There are 4 scales/levels.
9. The lowest in the hierarchy is nominal and the highest is ratio.

Nominal: Involves variables that represent some type of non-numerical or qualitative data. Values are grouped into categories that have no meaningful order. Nominal variables are also known as categorical data. Typical descriptive statistics associated with nominal data are frequencies and percentages, and the chi-square test is appropriate. Nominal data cannot be used to perform many statistical computations such as the mean and standard deviation.

Ordinal: Represents some type of relative ranking or an order of categories (rank-order data). Ordinal variables are nominal-level variables with a meaningful order, and can also be categorical if the categories are measured in a specific order. They are described with frequencies and percentages. The Mann-Whitney U test is most apt for an ordinal-level dependent variable with a nominal-level independent variable. Surveys often use Likert scales. Means, standard deviations and parametric statistical tests are not apt for ordinal data.

Interval: Represents variables in which numerical values represent equal amounts of whatever is being measured. Also referred to as equal-interval data; it involves data grouped in evenly distributed values, e.g. temperature. Interval data can be used to compute commonly used statistical measures such as the average, standard deviation and Pearson correlation coefficient.

Ratio: A type of equal-interval measurement that has an absolute zero point; a value of zero indicates a complete absence of what is being measured. A lot of things are measured using ratio scales. All arithmetic operations are possible on a ratio variable. ANOVA is most apt for a continuous-level dependent variable with a nominal-level independent variable. Ratio data are used as the dependent variable for most parametric statistical tests such as t-tests, F-tests and regression. (See the sketch after this list.)
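A minimal sketch of which descriptive statistics suit each level, using made-up data:

```python
import statistics

# Hypothetical data at each level of measurement:
colors = ["red", "blue", "red", "green"]   # nominal: only the mode is meaningful
ranks = [1, 2, 2, 3, 4]                    # ordinal: order-based statistics apply
temps_c = [20.5, 22.0, 19.5, 21.0]         # interval: mean and SD are meaningful
weights_kg = [60.0, 75.5, 82.3]            # ratio: true zero, so ratios apply

print(statistics.mode(colors))             # the only sensible "average" for nominal data
print(statistics.median(ranks))            # median respects order without assuming equal intervals
print(statistics.mean(temps_c), statistics.stdev(temps_c))
print(weights_kg[1] / weights_kg[0])       # "x times as heavy" needs a true zero
```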

Scaling techniques
1. In scaling, the objects are text statements, usually statements of attitude, opinion or feeling.
2. Scaling involves creating a continuum upon which measured objects are located.
3. Scaling techniques can be classified into 2 categories:

1. Comparative scales: The respondent is asked to compare one object to another.

Paired comparison scale: Respondents are presented with 2 objects at a time and asked to select one object according to some criterion. The data are ordinal in nature. The technique is useful when the number of brands is limited, since it requires direct comparison and overt choice; however, the scale has little resemblance to the market situation, which involves selection from multiple alternatives. (See the sketch after this list.)

Rank order scale: Respondents are presented with several items simultaneously and asked to rank them in order of priority. It identifies the favoured and unfavoured objects but does not reveal the distance between the objects. It is more realistic in obtaining responses and yields better results when direct comparison is required; its limitation is that only ordinal data can be generated.

Constant sum scale: Respondents are asked to allocate a constant sum of units, such as rupees or points, among a set of stimulus objects with respect to some criterion; for example, they might be asked to divide a constant sum to indicate the relative importance of the attributes.

Q-sort scale: Uses a rank order procedure to sort objects based on similarity with respect to some criterion. It is more important to make comparisons among the different responses of a single respondent than between the responses of different respondents.
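A minimal sketch of tallying hypothetical paired-comparison choices into an ordinal ranking:

```python
# Hypothetical paired-comparison data: each tuple is (pair shown, object chosen).
brands = ["A", "B", "C"]
choices = [(("A", "B"), "A"), (("A", "C"), "C"), (("B", "C"), "C"),
           (("A", "B"), "A"), (("A", "C"), "A"), (("B", "C"), "B")]

# Tally how often each brand is preferred; the win counts give an ordinal ranking.
wins = {b: 0 for b in brands}
for _pair, winner in choices:
    wins[winner] += 1

ranking = sorted(brands, key=wins.get, reverse=True)
print(wins, ranking)  # ordinal data: the order is meaningful, the distances are not
```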

2. Non-comparative scales: Respondents need only evaluate a single object; they employ whatever rating standard seems apt to them.

Continuous rating scale: Simple and highly useful. The respondents rate the object by placing a mark at the apt position on a continuous line that runs from one extreme of the criterion variable to the other. The respondent's score is determined either by dividing the line into as many categories as desired and assigning the respondent a score based on the category into which the mark falls, or by measuring the distance from either end of the scale.

Itemised rating scale: A scale having numbers or brief descriptions associated with each category. The categories are ordered in terms of scale position, and the respondents are required to select one of the limited number of categories that best describes the product, brand, company or product attribute being rated. There are 3 common itemised scales. The Likert scale, developed by Rensis Likert, is popular for measuring attitudes because the method is simple to administer; respondents indicate their own attitude by checking how strongly they agree or disagree with carefully worded statements that range from very positive to very negative towards the attitudinal object. The semantic differential scale is a 7-point rating scale with endpoints associated with bipolar labels (such as good and bad) that have semantic meaning; it is used to find whether a respondent has a positive or negative attitude towards an object, and has been used to develop advertising and promotion strategies. In the Stapel scale, only the extremes have names, which represent the bipolar adjectives, with the central category representing the neutral position; the in-between categories have blank spaces, and a weight is assigned to each position on the scale.

Intelligence tests
1. Intelligence tests are series of questions and exercises designed to measure the intelligence of an individual.
2. They can be administered to children as well as adults.
3. The first intelligence test was published in 1905. Late in the 19th century, Alfred Binet had founded the first experimental psychology research laboratory in France.
4. In his lab, Binet attempted to develop experimental techniques to measure intelligence and reasoning abilities.
5. He was successful in measuring intelligence, and in 1905 he and Theodore Simon published the first test of mental ability, the Binet-Simon scale. The assessment was intended for children between the ages of 3 and 13 years. It consisted of an arrangement of 56 questions and tasks designed to distinguish between children with cognitive deficits and those who were lower performing due to lack of motivation or laziness.
6. It measures intelligence without regard to educational experience.
7. It was originally created in 1904 and has since undergone various revisions.
8. In 1916, Lewis Terman produced the Stanford-Binet Intelligence Scales, an adaptation of Binet's original test, used with Americans from age 3 to adulthood.
9. It consists of assessments in the areas of fluid reasoning, knowledge, quantitative reasoning, visual-spatial processing and working memory.
10. It contains 10 subtests that take approximately 10 minutes each and is used for ages 2-85 years.
11. The test is composed of mostly timed portions, which some researchers recognize as problematic. At the conclusion, a single score is obtained.
12. In 1939, the Wechsler-Bellevue Intelligence Scale (WBIS) became popular.
13. It indexed the general mental ability of adults and revealed a pattern of a person's intellectual strengths and weaknesses.
14. The WBIS was developed by David Wechsler at Bellevue Hospital, based on an individual's ability to act purposefully, think logically, and interact with and cope successfully with the environment.
15. A 2nd edition appeared in 1946.
16. In 1955, the revised 2nd edition was renamed the WAIS.
17. In 1981 and 1997, the WAIS was updated as the WAIS-R and the WAIS-III.
18. To enhance the clinical utility and user-friendliness of the test, the 4th edition was published in 2008.
19. Raven's Standard Progressive Matrices (RSPM), published in 1938, assesses abstract reasoning through a nonverbal test. The Raven's matrices include 3 assessments that can be given from as early as age 5 all the way through to elderly people.
20. It aims to eliminate cultural bias.
21. It is a progressive test, with increasing complexity.
22. The original 1938 form was entirely black and white.
23. The 2nd test is in color, which is said to be more visually stimulating, and is intended for people with cognitive disabilities or of advanced age.
24. The 3rd test contains more tasks and is intended for people from adolescence through adulthood who are believed to have advanced cognition.

Performance tests
1. Performance tests require the test takers to perform a particular well-defined task, such as making a right-hand turn or arranging blocks.
2. Test takers try to do their best, because their scores are determined by their success in completing the task.
3. Examples include driving tests, tests of specific abilities and classroom tests.

Aptitude tests
1. Achievement tests measure a test taker's knowledge in a specific area at a specific point in time; an aptitude test assesses a test taker's potential for learning or ability to perform in a new job or situation.
2. Aptitude tests measure the product of cumulative life experiences, or what one has acquired over time.

Personality tests
1. Personality tests measure human character or disposition.
2. The early 1900s brought an interest in measuring the personality of individuals.
3. The Personal Data Sheet: during World War I, the US military wanted a test to help detect soldiers who would not be able to handle the stress associated with combat.
4. The APA commissioned Robert Woodworth to design such a test, which came to be known as the PDS: a paper-and-pencil psychiatric interview to which respondents answered yes or no, originally 200 questions, later reduced to 116. After World War I, Woodworth developed the Woodworth Psychoneurotic Inventory, which was used with civilians and was the first self-report test.

Projective personality tests
1. These explore the unconscious.
2. Subjects are shown ambiguous images or given situations and asked to interpret them.
3. Subjects are expected to project their own emotions, attitudes and impulses onto the given stimulus, and these projections are used to explain an image, tell a story or finish a sentence.
4. Projective tests emerged in the 1920s and 1930s.
5. Their popularity and use have risen and fallen at various times.
6. Proponents argue that the way a person interprets the stimuli tells a lot about them as a person, while opponents argue that the tests lack reliability and validity.
7. Rorschach: the first projective test, published by Hermann Rorschach in 1921. Useful in depression, anxiety and psychosis.
8. TAT: developed by Henry A. Murray and C. D. Morgan. Both tests are based on the theories of Jung; the TAT uses 8-12 pictures.
9. Draw-a-Person: interpretation considers the size, features and any added details of the person drawn.
10. House-Tree-Person: the subject is asked a series of questions about the drawings, such as who lives in this house or how the person is feeling. Developed by John Buck, it included 60 questions.
11. California Psychological Inventory: a structured personality inventory that assesses 20 attributes of normal personality (e.g. dominance, independence, well-being). The 480-item test was published by Harrison Gough in 1968. The scale was revised in 1987 and reduced to 462 items, with norms based on 6,000 males and 7,000 females. The 3rd revision was reduced by a further 28 items, with norms based on 3,000 males and 3,000 females. It measures interpersonal behaviour and social interaction.
12. Half of the items of the original version were taken directly from the MMPI.
13. Criticism: some criterion groups used in establishing the scales consisted of people identified by their friends as being high or low on the trait.
14. Four of the scales (social presence, self-acceptance, self-control and flexibility) were developed by selecting items that theoretically were designed to measure the construct.

NEO
1. Developed by Costa and McCrae in 1985 and 1992.
2. It measures 5 primary dimensions of personality, called the Big Five.
3. It is intended for normal adults ranging from 20 to 80 years.
4. Six facets underlie each of the major constructs.
5. It has 240 items (8 per facet across the 30 facets), with 3 additional validity-check items.
6. It takes about 30 minutes to complete.
7. N (Neuroticism) indicates the degree to which a person is anxious and insecure; its 6 facets are anxiety, hostility, depression, self-consciousness, impulsiveness and vulnerability.
8. O (Openness) indicates the degree to which a person is imaginative and curious versus concrete and narrow-minded; its 6 facets are fantasy, aesthetics, feelings, actions, ideas and values.
9. A (Agreeableness) indicates the degree to which a person is warm and cooperative versus unpleasant and disagreeable; its 6 facets are trust, modesty, compliance, altruism, straightforwardness and tender-mindedness. C (Conscientiousness) indicates the extent to which the person is persevering and responsible; its 6 facets are competence, self-discipline, achievement striving, dutifulness, order and deliberation.
10. A short version of 60 items assesses only the 5 major constructs and takes 15 minutes to complete.
11. Forms allow for self-report and observer report.
12. Response sheets can be hand- or machine-scored.

Problems in psychological measurements


1. Indirectness of Measurement: Various psychological attributes are accessible to research and
measurement only indirectly.
2. For example, if a researcher is interested in measuring the personality dimensions of a subject, this is not directly available for measurement in the way physical quantities, like length, are visible and concretely available for observation and assessment. The only way to measure it is to assess the person on a set of overt or covert responses related to his personality or other psychological attributes of interest.
3. Lack of Absolute Zero: Absolute zero, in case of psychological measurement, means a situation
where the property being measured does not exist. The absolute zero is available in the case of
physical quantities, like length, but is very difficult to decide in the case of psychological
attributes.
4. We Measure a Sample of Behaviour not the Complete Behaviour : In psychological
measurements, a complete set of behavioural dimensions is not possible and we take only a
carefully chosen sample of behavioral dimensions to assess the attributes in question.
5. Uncertainty and Desirability Involved in Human Responses: Test subjects often give uncertain and socially desirable responses, which generally negates the entire purpose of the psychological measurement. Uncertainty may arise due to negligence on the part of the researcher, carelessness on the part of the subjects, or uncontrolled extraneous variables.
6. Variability of Human Attributes Over Time: Various human attributes, like intelligence,
personality, attitude, and so on, are likely to vary over a period of time, and sometimes even
hours are sufficient to provide scope for such variations.
7. Psychological attributes are highly dynamic and they continuously undergo organisation and
reorganization.

Structured Interviews
1. Clinicians frequently interview clients as a part of the assessment process, to gather information they will use to help diagnose problems and plan treatment programs.
2. The typical clinical interview does not ordinarily qualify as a psychological test; the clinical interview is not intended to measure samples of behavior in order to make inferences.
3. The clinician merely asks unstructured, yet purposeful, questions to gather information.
4. Some semi-structured interviews, such as the Structured Clinical Interview for DSM-IV Axis I Disorders, cover a wide range of mental health concerns.
5. Other semi-structured interviews are concerned with a single diagnosis, such as the Brown Attention-Deficit Disorder Scales or the Yale-Brown Obsessive Compulsive Scale.
6. Acknowledged as the gold-standard measure of obsessive-compulsive disorder (OCD), a mental disorder characterized by repetitive, intrusive ideas or behaviors, the Y-BOCS measures the presence and severity of symptoms.

Behavior Rating Scales


1. Clinicians who treat children frequently use behavior rating scales.
2. Clinicians can use the scales early in treatment, to develop a treatment plan, or at any time in treatment to clarify a diagnosis and revise the treatment plan.
3. As the name implies, behavior rating scales typically require an informant, usually a parent or teacher, to rate a client with regard to very specific behaviors.
4. There are also self-report versions of behavior rating scales on which clients rate their own behaviors.
5. The Child Behavior Checklist is a good example of a behavior rating scale; it serves as a standard test of psychopathology that others use to gather convergent validity evidence.
6. There are over 100 individual items that are rated.

Symptom Checklists and Symptom-Based Self-Report


1. Clinicians also use symptom checklists and self-report tests, which clients complete themselves.
Important points

● Demand characteristics are the features of a study which give cues about how someone is meant to behave.
● Reactivity refers to the phenomenon whereby people's behavior is affected by the knowledge that they are being observed.
● A measure that is capable of differentiating one group of participants from another group on a particular construct may have good discriminant validity.
● Personal Orientation Inventory: Maslow
● Q-sort: Rogers
● House-Tree-Person test: Buck
● Objective-Analytic Test Battery: Cattell
