Validity
For example, if people weigh themselves several times during the day, they would
expect to see similar readings each time. A scale that measured weight
differently on each occasion would be of little use.
Error Variance refers to the portion of the total variability in test scores that is caused by
factors unrelated to the true construct or characteristic being measured. It represents random
fluctuations or inconsistencies that affect test scores, making them less accurate in reflecting
an individual’s actual ability or trait. Common sources of error variance include the following:
1. Test Administration Factors: Variations in the environment during the test, like
noise, poor lighting, or uncomfortable seating, can distract test-takers and influence
their performance.
2. Test-Taker Factors: The individual’s mood, health, motivation, fatigue, or level of
anxiety can fluctuate and cause performance to vary independently of the true trait
being measured.
3. Test Construction Factors: Ambiguous questions, poorly worded items, or
inconsistencies in the difficulty of test items can introduce error.
4. Scoring Inconsistencies: Differences in how the test is scored, especially in
subjective tests like essay assessments, can also add to Error Variance.
Imagine you are taking an intelligence test, but on the day of the test, you are feeling unwell.
Your performance might be lower than your true level of intelligence. Similarly, if a question
on the test is unclear or ambiguous, people might interpret and answer it differently, not
based on their true ability. These factors create discrepancies in scores that are unrelated to
actual differences in intelligence, contributing to Error Variance.
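As a rough illustration (a hypothetical simulation, not taken from the text), classical test theory treats each observed score as a true score plus random error, so error variance is the share of observed-score variance that does not reflect true differences between people. The short Python sketch below makes that split visible:

import numpy as np

rng = np.random.default_rng(42)

n_people = 1000
true_scores = rng.normal(loc=100, scale=15, size=n_people)  # stable trait (e.g., "true" ability)
error = rng.normal(loc=0, scale=5, size=n_people)           # random error (mood, noise, scoring)
observed = true_scores + error                              # classical test theory: X = T + E

# Because true scores and errors are independent, observed variance is
# (approximately) true-score variance plus error variance.
print("True-score variance:", round(true_scores.var(ddof=1), 1))
print("Error variance:     ", round(error.var(ddof=1), 1))
print("Observed variance:  ", round(observed.var(ddof=1), 1))

# Reliability can be read as the proportion of observed variance that is true variance.
print("Reliability estimate:", round(true_scores.var(ddof=1) / observed.var(ddof=1), 2))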
The correlation coefficient is a statistical measure that describes the strength and direction
of a relationship between two variables. It tells you how closely two variables move in
relation to each other. In psychology, correlation coefficients are often used to determine how
one psychological variable relates to another, such as the relationship between stress and
performance or self-esteem and social interaction.
• r = +1: This indicates a perfect positive correlation, meaning that as one variable
increases, the other variable also increases proportionally. For example, height and
weight usually have a positive correlation; as height increases, weight often increases.
• r = -1: This indicates a perfect negative correlation, meaning that as one variable
increases, the other variable decreases proportionally. For example, stress levels and
the quality of sleep might have a negative correlation; as stress increases, the quality
of sleep decreases.
• r = 0: This indicates no correlation, meaning there is no linear relationship between
the two variables. For example, shoe size and intelligence would likely have a
correlation coefficient close to zero.
Strength of Correlation
• 0.1 to 0.3 (or -0.1 to -0.3): Weak correlation. There is a slight relationship between
the variables, but it is not strong.
• 0.3 to 0.5 (or -0.3 to -0.5): Moderate correlation. There is a noticeable relationship
between the variables.
• 0.5 to 1.0 (or -0.5 to -1.0): Strong correlation. The variables have a strong and
consistent relationship.
These classifications can vary slightly depending on the context and field of research.
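Purely as an illustration, these rule-of-thumb bands could be written as a small helper function (the classify_strength name and the "negligible" label for values below 0.1 are additions for this sketch, not from the text):

def classify_strength(r: float) -> str:
    """Rule-of-thumb label for the strength of a correlation coefficient."""
    magnitude = abs(r)          # the sign only indicates direction, not strength
    if magnitude >= 0.5:
        return "strong"
    if magnitude >= 0.3:
        return "moderate"
    if magnitude >= 0.1:
        return "weak"
    return "negligible"         # below the weak band listed above

print(classify_strength(0.42))   # moderate
print(classify_strength(-0.75))  # strong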
Types of Correlation
1. Positive Correlation: As one variable increases, the other variable also increases.
Example: The relationship between study time and test scores. More time spent
studying generally leads to higher scores.
2. Negative Correlation: As one variable increases, the other variable decreases.
Example: The relationship between the number of hours spent watching TV and
academic performance. More hours watching TV may be associated with lower
academic performance.
3. Zero Correlation: There is no discernible pattern or relationship between the
variables. Example: The relationship between shoe size and personality traits.
The most common method for calculating the correlation coefficient is Pearson's
correlation coefficient (r), which measures the linear relationship between two variables. The
formula is:
r = Σ(X − X̄)(Y − Ȳ) / √[ Σ(X − X̄)² × Σ(Y − Ȳ)² ]
where X and Y are the paired scores and X̄ and Ȳ are their means.
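For instance, the short Python sketch below computes r directly from this formula for some made-up study-time and test-score data (hypothetical values, not from the text) and cross-checks the result against scipy.stats.pearsonr:

import numpy as np
from scipy import stats

# Hypothetical data: hours studied and test scores for eight students
hours = np.array([2, 4, 5, 6, 7, 8, 9, 11], dtype=float)
scores = np.array([55, 60, 62, 70, 68, 75, 80, 86], dtype=float)

# Pearson's r computed directly from the formula
dx = hours - hours.mean()
dy = scores - scores.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Cross-check with SciPy
r_scipy, p_value = stats.pearsonr(hours, scores)

print(f"r (manual) = {r_manual:.3f}")
print(f"r (scipy)  = {r_scipy:.3f}, p = {p_value:.4f}")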
Types of Reliability
Test-Retest Reliability
Definition: Test-retest reliability measures the consistency of test scores over time. It
evaluates whether the same test, given to the same group of people at two different points in
time, produces similar results. High test-retest reliability indicates that the test is stable and
dependable over time.
How It Works
The same test is administered to the same group of people at two points in time, and the two
sets of scores are correlated, typically using Pearson's correlation coefficient (r), as sketched
in the example below.
• High Reliability: An r value close to 1.0 suggests that the test is highly reliable over
time. A common rule of thumb is that a reliability coefficient of 0.7 or higher is
acceptable, though this depends on the context and purpose of the test.
• Low Reliability: An r value significantly below 0.7 suggests that the test may not
be stable over time, which could mean that the measured trait fluctuates or that the
test is unreliable.
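A minimal sketch of that calculation (hypothetical scores, not from the text):

from scipy import stats

# Hypothetical anxiety-scale scores for the same six people, tested two weeks apart
time1 = [24, 31, 18, 40, 27, 35]
time2 = [26, 29, 20, 38, 30, 33]

# Test-retest reliability is simply the correlation between the two administrations
r, _ = stats.pearsonr(time1, time2)
print(f"Test-retest reliability: r = {r:.2f}")   # values near 1.0 indicate stable scores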
Factors Affecting Test-Retest Reliability
1. Time Interval Between Tests: The length of time between the first and second test
administrations can impact reliability. If the interval is too short, participants may
remember their answers, inflating reliability. If the interval is too long, the measured
trait may genuinely change, reducing reliability.
2. Practice Effects: Participants may perform differently on the second test simply
because they are more familiar with the test format or content, which can influence
the scores.
3. Changes in Participants: Natural changes in the participants' psychological or
physical state (e.g., mood, health) between the two testing times can impact the
results.
4. Measurement Error: Inconsistencies in test administration or environmental factors
(e.g., noise, distractions) can also affect reliability.
Alternate-Form Reliability
Definition: Alternate-form reliability (also called parallel-forms reliability) assesses the
consistency of scores across two different but equivalent versions of a test. The idea is to
determine whether the two versions of the test are equivalent and produce similar results
when administered to the same group of people.
How It Works
1. Test Construction: Two parallel or alternate forms of the test are developed. Both
forms are designed to have the same number of items, similar difficulty levels, and the
same content coverage, but the specific items differ.
2. Administration: Both forms are administered to the same group of individuals, either
simultaneously or with a short time interval to prevent significant changes in the
underlying construct.
3. Calculation: The scores from the two forms are then compared using a correlation
coefficient, such as Pearson’s correlation. A high correlation indicates strong
alternate-form reliability, suggesting that both forms measure the construct similarly.
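A brief sketch of step 3 (hypothetical Form A and Form B scores, not from the text); comparing the means and standard deviations is also a quick check that the two forms are roughly equivalent:

import numpy as np
from scipy import stats

# Hypothetical scores for the same ten people on two parallel forms of a test
form_a = np.array([78, 85, 62, 90, 74, 88, 69, 81, 76, 93])
form_b = np.array([80, 83, 65, 92, 70, 86, 72, 79, 78, 90])

# Quick equivalence check: the forms should have similar means and spreads
print(f"Form A: mean = {form_a.mean():.1f}, SD = {form_a.std(ddof=1):.1f}")
print(f"Form B: mean = {form_b.mean():.1f}, SD = {form_b.std(ddof=1):.1f}")

# Alternate-form reliability: correlation between the two sets of scores
r, _ = stats.pearsonr(form_a, form_b)
print(f"Alternate-form reliability: r = {r:.2f}")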
Advantages
1. Reduces Memory Effects: Since the items on the two test forms are different, this
method reduces the impact of memory or learning effects that can influence test-retest
reliability.
2. Versatile: Useful in situations where repeated testing with the same items would not
be practical or might lead to practice effects (e.g., standardized testing or academic
assessments).
Limitations
1. Difficult to Create Equivalent Forms: Developing two test forms that are truly
equivalent in terms of difficulty, content, and construct measurement is challenging
and time-consuming.
2. Administration and Fatigue: Administering two forms of the test can lead to
participant fatigue, especially if the tests are long or difficult.
3. Practical Constraints: It may not always be feasible to have two separate test forms,
especially in smaller-scale research or testing situations.
Factors Affecting Alternate-Form Reliability
1. Quality of Test Construction: The degree of similarity between the two forms
affects the reliability coefficient. If the forms are not well-matched in terms of
difficulty and content, reliability will be lower.
2. Time Interval: If both forms are administered at different times, changes in the test-
taker’s psychological or physical state can impact scores. It’s ideal to minimize the
time between test administrations if possible.
3. Environmental Factors: Consistency in testing conditions (e.g., noise level, lighting,
and instructions) is essential to obtain reliable results.
Best Practices
1. Careful Test Design: Invest time in creating two truly parallel forms of the test. This
involves using item analysis and expert judgment to ensure that the forms are
equivalent.
2. Pilot Testing: Administer both forms to a small sample before full-scale testing to
identify any significant differences in item difficulty or performance.
3. Consistent Administration: Ensure that the instructions and testing conditions are
the same for both forms to minimize variability unrelated to the test itself.
Split-Half Reliability
Definition: Split-half reliability is a measure of internal consistency that assesses how well
a test’s items measure the same construct. It is determined by splitting a test into two halves
(e.g., dividing the items into odd and even numbers or by random assignment) and then
measuring the consistency of the scores from these two halves. A high correlation between
the two sets of scores indicates that the test is internally reliable.
How It Works
1. Test Splitting: A test is divided into two equal halves in a way that attempts to
balance difficulty and content across both halves. The split can be done:
o Randomly.
o By taking the first half of the items versus the second half.
o Using odd-numbered items versus even-numbered items.
2. Scoring and Correlation: Scores from each half are calculated for every test-taker.
Then, the correlation between the two sets of scores is computed using a statistical
method, like Pearson’s correlation coefficient.
3. Spearman-Brown Prophecy Formula: Because the correlation between the two
halves underestimates the reliability of the full test, a correction is applied using the
Spearman-Brown prophecy formula to estimate the reliability of the entire test.
Formula
The Spearman-Brown prophecy formula estimates the reliability of the full-length test from
the correlation between its two halves (r_half):
r_full = (2 × r_half) / (1 + r_half)
Example
If the correlation between the two halves of a test is 0.60, the estimated reliability of the full
test is (2 × 0.60) / (1 + 0.60) = 1.20 / 1.60 = 0.75.
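A minimal sketch of the whole procedure (hypothetical 0/1 item scores, not from the text), using an odd/even split followed by the Spearman-Brown correction:

import numpy as np
from scipy import stats

# Hypothetical data: 8 people x 10 items scored 1 (correct) or 0 (incorrect)
items = np.array([
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 1],
    [1, 1, 1, 0, 1, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
])

# Split into odd- and even-numbered items and sum each half for every person
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

# Correlate the two half-test scores
r_half, _ = stats.pearsonr(odd_half, even_half)

# Spearman-Brown correction estimates the reliability of the full-length test
r_full = (2 * r_half) / (1 + r_half)

print(f"Half-test correlation:                {r_half:.2f}")
print(f"Spearman-Brown full-test reliability: {r_full:.2f}")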
Interpretation of Split-Half Reliability
• High Reliability: A value close to 1 indicates high internal consistency, meaning the
items on the test are measuring the same underlying construct.
• Low Reliability: A value far from 1 indicates that the test items may not be consistent
or that the test might measure multiple constructs rather than a single one.
Kuder-Richardson Formulas and Cronbach's Alpha
1. Kuder-Richardson Formulas (KR-20 and KR-21): Internal consistency estimates
designed for tests whose items are scored dichotomously (e.g., right/wrong, yes/no).
2. Cronbach's Alpha: A more general internal consistency estimate, calculated as
α = [K / (K − 1)] × [1 − (Σσi²) / σt²]
where:
▪ K = number of items
▪ σi² = variance of each individual item
▪ σt² = variance of the total scores
3. Key Differences:
o Data Type: The Kuder-Richardson formulas are specifically for dichotomous
items, whereas Cronbach’s alpha is used for items that have more than two
response options (e.g., rating scales).
o Use: KR-20 and KR-21 are a form of internal consistency measurement
similar to Cronbach’s alpha but are specifically tailored to tests with
dichotomous outcomes.
4. Relationship and Interpretation:
o Both KR-20 and Cronbach’s alpha give an estimate of the test’s reliability,
which refers to the consistency or stability of the test scores. Higher values
(typically above 0.7) indicate better internal consistency.
o If you are working with dichotomous items and the assumptions for KR-20 are
met, KR-20 can be used. Otherwise, for general scales with multiple response
categories, Cronbach’s alpha is more appropriate.
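As an illustration (hypothetical 0/1 item responses, not from the text), the sketch below computes KR-20 from item difficulties (p × q) and Cronbach's alpha from item variances; for dichotomous items the two estimates coincide:

import numpy as np

# Hypothetical data: 6 people x 5 dichotomously scored items (1 = correct, 0 = incorrect)
X = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1],
])

k = X.shape[1]                     # K = number of items
total_scores = X.sum(axis=1)       # each person's total score
var_total = total_scores.var()     # σt² = variance of the total scores

# KR-20: uses p * q for each item (only meaningful for 0/1 items)
p = X.mean(axis=0)                 # proportion answering each item correctly
q = 1 - p
kr20 = (k / (k - 1)) * (1 - (p * q).sum() / var_total)

# Cronbach's alpha: uses the variance of each item (works for any scoring format)
item_vars = X.var(axis=0)          # σi² for each item
alpha = (k / (k - 1)) * (1 - item_vars.sum() / var_total)

print(f"KR-20            = {kr20:.3f}")
print(f"Cronbach's alpha = {alpha:.3f}")   # identical to KR-20 for dichotomous items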