Assessments and Interpretation
Levels of measurement:
Levels of measurement refer to the different ways in which data can be measured or classified.
The four levels of measurement are nominal, ordinal, interval, and ratio.
Nominal data are categorical data with no inherent order or ranking (e.g., colors, gender).
Ordinal data are categorical data with an inherent order or ranking (e.g., grades, socioeconomic status).
Interval data have equal intervals between values, but no true zero point (e.g., temperature in Celsius or
Fahrenheit).
Ratio data have equal intervals between values and a true zero point (e.g., height, weight).
The nominal level of measurement is the lowest level of measurement, in which data are classified or categorized based on some characteristic or attribute. At this level, data are grouped into categories or classes without any inherent order or numerical value.
Examples of data measured at the nominal level include gender (male/female), nationality (American,
British, French, etc.), marital status (single, married, divorced, etc.), and type of fruit (apple, banana,
orange, etc.).
At the nominal level of measurement, data are distinguished only by their names or labels and cannot be ordered or ranked. One category is not greater or lesser than another; it is simply different. Therefore,
statistical techniques that involve arithmetic operations such as mean, standard deviation, and
correlation cannot be applied to nominal data. Instead, frequency counts and percentages are typically
used to summarize and analyze nominal data.
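Because only frequency counts and percentages apply, summarizing nominal data is straightforward. A minimal sketch (Python standard library; the category labels are invented for illustration):

    # Summarizing nominal data with frequency counts and percentages.
    from collections import Counter

    marital_status = ["single", "married", "married", "divorced", "single", "married"]

    counts = Counter(marital_status)              # frequency count per category
    total = sum(counts.values())
    for category, count in counts.items():
        print(f"{category}: {count} ({100 * count / total:.1f}%)")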
The ordinal level of measurement is a type of measurement in which data can be ranked or ordered
based on their relative position on a scale, but the differences between the values are not necessarily
meaningful.
In other words, the data can be arranged in a particular order or sequence, but the actual values of the
data points may not have a meaningful interpretation in terms of numerical differences. Examples of
data measured at the ordinal level include rankings (e.g., first, second, third), Likert scales (e.g., strongly
agree, agree, neutral, disagree, strongly disagree), and grades (e.g., A, B, C, D, F).
The ordinal level of measurement is less informative than the interval or ratio levels, which provide information about the magnitude of the differences between data points. However, it is more informative than the nominal level, which only allows data to be classified into categories without any ordering.
The interval level of measurement is a type of measurement in which the data are measured on a
numerical scale, and the intervals between values are equal in size. In other words, the difference
between any two adjacent values is the same, but there is no meaningful zero point.
Examples of data measured at the interval level include temperature (in Celsius or Fahrenheit), years
(e.g., 2000, 2001, 2002, etc.), and IQ scores. In temperature measurements, the difference between 10
and 20 degrees Celsius is the same as the difference between 20 and 30 degrees Celsius. However, there
is no meaningful zero point; 0 degrees Celsius does not indicate the absence of temperature.
Interval data can be used in many statistical techniques, such as computing means, standard deviations, and correlations. However, because there is no true zero point, ratios between measurements are not meaningful: it makes no sense to say that 20 degrees Celsius is twice as hot as 10 degrees Celsius.
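A small numerical sketch (Python; the temperature values are chosen purely for illustration) makes the point by converting the same readings to Kelvin, a ratio scale with a true zero:

    # Ratios are not meaningful on an interval scale such as Celsius.
    c1, c2 = 10.0, 20.0                  # two illustrative Celsius readings
    print(c2 / c1)                       # 2.0 -- but 20 C is not "twice as hot" as 10 C

    k1, k2 = c1 + 273.15, c2 + 273.15    # the same temperatures in Kelvin (ratio scale)
    print(k2 / k1)                       # ~1.035 -- the physically meaningful ratio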
True Zero
In the context of measurement scales, a true zero point refers to a point on the scale where the value of
zero indicates the complete absence of the measured quantity.
For example, weight measured on a scale with a true zero point means that a weight of zero indicates
the complete absence of any mass. Other examples of measurements with true zero points include
length, volume, and count data.
In contrast, on scales without true zero points (such as the interval scale), a value of zero does not represent the complete absence of the measured quantity. For instance, temperature measured in Celsius or Fahrenheit has an arbitrary zero point, so a reading of zero is not a true zero.
The presence or absence of a true zero point has important implications for the interpretation of data
and the statistical methods used to analyze them. For example, when working with data that have a true
zero point, ratios can be computed, and multiplication and division can be used to compare values. This
is not possible with data that do not have a true zero point.
The ratio level of measurement is the highest level of measurement: data are measured on a numerical scale with equal intervals between values and a true zero point.
Examples of data measured at the ratio level include length (in centimeters, inches, etc.), weight (in
kilograms, pounds, etc.), time (in seconds, minutes, etc.), and count data (such as the number of books,
the number of people, etc.).
In ratio level of measurement, all arithmetic operations such as addition, subtraction, multiplication, and
division can be applied, and ratios between measurements are meaningful. For example, if person A
weighs 60 kg, and person B weighs 80 kg, then person B weighs 1.33 times as much as person A.
The ratio level of measurement allows for the most extensive statistical analyses, such as measures of
central tendency (mean, median, and mode), measures of variability (range, standard deviation, and
variance), and correlation coefficients.
Central tendency:
The three measures of central tendency are the mean, median, and mode.
The mean is the sum of all values divided by the total number of values.
The median is the middle value when the data are ordered from lowest to highest (or highest to lowest).
The mode is the value that occurs most frequently in the data set (a computational sketch of these measures follows the variability section below).
Variability:
The range is the difference between the highest and lowest values in the data set.
The variance and standard deviation are measures of how much the data deviate from the mean.
A large variance or standard deviation indicates a high degree of variability or spread in the data.
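As a minimal sketch (Python standard library; the scores are made up for illustration), the central-tendency and variability measures above can be computed directly:

    # Central tendency and variability for a small, made-up set of test scores.
    import statistics

    scores = [70, 75, 75, 80, 85, 90, 95]

    print(statistics.mean(scores))       # mean: sum of values / number of values
    print(statistics.median(scores))     # median: middle value of the ordered data
    print(statistics.mode(scores))       # mode: most frequently occurring value (75)
    print(max(scores) - min(scores))     # range: highest value minus lowest value
    print(statistics.variance(scores))   # sample variance: spread around the mean
    print(statistics.stdev(scores))      # sample standard deviation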
Shape:
A skewed distribution has a longer tail on one side than the other.
Content and format: This describes the specific elements covered in the assessment, such as
types of questions, tasks, or activities. It can include multiple choice, essay writing, problem-
solving, simulations, or performance-based tasks. The chosen format influences how students
demonstrate their knowledge and skills.
Structure and organization: This refers to how the assessment is structured, including the
number of sections, weighting of different elements, and timing constraints. The structure can
guide students' focus and prioritize specific learning outcomes.
Level of complexity and difficulty: The assessment should be challenging enough to differentiate
student abilities while also being accessible and fair to all. The level of difficulty should be
appropriate for the target audience and learning goals.
Relationship:
A positive correlation means that as one variable increases, the other variable also increases.
A negative correlation means that as one variable increases, the other variable decreases.
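As a hedged illustration (plain Python; both data sets are invented), the Pearson correlation coefficient quantifies these relationships, coming out near +1 for a positive relationship and near -1 for a negative one:

    # Pearson correlation coefficient, computed from the definition.
    def pearson_r(x, y):
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
        sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
        return cov / (sd_x * sd_y)

    study_hours = [1, 2, 3, 4, 5]
    print(pearson_r(study_hours, [55, 60, 70, 75, 85]))   # positive: both rise together
    print(pearson_r(study_hours, [40, 35, 30, 22, 18]))   # negative: one falls as the other rises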
Internal consistency: This refers to the extent to which different parts of the assessment
measure the same thing. Consistent assessments ensure that students are not penalized due to
inconsistencies in the questions or tasks.
Validity: This indicates whether the assessment actually measures what it is intended to
measure. A valid assessment accurately reflects students' knowledge, skills, or understanding in
the targeted domain.
Reliability: This refers to the consistency of the assessment results. A reliable assessment yields
consistent results across different administrations and assessors, minimizing the influence of
random factors.
Relationship to learning goals: The assessment should be aligned with the learning objectives of
the course or program. It should measure what students are supposed to learn and not
introduce extraneous content or skills.
Understanding both the "shape" and "relationships" within an assessment is crucial for effective
evaluation and learning. By considering the content, structure, complexity, internal consistency, validity,
reliability, and alignment with learning goals, we can design assessments that accurately measure
student progress, provide valuable feedback, and ultimately support effective teaching and learning.
Raw scores:
Raw scores are the actual scores obtained by individuals on a test or assessment.
Raw scores can be difficult to interpret on their own, as they do not provide any information about how
well the individual performed relative to others.
Percentiles:
Percentiles are a way of expressing an individual's score relative to the scores of others who took the
same test or assessment.
Percentiles range from 1 to 99, with the 50th percentile representing the median score.
A percentile score of 70, for example, indicates that the individual scored higher than 70% of the people
who took the same test.
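A minimal sketch (Python; the norm-group scores are fabricated, and this uses one common definition of percentile rank, counting only scores strictly below the given score):

    # Percentile rank: the percentage of the comparison group scoring below a given score.
    # (Some conventions also count half of any tied scores.)
    def percentile_rank(score, group_scores):
        below = sum(1 for s in group_scores if s < score)
        return 100 * below / len(group_scores)

    norm_group = [55, 60, 62, 68, 70, 73, 75, 80, 85, 90]   # made-up scores
    print(percentile_rank(77, norm_group))                  # 70.0: higher than 70% of the group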
Standard scores:
Standard scores are a way of expressing an individual's score in relation to a standard or norm group.
Standard scores provide a more meaningful way of comparing scores across different tests or
assessments.
Z-scores:
Z-scores are a type of standard score that expresses an individual's score in terms of standard deviations
from the mean.
A z-score of 0 indicates a score that is equal to the mean, while a z-score of 1 indicates a score that is
one standard deviation above the mean.
Z-scores can be used to compare scores across different tests or assessments that have different scales
or units of measurement.
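A brief sketch (Python; the test means and standard deviations are hypothetical) shows how z-scores put two differently scaled tests on a common footing:

    # Comparing one student's performance on two differently scaled tests.
    def z_score(raw, mean, sd):
        return (raw - mean) / sd

    # Hypothetical norms: Test A has mean 500, SD 100; Test B has mean 20, SD 4.
    z_a = z_score(620, mean=500, sd=100)   # 1.2 SDs above the mean on Test A
    z_b = z_score(26, mean=20, sd=4)       # 1.5 SDs above the mean on Test B
    print(z_a, z_b)                        # the Test B result is relatively stronger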
Grade equivalents:
Grade equivalents are a way of expressing an individual's score in terms of the grade level or age level at
which that score is typical.
Grade equivalents can be useful for interpreting scores in educational settings, as they provide a way of
understanding how well a student is performing relative to their peers.
Interpretation:
It is important to interpret scores in the context of the specific test or assessment being used.
Interpretation should take into account factors such as the purpose of the test, the population being
tested, and the reliability and validity of the test.
Scores should be used as one piece of information in making decisions, and should not be relied on
exclusively to make important judgments about individuals.
Ipsative interpretation: This involves comparing an individual's scores to their own previous scores on
the same assessment. For example, if a student takes a reading test at the beginning and end of the
school year and their score improves from 60% to 80%, this suggests that they have made progress in their
reading abilities.
Profile interpretation: This involves looking at an individual's scores across multiple assessments or
subtests to identify patterns of strengths and weaknesses. For example, if a student takes a battery of
assessments and scores high in reading and writing but low in math, this suggests that they may have
stronger language skills than math skills.
Clinical interpretation: This involves using professional judgment and clinical experience to interpret
assessment scores and make recommendations for intervention or treatment. For example, a
psychologist might interpret a student's scores on a personality assessment and use this information to
guide therapeutic interventions.
Referencing schemes:
Referencing schemes are a way of interpreting test scores by comparing them to a reference group or
norm group.
The purpose of a referencing scheme is to provide a basis for interpreting individual test scores in terms
of how they compare to the scores of other people who took the same test.
Different referencing schemes may be used depending on the purpose of the test and the characteristics
of the population being tested.
Standard scores:
Standard scores are a type of referencing scheme that express an individual's score in relation to a
standard or norm group.
Many standard scores (for example, deviation IQ scores) have a mean of 100 and a standard deviation of 15 or 16, which allows for easy comparison of scores across different tests or assessments.
Z-scores:
Z-scores, described above, express an individual's score in terms of standard deviations from the mean and can be used to compare scores across tests or assessments with different scales or units of measurement.
T-scores:
T-scores are a type of standard score that have a mean of 50 and a standard deviation of 10.
T-scores are commonly used in educational and clinical settings to compare scores on tests of academic
or cognitive functioning.
A T-score of 50 indicates an average score, while scores above 50 indicate above-average performance
and scores below 50 indicate below-average performance.
Stanines:
Stanines are a type of standard score that range from 1 to 9, with a mean of 5 and a standard deviation
of 2.
Stanines are often used in educational settings to provide a simple and easily interpretable way of
grading or evaluating student performance.
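The definitions above translate directly into conversion formulas: T = 50 + 10z, and a stanine is approximately 5 + 2z, rounded and clipped to the 1-9 range. A minimal Python sketch:

    # Converting z-scores to T-scores and stanines using the definitions above.
    def to_t_score(z):
        return 50 + 10 * z                        # mean 50, standard deviation 10

    def to_stanine(z):
        return min(9, max(1, round(5 + 2 * z)))   # mean 5, SD 2, clipped to 1-9

    for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"z = {z:+.1f}  ->  T = {to_t_score(z):.0f}, stanine = {to_stanine(z)}")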
Conclusion:
Referencing schemes and standard scores provide a useful way of interpreting test scores and
comparing performance across different tests or assessments.
It is important to choose the appropriate referencing scheme or standard score based on the purpose of
the test and the characteristics of the population being tested.
Standard scores should be used as one piece of information in making decisions, and should not be
relied on exclusively to make important judgments about individuals.
Validity:
Validity refers to the extent to which an assessment measures what it is intended to measure.
A valid assessment should be designed to measure the specific knowledge, skills, or abilities that it is
intended to measure, and should not be affected by extraneous factors.
Reliability:
Reliability refers to the consistency and stability of assessment results over time and across administrations.
A reliable assessment should produce consistent results when administered to the same individual multiple times, and should yield consistent scores regardless of who administers or scores it.
Standardization:
Standardization refers to the use of consistent and uniform procedures for administering, scoring, and
interpreting assessments.
Standardization helps to ensure that scores obtained on an assessment are comparable across different
individuals, settings, and time periods.
Objectivity:
Objectivity refers to the degree to which an assessment is free from subjective bias or personal opinions.
Objective assessments should be based on objective criteria and procedures, and should not be
influenced by personal beliefs or attitudes of the assessor.
Feasibility:
Feasibility refers to the practicality of an assessment in terms of the time, cost, and resources it requires.
A feasible assessment should be easy to administer, score, and interpret, and should not require excessive time or resources to implement.
Utility:
Utility refers to the usefulness and relevance of an assessment for its intended purposes.
A useful assessment should provide meaningful and relevant information about the knowledge, skills, or
abilities being assessed, and should be applicable to real-world situations.
Fairness:
Fairness refers to the extent to which an assessment procedure treats all individuals fairly and equally,
regardless of their background or personal characteristics.
A fair assessment should be free from biases based on gender, race, ethnicity, or other personal
characteristics, and should be designed to provide equal opportunities for all test-takers.
Overall, an assessment procedure that possesses these desirable characteristics is likely to be more
effective in providing accurate and meaningful information about an individual's knowledge, skills, or
abilities.
The validity of a norm-referenced test refers to the extent to which the test accurately measures the
construct it is intended to measure and the extent to which the test scores can be meaningfully
compared to the scores of other individuals.
Construct validity can be established by examining the relationship between test scores and other
measures of the same construct, such as other tests or assessments.
Norm-referenced tests should also demonstrate content validity, meaning that the test should cover all
aspects of the construct being measured.
The reliability of a norm-referenced test refers to the consistency and stability of the test scores over
time and across different test-takers.
Test-retest reliability can be established by administering the same test to the same group of individuals
at two different points in time and comparing the scores.
Internal consistency reliability can be established by examining the degree to which the different items
on the test are measuring the same construct.
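Test-retest reliability is simply the correlation between the two administrations (as in the earlier correlation sketch). For internal consistency, Cronbach's alpha is a widely used estimate; it is not named in the text above, so the following is a hedged sketch (Python; the item scores are invented):

    # Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total scores).
    import statistics

    def cronbach_alpha(items):
        # items: one list of scores per test item, all lists of equal length
        k = len(items)
        totals = [sum(person) for person in zip(*items)]   # total score per test-taker
        item_variance_sum = sum(statistics.variance(item) for item in items)
        return k / (k - 1) * (1 - item_variance_sum / statistics.variance(totals))

    item1 = [3, 4, 5, 2, 4]   # five test-takers' scores on item 1
    item2 = [2, 4, 5, 3, 4]
    item3 = [3, 5, 4, 2, 5]
    print(cronbach_alpha([item1, item2, item3]))   # ~0.89 for these made-up data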
The validity of a criterion-referenced test refers to the extent to which the test accurately measures the
specific knowledge, skills, or abilities that it is intended to measure.
Criterion-related validity can be established by examining the relationship between test scores and
other measures of the same construct, such as performance on a related task or job.
The reliability of a criterion-referenced test refers to the consistency and stability of the test scores over
time and across different test-takers.
Test-retest reliability can be established by administering the same test to the same group of individuals
at two different points in time and comparing the scores.
Inter-rater reliability can be established by examining the degree to which different raters or evaluators
agree on the scores assigned to the same test-taker.
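A brief sketch (Python; the pass/fail ratings are fabricated) of two common agreement indices: raw percent agreement, and Cohen's kappa, which corrects for chance agreement (kappa is a standard choice, though not named in the text above):

    # Inter-rater agreement between two raters on the same six test-takers.
    from collections import Counter

    rater1 = ["pass", "pass", "fail", "pass", "fail", "pass"]
    rater2 = ["pass", "fail", "fail", "pass", "fail", "pass"]

    n = len(rater1)
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n   # raw agreement

    c1, c2 = Counter(rater1), Counter(rater2)
    p_chance = sum(c1[cat] * c2[cat] for cat in c1) / n ** 2       # expected by chance

    kappa = (p_observed - p_chance) / (1 - p_chance)               # Cohen's kappa
    print(p_observed, kappa)                                       # 0.83 and 0.67 here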
Overall, both norm-referenced and criterion-referenced tests should be designed to demonstrate high
levels of validity and reliability in order to ensure that the test scores are accurate and meaningful.
The unit-weighting model involves giving equal weight to each assessment method used in the
evaluation.
This model assumes that all assessment methods are equally valid and reliable, and that each method
contributes equally to the overall evaluation.
The compensatory model involves weighting each assessment method according to its relative importance in the overall evaluation.
This model assumes that some assessment methods may be more important than others in assessing
certain skills or knowledge areas, and that weaker performance on one assessment method can be
compensated for by stronger performance on another assessment method.
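A minimal sketch (Python; the component scores and weights are entirely hypothetical) contrasting the two models above:

    # Unit-weighting vs. a compensatory (weighted) composite for one student.
    scores = {"exam": 85, "project": 70, "participation": 95}

    unit_weighted = sum(scores.values()) / len(scores)              # every method counts equally

    weights = {"exam": 0.5, "project": 0.3, "participation": 0.2}   # hypothetical importance
    compensatory = sum(scores[k] * weights[k] for k in scores)      # weights sum to 1

    print(unit_weighted, compensatory)   # a strong exam can offset the weaker project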
The profile model involves using a combination of assessment methods to create a profile of an
individual's strengths and weaknesses in different areas.
This model assumes that each assessment method provides unique information about the individual
being assessed, and that combining multiple methods can provide a more comprehensive understanding
of the individual's skills and abilities.
The hierarchical model involves using different assessment methods to assess different levels of a
hierarchy of skills or knowledge.
This model assumes that some assessment methods are better suited to assessing basic knowledge and
skills, while others are better suited to assessing more advanced or complex knowledge and skills.
The integrative model involves using a combination of assessment methods to create a holistic and
integrated understanding of the individual being assessed.
This model assumes that all assessment methods are interrelated and that combining multiple methods
can provide a more complete understanding of the individual's skills, abilities, and overall performance.
Convergent Model: This model involves using multiple assessment methods that measure the same
construct to get a more comprehensive understanding of the individual being assessed. For example, if
you were assessing a student's writing skills, you might use both a standardized writing test and a
writing sample from a class assignment to get a more complete picture of their abilities.
Complementary Model: This model involves using different assessment methods that provide unique
information about an individual's abilities or characteristics. For example, you might use a personality
inventory to assess a student's personality traits and a performance-based assessment to assess their
skills in a specific area. By combining these different types of assessments, you can gain a more well-
rounded understanding of the individual.
Multitrait-Multimethod Model: This model involves using multiple assessment methods to measure
multiple constructs. For example, if you were assessing a student's academic abilities, you might use a
standardized test to measure their reading and math skills, a writing sample to measure their writing
skills, and a teacher rating scale to measure their social-emotional skills. By using multiple methods to
measure multiple constructs, you can get a more comprehensive understanding of the individual being
assessed.
It's worth noting that there are many different models and approaches to combining assessment
information, and the specific model used will depend on the goals of the assessment and the
information being gathered.
Overall, each model of combining assessment information from different assessment methods has its
own strengths and limitations, and the choice of model will depend on the specific goals and purposes
of the evaluation. It is important to carefully consider the validity, reliability, and practicality of each
assessment method used, as well as the potential biases and limitations associated with each model of
combining assessment information.
Standardization:
Standardization refers to the process of establishing a set of norms, procedures, and rules for
administering and scoring an assessment to ensure consistency and fairness in the results obtained. The
goal of standardization is to ensure that the results of an assessment accurately reflect the knowledge,
skills, or abilities being measured, rather than being influenced by factors such as the testing
environment, test-taker characteristics, or the personal biases of the examiner.
Norms: Norms are the reference points used to interpret the scores obtained from an assessment.
Norms are typically established by administering the assessment to a representative sample of
individuals who are similar to the intended test-takers. The scores obtained from this sample are then
used to establish a standard against which the scores of the test-takers can be compared.
Scoring: Standardization of scoring involves using objective criteria to score the assessments and
establishing rules for partial credit or deduction of points. The scoring rubric should be clearly defined
and consistent for all test-takers.
Interpretation: Standardization of interpretation involves ensuring that the scores obtained from the
assessment are accurately interpreted and reported to the test-taker, their parents/guardians, or other
stakeholders. Interpretation should be based on established norms and should take into account the
limitations and strengths of the assessment.
Standardization is essential to ensure that assessments are fair, reliable, and valid, and that the results
accurately reflect the knowledge, skills, or abilities being measured. Standardization helps to reduce the
influence of extraneous factors on test scores, and ensures that the scores obtained are comparable
across individuals and over time.
Standardized Achievement Tests:
Examples include the Scholastic Aptitude Test (SAT) and the American College Testing (ACT). These tests
are designed to measure the academic knowledge and skills of students and are administered under
standardized conditions. Norms are established based on the scores obtained from a representative
sample of test-takers, and scores are reported as percentile ranks to allow for meaningful interpretation.
Standardized Aptitude Tests:
Examples include the Graduate Record Examination (GRE) and the Law School Admission Test (LSAT).
These tests are designed to measure an individual's potential for success in specific academic or
professional domains. Norms are established based on the scores obtained from a representative
sample of test-takers, and scores are reported as percentile ranks or scaled scores.
Standardized Personality Tests:
Examples include the Minnesota Multiphasic Personality Inventory (MMPI) and the California
Psychological Inventory (CPI). These tests are designed to measure an individual's personality traits and
are administered under standardized conditions. Norms are established based on the scores obtained
from a representative sample of test-takers, and scores are reported as T-scores or percentile ranks.
Standardized Diagnostic Tests:
Examples include medical tests such as blood glucose tests and diagnostic imaging such as X-rays. These
tests are designed to identify the presence or absence of specific medical conditions and are
administered under standardized conditions. Norms are established based on the scores obtained from
a representative sample of individuals without the medical condition being tested for, and scores are
reported as positive or negative based on established cut-off scores.
Standardized Cognitive and Motor Tests:
Examples include the Woodcock-Johnson Tests of Cognitive Abilities and the Kaufman Assessment
Battery for Children. These tests are designed to measure an individual's cognitive or motor skills and
are administered under standardized conditions. Norms are established based on the scores obtained
from a representative sample of test-takers, and scores are reported as percentile ranks or standard
scores.
Standardization is essential to ensure the reliability, validity, and fairness of assessments across different
domains.