Qualities of classroom observation systems

Courtney A. Bell (a), Marjoleine J. Dobbelaer (b), Kirsti Klette (c), and Adrie Visscher (b)

(a) Center for Global Assessment, Educational Testing Service, Princeton, NJ, USA; (b) ELAN, Department of Teacher Development, Faculty of Behavioural, Management and Social Sciences, University of Twente, Enschede, The Netherlands; (c) Department of Teacher Education and School Research, Faculty of Educational Sciences, University of Oslo, Oslo, Norway

School Effectiveness and School Improvement, 30(1), 3-29. doi:10.1080/09243453.2018.1539014
ABSTRACT
Observation systems are increasingly used around the world for a variety of purposes; 2 critical purposes are to understand and to improve teaching. As observation systems differ considerably, individuals must decide what observation system to use. But the field does not have a common specification of an observation system, nor does it have systematic ways of thinking about how observation systems are similar and different. Given this reality and the renewed global interest in observation systems, this article first defines the observation system concept and then presents a framework through which to understand, categorize, and compare observation systems. We apply the framework to 4 well-known observation systems that vary in important ways. The article concludes with a discussion of the results of the application of the framework and some important implications of those findings.

KEYWORDS
Teacher evaluation; teaching evaluation; teaching quality; classroom observation; observation systems
Introduction
Observation systems are used around the world for a variety of purposes. Two critical
purposes are to understand and improve teaching. Scholars often seek to understand
teaching by identifying dimensions of teaching and investigating how those dimensions
contribute to valued outcomes such as student learning or students’ motivation (e.g.,
Decristan et al., 2015). They also seek to use observation systems to improve teaching. In
order to improve teaching, one must first measure it and understand it. This means that
the scores from observation systems can be used to provide feedback and coaching to
teachers as well as to evaluate interventions hypothesized to affect teaching (e.g., Kraft
& Blazar, 2017). But when individuals set out to understand and/or improve teaching,
they face many choices. For example, should they use a system that can be used across
school subjects, a so-called “generic” system, or one that is subject specific? Should they
select a system that produces more narrow and detailed information or one that
produces more global, summary information? To what degree do existing systems
serve the specific purposes the individual has in mind?
The scoring tools in an observation system specify which dimensions of teaching will
be measured. These tools include the scales themselves – both the teaching practices
being assessed and the number and definition of the score points (e.g., present/not
present, a 3-point criterion-referenced scale or rubric). Because observation scales are
designed to measure complex human interactions, raters come to understand the scales
through videos (and/or text-based descriptions) of teaching that have been rated by
someone who understands the scales and score point distinctions. These video- and
text-based descriptions show raters how the words of the scoring scales are embodied
in teachers’ and students’ words and actions.
As has been documented in some observation systems, human rating of teaching is
prone to being unreliable and inaccurate, especially when coding certain aspects of
teaching practice such as intellectual challenge or cognitive activation (e.g., T.J. Kane &
Staiger, 2012; Decristan et al., 2015; Praetorius, Pauli, Reusser, Rakoczy, & Klieme, 2014).
Therefore, it is very important for observation systems to have rating quality procedures
(Park, Chen, & Holtzman, 2014). These procedures are used to ensure that raters are well
trained and are able to use the rating scales accurately and reliably over time. A
common quality procedure is the formal training and certification of raters.
Certification tests often mimic the work raters will do in studies or in practice. For
example, raters might be required to take and pass a certification test in which they
rate a lesson, and their ratings must agree exactly with master ratings on 80% of the
rating scales. Another common procedure is double scoring, the practice of having two
raters independently assign ratings to the same lesson in order to compute inter-rater
agreement metrics.
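To make these two procedures concrete, the following short sketch shows how exact agreement with master ratings and a pass/fail certification decision might be computed. It is an illustrative example only: the function names, the toy data, and the fixed 80% threshold (taken from the example above) are ours, not part of any particular observation system.

```python
# Minimal sketch (not from the article): exact agreement between a candidate
# rater and master ratings, plus a hypothetical 80% certification rule.

from typing import Sequence

def exact_agreement(rater: Sequence[int], master: Sequence[int]) -> float:
    """Proportion of rating scales on which the two sets of ratings match exactly."""
    if len(rater) != len(master):
        raise ValueError("Rating vectors must cover the same scales.")
    matches = sum(r == m for r, m in zip(rater, master))
    return matches / len(master)

def passes_certification(rater: Sequence[int], master: Sequence[int],
                         threshold: float = 0.80) -> bool:
    """True if the candidate agrees exactly with the master ratings on at least
    `threshold` of the scales (the 80% rule described in the text)."""
    return exact_agreement(rater, master) >= threshold

# Example: 10 scales rated on a 4-point scale; 9 of 10 exact matches -> pass.
master_ratings = [3, 2, 4, 1, 3, 3, 2, 4, 2, 3]
candidate_ratings = [3, 2, 4, 1, 3, 3, 2, 4, 2, 2]
print(exact_agreement(candidate_ratings, master_ratings))      # 0.9
print(passes_certification(candidate_ratings, master_ratings)) # True
```

The same agreement function could be reused for double scoring, comparing two independent raters of the same lesson rather than a candidate against a master rater.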
Finally, sampling specifications are the details around how the observations sample
from the larger domain to which the ratings are intended to generalize (Joe et al., 2014).
These specifications include, but are not limited to, the number of observations con-
ducted for a reliable estimate of teaching quality, the length of time of those observa-
tions, the frequency with which raters assign ratings (e.g., every 10 min, every 30 min),
and how lessons are sampled from the unit of analysis. For example, for a primary
teacher, how does a four-lesson sample used by researchers vary across the subjects
that teacher might teach? Are there only language and mathematics lessons? Are all
lessons from April and May, or are they sampled from the entire school year? These and
other similar questions are addressed in the sampling specifications of an observation
system.
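As an illustration of how such sampling specifications might be made explicit and checked, the sketch below encodes a hypothetical specification as a small data structure and verifies a proposed lesson sample against it. All field names, subjects, dates, and thresholds are assumptions made for the example, not requirements of any actual observation system.

```python
# Hypothetical sketch: encode sampling specifications as data and check whether
# a proposed lesson sample satisfies them. Values are illustrative only.

from dataclasses import dataclass
from datetime import date

@dataclass
class SamplingSpec:
    lessons_per_teacher: int      # e.g., 4 observations for a stable estimate
    minutes_per_observation: int  # length of each observation
    rating_interval_minutes: int  # how often raters assign ratings
    subjects: tuple               # subjects from which lessons must be drawn
    window_start: date            # sampling window across the school year
    window_end: date

@dataclass
class Lesson:
    subject: str
    lesson_date: date
    duration_minutes: int

def sample_meets_spec(lessons: list, spec: SamplingSpec) -> bool:
    """Check count, subject coverage, observation length, and time window."""
    if len(lessons) < spec.lessons_per_teacher:
        return False
    if not set(spec.subjects).issubset({les.subject for les in lessons}):
        return False  # e.g., only mathematics lessons when language is also required
    return all(spec.window_start <= les.lesson_date <= spec.window_end
               and les.duration_minutes >= spec.minutes_per_observation
               for les in lessons)

spec = SamplingSpec(4, 30, 10, ("mathematics", "language"),
                    date(2018, 9, 1), date(2019, 6, 30))
lessons = [Lesson("mathematics", date(2018, 10, 3), 45),
           Lesson("language", date(2019, 1, 15), 50),
           Lesson("mathematics", date(2019, 4, 10), 45),
           Lesson("language", date(2019, 5, 22), 40)]
print(sample_meets_spec(lessons, spec))  # True
```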
Given this description of an observation system, in what follows, we propose a
framework to guide considerations of existing observation protocols, hereafter referred
to as observation systems. Our framework hypothesizes eight aspects of observation
systems, which might be used to better understand the affordances and constraints of
any such observation system (see Table 1). We then use these eight aspects of observa-
tion systems to consider four different observation systems. In doing so, we hope to
show how observation systems can be considered side by side, thereby contributing to
the field’s meta-knowledge of observation systems.
In order to improve teaching, one must have a theory of improvement (Van Veen,
Zwart, & Meirink, 2012). Observation systems rarely have such theories embedded within
them; however, all observation systems parse teaching in specific ways, valuing one
grouping of teaching practices over an alternative grouping (Praetorius & Charalambous, 2018).
Dimensions of teaching
Observation systems include dimensions of teaching that are considered to be indicators
of teaching quality. The assumption generally is that the better a teacher scores on these
indicators, the better the teaching, and, therefore, the more his/her students will learn.
Some frequently used indicators in observation systems originate in the process-product
studies of teaching (e.g., classroom management, clear explanation of subject matter;
Brophy & Good, 1986). Others come from other strands of research, for example, the
TIMSS studies (e.g., cognitive activation; Baumert et al., 2010; Hiebert & Grouws, 2007),
research on assessment for learning (Black & Wiliam, 1998), self-regulation (Zimmerman,
1990), and instructional differentiation (e.g., Tomlinson, Brimijoin, & Narvaez, 2008).
We present dimensions of teaching quality frequently included in classroom obser-
vation systems based on two reviews of classroom observation systems. In the first, a
comprehensive search for classroom observation systems was conducted by means of a
5-step search strategy based on Littell, Corcoran, and Pillai (2008), including a sys-
tematic literature review and contacting experts in the field. The main inclusion
criteria concerned whether the systems were developed for measuring teaching
quality in primary education and were published after 1990 in English or in Dutch.
Also, research into reliability and validity had to be conducted in primary education,
and the systems had to provide useful data for practitioners in the field. The 27
classroom observation systems that met the criteria were reviewed by two reviewers
(Dobbelaer & Visscher, 2018).
The other, less systematic, reviews (Charalambous & Praetorius, 2018; Klette &
Blikstad-Balas, 2018) summarize and organize existing frameworks based on the distinction
between generic versus subject-specific frameworks (Charalambous & Praetorius, 2018),
on the conceptual framings and vocabulary used (Praetorius, Klieme, Herbert, & Pinger, 2018),
and/or on system characteristics along the aspects in our framework.
● Safe and stimulating classroom climate: This dimension refers to the degree to
which teachers and students respect one another, communicate with each other in a
supportive way, and together create a safe and positive classroom climate in which
student learning is promoted (e.g., Danielson, 2013; Saginor, 2008).
● Classroom management: Classroom management reflects the degree to which
teachers and students manage their behavior and time in such a way that learning
can be productive. In a well-managed class, little time and energy are lost on
activities that are not learning oriented (Marzano, Marzano, & Pickering, 2003;
Wang, Haertel, & Walberg, 1993).
● Involvement and motivation of students: This dimension is about the extent to
which teachers involve all students actively in classroom learning activities, and
how much students participate in classroom learning activities (Rosenshine, 1980;
Schacter & Thum, 2004).
● Explanation of subject matter: How clearly teachers explain the subject matter to
be learned to their students is crucial for how much students learn. Clear explana-
tions include clear specification of lesson objectives to students, reviewing previous
learning, the use of clear language, presenting information in an orderly manner,
presenting vivid and appealing examples, checking for understanding, and the
frequent restatement of essential principles (Schacter & Thum, 2004; Van de Grift,
2007).
● Quality of subject-matter representation: Quality is influenced here by the richness
(e.g., multiple representations of subject matter), precision, and accuracy of the
subject matter. Strong representations provide opportunities to learn the subject-
matter practices (e.g., problem solving, argumentation) as well as the significant
organizing ideas and procedures of that subject matter (Hill et al., 2008).
● Cognitive activation: Developing a deep understanding of how the various parts of the
subject matter relate to and connect with each other requires that teachers
activate students’ deep thinking by means of questions, appropriate assignments,
classroom discussions, and other pedagogical strategies (Baumert et al., 2010;
Osborne et al., 2015).
● Assessment for learning: Assessment for learning is characterized by a cycle of
communicating explicit assessment criteria, collecting evidence of student under-
standing of subject matter, and providing feedback to students that moves their
learning forward (Black & Wiliam, 1998, 2010).
● Differentiated instruction: Teachers differentiate their teaching to the degree they
adapt subject matter, the explanation of subject matter, students’ learning time,
and the assignments to the differences between students (Keuning et al., 2017;
Tomlinson, 2004).
● Teaching learning strategies and student self-regulation: This dimension is about
teachers (a) explicitly modeling, scaffolding, and explaining learning strategies to
students, which students can use to perform higher level operations (e.g., teaching
heuristics, thinking aloud when solving problems, using checklists) (Carnine, Dixon,
& Silbert, 1998; Slavin, 1996), and (b) encouraging students to self-regulate and
monitor their own learning process in light of the learning goals (Boekaerts,
Pintrich, & Zeidner, 2000; Muijs et al., 2014; Zimmerman, 1990). Teachers who
explicitly model, scaffold, and explain strategies, give corrective feedback, and ensure
that children master the material taught contribute substantially to their pupils’
academic success.
While all of these dimensions of teaching are fundamental to students’ learning and
development, each dimension can be operationalized differently across observation
systems. Further, observation systems vary in the degree to which they capture all
dimensions or target specific dimensions.
Subject specificity
There is widespread agreement about the importance of the subject-matter specificity
of teaching quality (Seidel & Shavelson, 2007); however, there is less agreement about
how to measure this aspect of teaching practice. Several observation systems have
Grain size
Related to a subject-specific or more generic focus, there is also the issue of grain
size: how discrete/targeted practices are to be coded (Hill & Grossman, 2013). This
issue has been addressed in observation studies for decades (Brophy & Good, 1986;
Flanders, 1970). In some newer systems (e.g., CLASS and PLATO), consensus has been
reached on a set of core activities (12 for both CLASS and PLATO). This stands in
contrast to earlier systems that included a long list of activities to score (Scheerens,
2014). Thus, the number of domains and elements to be scored is a feature that
varies across systems.
A system’s grain size may be related to the number of scale points (e.g., when
measuring a practice such as the presence of a lesson objective, this might be rated
on a dichotomous scale – present or absent). However, the number of scale points
should not be assumed to be an indicator of score quality (e.g., reliability, variation, etc.).
Matters of score quality are best addressed through a compelling argument that relies
on multiple sources of validation evidence (M. Kane, 2006).
Whether to score the whole lesson or segments of the lesson is a related aspect of
grain size. One might imagine observation systems that seek to code smaller grain
sizes, that is, narrower teaching practices, might segment the lesson many times so
that narrow behaviors can be accurately documented throughout a lesson (e.g., MQI).
Alternatively, observation systems using more holistic codes requiring the rater to
judge multiple interrelated practices might segment at larger intervals (e.g., 20 min
or a whole lesson) so that the ratings reflect all of the interrelated practices (e.g.,
ICALT).
The decisions about what grain size to capture are further shaped by the rhythm and
pace of instruction. Activities are not always equally probable in every segment of a
lesson. For example, while instructional purpose may be central to the beginning of a
lesson, it may be less central towards the end of the lesson. The degree of lesson
segmentation necessary for a specific grain size of practice being scored is a decision
made by system designers (Klette, Blikstad-Balas, & Roe, 2017) and is often
undocumented.
Scoring procedures
Classroom observation systems differ in their scoring procedures, sampling procedures,
and preparation of raters. The choices made by developers for these three aspects
influence the reliability and validity of the observation scores. We describe each in
turn.
Sampling procedures
Classroom observation systems are developed for one or more of the following pur-
poses: promoting teacher learning, teacher evaluation, or developing research insights.
Given these purposes, lessons are sampled in different ways. The lesson’s subject matter
and type (e.g., an introductory or a practice lesson) may be specified by the system. The
observations can be conducted live or on video, be announced or unannounced, and
they can vary in length.
Sampling of the lesson can be specified even further: for example, whether the
observer should walk around, talk with students or not during an observation, which
part of the lesson should be observed, how many observation cycles should be con-
ducted, and when the observation should be conducted across days, weeks, or the
school year.
Scoring procedures
Observation systems differ in how rating procedures and scoring rules are carried out.
The number of observations and segments to be scored, the degree to which
lessons are double rated, and whether ratings are checked systematically by master
raters for accuracy are just some of the rating procedures relevant to the validity
of the system. Scoring rules concern how ratings are aggregated across units (e.g.,
segments, lessons, teachers) and across raters (e.g., averaging discrepant ratings, taking
the highest rating), as well as rounding rules, various scoring models (e.g., averaging
ratings across segments and lessons to the teacher level, using IRT models to create
teacher scores), and rules regarding dropping ratings.
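The sketch below illustrates one hypothetical set of such scoring rules: discrepant ratings are averaged across raters within a segment, segment scores are averaged to the lesson level, and lesson scores are averaged and rounded to a teacher-level score. Real systems may instead keep the higher rating, use different rounding rules, or fit IRT models; the functions and toy data here are ours.

```python
# Hypothetical scoring rules: average discrepant ratings across raters within a
# segment, aggregate segments to lessons and lessons to a rounded teacher score.

from statistics import mean

def segment_score(ratings_by_rater: list[float]) -> float:
    """Resolve discrepant ratings for one segment by averaging across raters."""
    return mean(ratings_by_rater)

def lesson_score(segments: list[list[float]]) -> float:
    """Aggregate segment scores to a lesson-level score."""
    return mean(segment_score(s) for s in segments)

def teacher_score(lessons: list[list[list[float]]], digits: int = 1) -> float:
    """Aggregate lesson scores to a teacher-level score, with a rounding rule."""
    return round(mean(lesson_score(l) for l in lessons), digits)

# Two lessons, each with two segments, double scored by two raters on a 4-point scale.
lesson_a = [[3, 4], [2, 3]]   # segment means 3.5 and 2.5 -> lesson score 3.0
lesson_b = [[4, 4], [4, 3]]   # segment means 4.0 and 3.5 -> lesson score 3.75
print(teacher_score([lesson_a, lesson_b]))  # 3.4
```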
Preparation of observers
Raters are usually trained using manuals that provide insight into the theoretical basis of
the system, the meaning of the items and scales, and the scoring rules. Training can also
give raters opportunities to practice by observing and scoring videos during the
training. Certification of raters may be required, as well as recertification after a
specified period. It is also critical that raters produce accurate and unbiased
scores across teachers so that teachers can improve.
Empirical evidence
The validity of the content of observation systems is likely to vary across systems. As was stated in
the dimensions of teaching section, the assumption is that the dimensions of teaching
included in observation systems reflect teaching quality. A critical criterion for teaching
quality is how much students learn and develop. Thus, it is important to understand the
extent to which the assumed relation between the teaching quality indicators and
student learning has been confirmed empirically. In other words, what is the nature
and quality of the research upon which the indicators are based? This is often consid-
ered empirically by testing the degree to which scores from a particular observation
system, which includes specific dimensions of teaching, are associated with student
outcomes (e.g., Decristan et al., 2015) or statistically derived measures of teaching
quality such as value-added models (e.g., Bell et al., 2012; T.J. Kane & Staiger, 2012).
Despite the desire to treat predictive validation studies as the gold standard of empirical
evidence, such studies face many problems, including confounded causal mechanisms,
inadequate accounting for prior learning and other school factors that shape teaching
and learning (e.g., curriculum), and inappropriate outcome measures.
While predictive evidence is important, M. Kane (2006) argues that we must consider
the validity of any system in the form of a clear validity argument. Such an argument
specifies the inferences necessary to move from observation ratings to inferences about
the sample studied (often the quality of teaching in a given timeframe with a specific
group of students), all the way to the inferences at the domain level (all of a teacher’s
teaching in a given year with all the students they taught). In one application of M. Kane’s approach,
US researchers specified empirical evidence that ranges from the quality of the scoring
inference to predictive validity (Bell et al., 2012). Evidence might include details regard-
ing the training and monitoring of raters, inter-rater reliability, specification of sources of
variance, factor analyses, convergent validity evidence, and correlations to measures of
student learning (e.g., value-added models) or development.
There are many types of empirical evidence that can be brought together into a
validity argument. When judging the quality of validation evidence, we often must take
account of the specific score use associated with the argument. For example, if scores
are used to better understand the relationship between teaching and learning, perhaps
evidence from scores created through IRT models would be more precise and compel-
ling, whereas when providing teachers with feedback quickly, we might prefer evidence
from scores created through averaging up to the lesson level because that type of score
will be used in schools. The most compelling empirical evidence will vary with the
specific inferences and score uses under consideration (M. Kane, 2006).
Developmental continuum
Related to the quality of the empirical evidence available for an observation system,
observation systems can be placed on a developmental continuum. It takes time to
develop a strong system and gather information about valid and reliable uses of the
system. Indicators of the stage of development of the system are the year of develop-
ment, whether the system was pilot tested, the number of versions, the last published
version, whether research was done into the valid and reliable use of the system by the
developers, and whether people outside the development team have used or
researched the system.
Table 2. (Continued).

6. Scoring procedures
  ICALT: Scoring is based on the observation of a full lesson. Rater training is available, but there is no manual or general guidelines regarding the number of observations or observers for a specific use.
  CLASS (K-3 and UE): Raters observe CLASS in cycles: 15–20 min of observation and 10 min of rating the dimensions. The number of observation cycles depends on the use. Raters must prove inter-rater reliability and obtain certification. A rater manual is available.
  TIMSS: Each video is reviewed 7 times by coders. Raters are certified and monitored.
  PLATO: Raters observe PLATO in cycles: 15 min of observation and 8–10 min of rating the elements. Raters must prove reliability and obtain certification. An online training facility is available.

7. Empirical evidence
  All systems: The empirical evidence for each system has not been summarized here given space considerations. Please refer to the text for citations that begin to show the nature of the empirical evidence for each system.

8. Developmental continuum
  ICALT: Moderate use outside of original developers and researchers.
  CLASS (K-3 and UE): Extensive use outside of original developers and researchers.
  TIMSS: Limited use outside of original developers and researchers.
  PLATO: Limited to moderate use outside of original developers and researchers.

*For more information on exactly how each observation system was aligned by the authors to the nine teaching dimensions, see Appendix 1.
instruction. ICALT is used for research purposes and as a system for teacher professional
development, across a variety of subjects as well as grades.
Of the nine teaching dimensions presented earlier, only the dimension about subject-
matter representation is not covered in ICALT. The 32 items focused on teacher behavior
are divided over six scales: safe and stimulating learning climate, efficient classroom
management, quality of instruction, teaching learning strategies, stimulating learning
environment, and adaptation of teaching to diverse student needs. One additional scale
called student engagement contains three items that focus on student behavior.
Together, ICALT measures both teacher and student behavior, with an emphasis on
the former. The indicators were derived from reviews of research on the relationship
between teaching characteristics and the academic achievements of pupils.
ICALT is a high-inference system, and scores are based on a whole lesson. All quality
indicators are scored on a 4-point scale ranging from predominantly weak to predomi-
nantly strong. In the system, examples of good practices are provided for each quality
indicator to assist observers in making the judgments. Observers can indicate whether
these good practices were present or not during the lesson and, based on this informa-
tion, they make a quality judgment about the relatively small grain-sized indicators at
the end of the lesson. There are no required scoring rules for computing a score.
Methods for analyzing the data range from computing a standardized scale score to
using IRT.
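As a rough illustration of the first of these analysis methods, the sketch below computes a simple standardized scale score from a set of 4-point indicator ratings by averaging the indicators within a scale and z-standardizing against a reference sample of lessons. The procedure and numbers are assumptions for illustration, not ICALT's documented scoring rules; IRT-based scoring, also mentioned above, would instead model the indicator responses directly.

```python
# Illustrative sketch (details assumed): a simple standardized scale score from
# 4-point indicator ratings, standardized against a reference sample of lessons.

from statistics import mean, stdev

def scale_score(indicator_ratings: list[int]) -> float:
    """Average the 4-point ratings of the indicators belonging to one scale."""
    return mean(indicator_ratings)

def standardized_scale_score(indicator_ratings: list[int],
                             reference_scores: list[float]) -> float:
    """z-standardize a scale score against scores from a reference sample."""
    ref_mean, ref_sd = mean(reference_scores), stdev(reference_scores)
    return (scale_score(indicator_ratings) - ref_mean) / ref_sd

# Example: a 5-indicator scale rated 1-4, standardized against other observed lessons.
reference = [2.4, 2.8, 3.0, 3.2, 2.6, 3.4]
print(round(standardized_scale_score([3, 3, 4, 2, 3], reference), 2))
```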
Observers can become certified if they are able to rate a lesson in a way similar to
master observers. There is no general manual available for the use of ICALT, and
training opportunities are not offered on a regular basis. However, training is avail-
able upon request, and the system authors can be contacted for information about
the system.
Research into ICALT has mainly occurred in secondary education. Confirmatory factor
analysis supported the six scales (Van de Grift, Van der Wal, & Torenbeek, 2011). Rasch
analyses have been conducted to place all quality indicators on a Rasch scale such that
teachers can be trained in their zone of proximal development (e.g., Van der Lans, Van
de Grift, & Van Veen, 2018). Multilevel analyses showed a relation between ICALT and
students’ academic engagement (Maulana, Helms-Lorenz, & Van de Grift, 2017). The University of
Groningen (RUG) also continues to conduct research into reliability aspects of ICALT (e.g., Van der Lans,
Van de Grift, Van Veen, & Fokkens-Bruinsma, 2016) and has recently started a new international
project on teaching quality, ICALT3.
ICALT has been under development since 2002. The first academic paper was published in
2007, and ICALT has since been further developed into the current version. ICALT
has been used by researchers outside the development team in The Netherlands and
abroad. It is used by practitioners, and the previous observation instrument used by the
Dutch inspectorate was also based on ICALT.
behavioral and cognitive viewpoints as well. TIMSS follows both students’ and teachers’
actions and discourse and tracks the degree to which these are public (i.e., shared with
the entire classroom) or private (i.e., between a small number of students). The system is
subject specific and targeted toward secondary grades.
The TIMSS codes capture six of the nine teaching dimensions in the framework; safe
and stimulating learning environment, assessment for learning, and teaching learning
and student self-regulation are not addressed. TIMSS does
not have scales, but instead codes that are grouped conceptually and describe the
subject matter of mathematics lessons by documenting the lesson’s specific mathema-
tical subject matter, the organization of the lesson, and the instructional processes. Each
lesson is segmented into problem-based interactions of variable length and mutually
exclusive categories, called coverage codes. Twenty-one coverage codes define the
organization of the lesson including whether mathematics is being taught, in what
problem format, and whether or not problems overlap. There are also occurrence
codes that describe the types of activities engaged in by students as well as how
those activities unfold, the resources being used, and the nature of mathematical
practices and interactions emphasized. Codes were developed based on mathematics
education research and on a collaborative process of viewing videos from seven countries
that attempted to capture both similarities and differences across countries (Stigler,
Gallimore, & Hiebert, 2000).
TIMSS is a low-inference system. TIMSS is scored using both a video and a
standardized transcript of the lesson. Using transcripts and the 110-page system,
general and specialized raters make a total of seven passes through a video and its
associated transcript in order to assign categorical codes to the entire lesson. TIMSS
parses teaching into very small pieces, for example, whether there was a mathema-
tical generalization present, how many there were, or how many graphs were drawn
publicly. And yet, alone, the codes do not make judgments about teaching quality.
Analysts bring a teaching quality analytic framework to the codes in order to aggre-
gate the codes in ways that allow judgments about teaching quality to be made (e.g.,
Leung, 2005).
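The sketch below illustrates, in a purely hypothetical way, how an analyst might bring an external analytic rule to such low-inference lesson codes: occurrences are counted per lesson and a simple quality-related judgment is derived from the counts. The code names and the decision rule are invented for the example and are not the TIMSS coding scheme.

```python
# Hypothetical sketch of aggregating low-inference, TIMSS-style lesson codes
# into a quality-related judgment using an external analytic rule.

from collections import Counter

lesson_codes = ["independent_problem", "mathematical_generalization",
                "multiple_solution_methods", "independent_problem",
                "mathematical_generalization", "labels_and_symbols"]

counts = Counter(lesson_codes)

# One possible analytic rule: call a lesson "cognitively demanding" if it contains
# at least one generalization and more than one reasoning-related code overall.
reasoning_codes = {"mathematical_generalization", "multiple_solution_methods", "links"}
cognitively_demanding = (counts["mathematical_generalization"] >= 1
                         and sum(counts[c] for c in reasoning_codes) > 1)
print(counts, cognitively_demanding)
```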
All raters are required to pass a certification test, and lessons are double scored. All
codes are aggregated to the lesson level, and, to our knowledge, no one has attempted
to make systematic claims about teachers, instead focusing on descriptions of teaching
within and across countries. There is no training offered by the developers; however, the
codes are available for free download in technical documentation for the study (Jacobs
et al., 2003).
The original reports of the coding schemes detail the lesson-level reliability of coding
as well as the standard errors for each code; additional reports describe the develop-
ment and application of the codes (e.g., Givvin et al., 2005). Our review did not identify a
published factor analysis. We also found no validation evidence that considered various
scoring models or investigated the quality of ratings beyond general rater agreement
metrics. Ratings on TIMSS have been linked to student achievement at least once (World
Bank, 2010).
Initial codes were developed in 1994 and revised for use in the 1999 TIMSS Video
study. The only other study our search located that used the full codes appears to use
the 1999 codes (World Bank, 2010). While we were able to locate studies that used
the TIMSS video capture methodology (e.g., two cameras, a medium-angle camera
shot) or reanalyzed the videos (e.g., Hugener et al., 2009; Kunter & Baumert, 2006), we found
only a single English-language study in which a non-developer researcher used the
TIMSS codes as described in the technical manual (World Bank, 2010). To our knowledge,
there are no additional studies that modify and report on those modified codes, thus
indicating little progression on a developmental continuum.
academic engagement (Cohen & Grossman, 2016; T.J. Kane & Staiger, 2012). Developers have
also documented that relationships between PLATO scores and student achievement are
sensitive to the student achievement test used (Grossman, Cohen, Ronfeldt, & Brown, 2014) as
well as sensitive to the grade level, topic, and student demographic characteristics (Grossman,
Cohen, & Brown, 2014).
First shared publicly in 2009, PLATO has iterated through multiple versions, including
PLATO Prime that was used in the MET study (T.J. Kane & Staiger, 2012). The current
version is 5.0. Research has been carried out by developers and non-developer research-
ers (Dalland, Klette, & Svenkerud, 2018; Grossman et al., 2013; Klette & Blikstad-Balas,
2018). Together, this suggests the system has begun to make progress along its devel-
opmental continuum.
Table 2 summarizes the aspects of the four observation systems evaluated.
Discussion
After defining the observation system concept, we presented a framework for analyzing
observation systems and then applied the framework to four well-known systems. The
framework’s aspects seem to have value as they point to relevant differences between
the four observation systems. If practitioners or researchers plan to use an observation
system, it is important to be aware of how observation systems can differ, and make
informed choices regarding the observation system that will best suit their purposes.
Applying the framework reveals that all but one dimension of teaching (i.e., teaching
learning strategies and student self-regulation) is addressed by at least three observation
systems. All four observation systems address a core group of dimensions, but they do not all
measure the same dimensions of teaching. Only the dimensions involvement/motivation
and cognitive activation were measured by all four instruments. Unmeasured dimensions
are also fundamental aspects of teaching quality; however, there may be defensible
reasons for not including these dimensions in an observation system, depending on
one’s purpose (e.g., Park et al., 2014). The framework’s contribution is not to endorse a
particular system, but, rather, its application can support more deliberate selection and
use of observation systems.
This also applies to the view of teaching and learning that underlies a specific
observation system, because there is no “one best” observation system.
Definitions of teaching quality are informed by empirical matters, but they are also
influenced by preferences and values regarding good teaching.
Whether a system was generic or subject specific did not appear to follow any
pattern across the instruments selected. It was clear, though, that both
types of systems have produced empirical evidence of a relationship between scores and
student achievement, as well as documented movement along a developmental
continuum. This may suggest that either type can be useful from a predictive perspec-
tive (Praetorius et al., 2018) and that others can learn to use both types of systems. If this
suggestion holds up to a systematic and rigorous analysis across more than four
systems, decisions about the general or subject-specific nature of a system may need
to be driven by users’ specific needs for the system (Hill & Grossman, 2013). For example,
subject-specific systems might be particularly useful when researchers are studying the
impact of a professional development program or providing feedback to teachers in that
subject.
Disclosure statement
No potential conflict of interest was reported by the authors.
ORCID
Courtney A. Bell http://orcid.org/0000-0001-8743-5573
Adrie Visscher http://orcid.org/0000-0001-8443-9878
References
Abadzi, H. (2009). Instructional time loss in developing countries: Concepts, measurement, and
implications. The World Bank Research Observer, 24(2), 267–290. doi:10.1093/wbro/lkp008
Abry, T., Rimm-Kaufman, S. E., Larsen, R. A., & Brewer, A. J. (2013). The influence of fidelity of
implementation on teacher–student interaction quality in the context of a randomized con-
trolled trial of the Responsive Classroom approach. Journal of School Psychology, 51(4), 437–453.
doi:10.1016/j.jsp.2013.03.001
Archer, J., Cantrell, S., Holtzman, S. L., Joe, J. N., Tocci, C. M., & Wood, J. (2017). Better feedback for
better teaching: A practical guide to improving classroom observations. Retrieved from http://
k12education.gatesfoundation.org/resource/better-feedback-for-better-teaching-a-practical-
guide-to-improving-classroom-observations/
Baumert, J., Kunter, M., Blum, W., Brunner, M., Voss, T., Jordan, A., . . . Tsai, Y.-M. (2010). Teachers’
mathematical knowledge, cognitive activation in the classroom, and student progress. American
Educational Research Journal, 47(1), 133–180. doi:10.3102/0002831209345157
Bell, C. A., Gitomer, D. H., McCaffrey, D. F., Hamre, B. K., Pianta, R. C., & Qi, Y. (2012). An argument
approach to observation protocol validity. Educational Assessment, 17(2–3), 62–87. doi:10.1080/
10627197.2012.715014
Bill & Melinda Gates Foundation. (2018). Measures of effective teaching project: Frequently asked
questions. Retrieved from http://k12education.gatesfoundation.org/blog/measures-of-effective-
teaching-project-faqs/
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education:
Principles, Policy & Practice, 5(1), 7–74. doi:10.1080/0969595980050102
Black, P., & Wiliam, D. (2010). Inside the black box: Raising standards through classroom assess-
ment. Phi Delta Kappan, 92(1), 81–90. doi:10.1177/003172171009200119
Boekaerts, M., Pintrich, P. R., & Zeidner, M. (Eds.). (2000). Handbook of self-regulation. San Diego, CA:
Academic Press.
Brophy, J. E., & Good, T. L. (1986). Teacher behavior and student achievement. In M. C. Wittrock
(Ed.), The handbook of research on teaching (3rd ed., pp. 328–375). New York, NY: Macmillan.
Carnine, D. W., Dixon, R. C., & Silbert, J. (1998). Effective strategies for teaching mathematics. In E. J.
Kame’enui & D. W. Carnine (Eds.), Effective teaching strategies that accommodate diverse learners
(pp. 93–112). Columbus, OH: Merrill.
Charalambous, C. Y., & Praetorius, A.-K. (2018). Studying mathematics instruction through different
lenses: Setting the ground for understanding instructional quality more comprehensively. ZDM
Mathematics Education, 50(3), 355–366. doi:10.1007/s11858-018-0914-8
Clarke, D., Emanuelsson, J., Jablonka, E., & Mok, I. A. C. (Eds.). (2006). Making connections:
Comparing mathematics classrooms around the world. Rotterdam: Sense.
Cohen, J. (2015). Challenges in identifying high leverage practices. Teachers College Record, 117(7), 1–41.
Cohen, J., & Grossman, P. (2016). Respecting complexity and measures of teaching: Keeping
students and schools in focus. Teaching and Teacher Education, 55, 308–317. doi:10.1016/j.
tate.2016.01.017
Dalland, C. P., Klette, K., & Svenkerud, S. (2018). Video studies and the challenge of selecting time
scales. Manuscript submitted for publication.
Danielson, C. (2013). The Framework for Teaching Evaluation Instrument. Princeton, NJ: Danielson
Group.
Decristan, J., Klieme, E., Kunter, M., Hochweber, J., Büttner, G., Fauth, B., . . . Hardy, I. (2015).
Embedded formative assessment and classroom process quality: How do they interact in
promoting science understanding? American Educational Research Journal, 52(6), 1133–1159.
doi:10.3102/0002831215596412
Dobbelaer, M. J., & Visscher, A .J. (2018). The quality of classroom observation systems for measuring
teaching quality in primary education – A systematic review. Manuscript submitted for
publication.
Flanders, N. A. (1970). Analyzing teaching behavior. Boston, MA: Addison Wesley.
Givvin, K. B., Hiebert, J., Jacobs, J. K., Hollingsworth, H., & Gallimore, R. (2005). Are there national
patterns of teaching? Evidence from the TIMSS 1999 video study. Comparative Education Review,
49(3), 311–343. doi:10.1086/430260
Grossman, P. L. (2018). The Protocol for Language Arts Teaching Observation (PLATO). Retrieved from
http://platorubric.stanford.edu/index.html
Grossman, P., Cohen, J., & Brown, L. (2014). Understanding instructional quality in English
Language Arts: Variations in PLATO scores by content and context. In T. J. Kane, K. A. Kerr, &
R. C. Pianta (Eds.), Designing teacher evaluation systems: New guidance from the Measures of
Effective Teaching Project (pp. 303–331). San Francisco, CA: Jossey-Bass.
Grossman, P., Cohen, J., Ronfeldt, M., & Brown, L. (2014). The test matters: The relationship
between classroom observation scores and teacher value added on multiple types of assess-
ment. Educational Researcher, 43(6), 293–303. doi:10.3102/0013189X14544542
Grossman, P., Greenberg, S., Hammerness, K., Cohen, J., Alston, C., & Brown, M. (2009, April).
Development of the Protocol for Language Arts Teaching Observation (PLATO). Paper presented
at the Annual Meeting of the American Educational Research Association, San Diego, CA.
Grossman, P., Loeb, S., Cohen, J., & Wyckoff, J. (2013). Measure for measure: The relationship
between measures of instructional practice in middle school English language arts and
teachers’ value-added scores. American Journal of Education 119(3), 445–470. doi:10.1086/
669901
Grossman, P., & McDonald, M. (2008). Back to the future: Directions for research in teaching and
teacher education. American Educational Research Journal, 45(1), 184–205. doi:10.3102/
0002831207312906
Henry, A. E. (2010). Advantages to and challenges of using ratings of observed teacher-child interac-
tions (Unpublished doctoral dissertation). University of Virginia, Charlottesville, VA.
Hiebert, J., & Grouws, D. A. (2007). The effects of classroom mathematics teaching on students’
learning. In F. K. Lester (Ed.), Second handbook of research on mathematics teaching and learning
(pp. 371–404). Charlotte, NC: Information Age.
Hill, H. C. (2018). Mathematical Quality of Instruction (MQI) domains. Retrieved from https://cepr.
harvard.edu/mqi-domains
Hill, H. C., Blunk, M. L., Charalambous, C. Y., Lewis, J. M., Phelps, G. C., Sleep, L., & Ball, D. L. (2008).
Mathematical knowledge for teaching and the mathematical quality of instruction: An explora-
tory study. Cognition and Instruction, 26(4), 430–511. doi:10.1080/07370000802177235
Hill, H. C., Charalambous, C. Y., & Kraft, M. A. (2012). When rater reliability is not enough: Teacher
observation systems and a case for the generalizability study. Educational Researcher, 41(2), 56–
64. doi:10.3102/0013189X12437203
Hill, H. C., & Grossman, P. (2013). Learning from teacher observations: Challenges and opportu-
nities posed by new teacher evaluation systems. Harvard Educational Review, 83(2), 371–384.
Hugener, I., Pauli, C., Reusser, K., Lipowsky, F., Rakoczy, K., & Klieme, E. (2009). Teaching patterns
and learning quality in Swiss and German mathematics lessons. Learning and Instruction, 19(1),
66–78. doi:10.1016/j.learninstruc.2008.02.001
International Association for the Evaluation of Educational Achievement. (2018). The TIMSS Video
study. Retrieved from http://www.timssvideo.com/the-study/
Jacobs, J., Garnier, H., Gallimore, R., Hollingsworth, H., Givvin, K. B., Rust, K., . . . Stigler, J. W. (2003).
Third International Mathematics and Science Study 1999 Video Study Technical Report: Volume 1:
Mathematics (NCES 2003012). Washington, DC: National Center for Education Statistics.
Joe, J. N., McClellan, C. A., & Holtzman, S. L. (2014). Scoring design decisions: Reliability and the
length and focus of classroom observations. In T. J. Kane, K. A. Kerr, & R. C. Pianta (Eds.),
Designing teacher evaluation systems: New guidance from the Measures of Effective Teaching
Project (pp. 415–443). San Francisco, CA: Jossey-Bass.
Kane, M. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64).
Westport, CT: American Council on Education and Praeger.
Kane, T. J., Kerr, K. A., & Pianta, R. C. (Eds.). (2014). Designing teacher evaluation systems: New
guidance from the Measures of Effective Teaching Project. San Francisco, CA: Jossey-Bass.
Kane, T. J., & Staiger, D. O. (2012). Gathering feedback for teaching: Combining high-quality
observations with student surveys and achievement gains. Retrieved from https://files.eric.ed.
gov/fulltext/ED540962.pdf
Keuning, T., Van Geel, M., Frèrejean, J., Van Merriënboer, J., Dolmans, D., & Visscher, A. J. (2017).
Differentiëren bij rekenen: Een cognitieve taakanalyse van het denken en handelen van
Praetorius, A.-K., Klieme, E., Herbert, B., & Pinger, P. (2018). Generic dimensions of teaching quality:
The German framework of Three Basic Dimensions. ZDM Mathematics Education, 50(3), 407–426.
doi:10.1007/s11858-018-0918-4
Praetorius, A.-K., Pauli, C., Reusser, K., Rakoczy, K., & Klieme, E. (2014). One lesson is all you need?
Stability of instructional quality across lessons. Learning and Instruction, 31, 2–12. doi:10.1016/j.
learninstruc.2013.12.002
Rosenshine, B. (1980). How time is spent in elementary classrooms. In C. Denham & A. Lieberman
(Eds.), Time to learn (pp. 107–126). Washington, DC: National Institute of Education.
Saginor, N. (2008). Diagnostic classroom observation: Moving beyond best practice. Thousand Oaks,
CA: Corwin Press.
Sandilos, L. E., Shervey, S. W., DiPerna, J. C., Lei, P., & Cheng, W. (2016). Structural validity of CLASS
K-3 in primary grades: Testing alternative models. School Psychology Quarterly, 32(2), 226–239.
doi:10.1037/spq0000155
Sawada, D., Piburn, M. D., Judson, E., Turley, J., Falconer, K., Benford, R., & Bloom, I. (2002).
Measuring reform practices in science and mathematics classrooms: The Reformed Teaching
Observation Protocol. School Science and Mathematics, 102(6), 245–253. doi:10.1111/j.1949-
8594.2002.tb17883.x
Schacter, J., & Thum, Y. M. (2004). Paying for high- and low-quality teaching. Economics of Education
Review, 23(4), 411–430. doi:10.1016/j.econedurev.2003.08.002
Scheerens, J. (2014). School, teaching, and system effectiveness: Some comments on three state-
of-the-art reviews. School Effectiveness and School Improvement, 25(2), 282–290. doi:10.1080/
09243453.2014.885453
Seidel, T., Prenzel, M., & Kobarg, M. (Eds.). (2005). How to run a video study: Technical report of the
IPN video study. Münster: Waxmann.
Seidel, T., & Shavelson, R. J. (2007). Teaching effectiveness research in the past decade: The role of
theory and research design in disentangling meta-analysis results. Review of Educational
Research, 77(4), 454–499. doi:10.3102/0034654307310317
Slavin, R. E. (1996). Education for all. Lisse: Swets & Zeitlinger.
Stallings, J. A. (1973). Follow through program classroom observation evaluation 1971–72. (Report
No. SRI-URU-7370). Menlo Park, CA: Stanford Research Institute.
Stigler, J. W., Gallimore, R., & Hiebert, J. (2000). Using video surveys to compare classrooms and
teaching across cultures: Examples and lessons from the TIMSS video studies. Educational
Psychologist, 35(2), 87–100. doi:10.1207/S15326985EP3502_3
Teachstone. (2015). Why class? Exploring the promise of the Classroom Assessment Scoring System
(CLASS). Retrieved from http://cdn2.hubspot.net/hubfs/336169/What_Is_CLASS_ebook_Final.
pdf?t=1446
TIMSS Video Mathematics Research Group. (2003). Understanding and improving mathematics
teaching: Highlights from the TIMSS 1999 Video Study. Phi Delta Kappan, 84(10), 768–775.
doi:10.1177/003172170308401011
Tomlinson, C. A. (2004). The Möbius effect: Addressing learner variance in schools. Journal of
Learning Disabilities, 37(6), 516–524. doi:10.1177/00222194040370060601
Tomlinson, C. A., Brimijoin, K., & Narvaez, L. (2008). The differentiated school: Making revolutionary
changes in teaching and learning. Alexandria, VA: ASCD.
Van de Grift, W. (2007). Quality of teaching in four European countries: A review of the literature
and application of an assessment instrument. Educational Research, 49(2), 127–152. doi:10.1080/
00131880701369651
Van de Grift, W., Van der Wal, M., & Torenbeek, M. (2011). Ontwikkeling in de pedagogisch
didactische vaardigheid van leraren in het basisonderwijs [The development of primary school
teachers’ pedagogical and didactical skill]. Pedagogische Studiën, 88(6), 416–432.
Van der Lans, R. M., Van de Grift, W. J. C. M., & Van Veen, K. (2018). Developing an instrument for
teacher feedback: Using the Rasch model to explore teachers’ development of effective teach-
ing strategies and behaviors. The Journal of Experimental Education, 86(2), 247–264. doi:10.1080/
00220973.2016.1268086
Van der Lans, R. M., Van de Grift, W. J. C. M., Van Veen, K., & Fokkens-Bruinsma, M. (2016). Once is
not enough: Establishing reliability criteria for feedback and evaluation decisions based on
classroom observations. Studies in Educational Evaluation, 50, 88–95. doi:10.1016/j.
stueduc.2016.08.001
Van Veen, K., Zwart, R., & Meirink, J. (2012). What makes teacher professional development
effective? A literature review. In M. Kooy & K. van Veen (Eds.), Teacher learning that matters:
International perspectives (pp. 3–21). Abingdon: Routledge.
Wang, M. C., Haertel, G. D., & Walberg, H. J. (1993). Toward a knowledge base for school learning.
Review of Educational Research, 63(3), 249–294. doi:10.3102/00346543063003249
World Bank. (2010). Inside Indonesia’s mathematics classrooms: A TIMSS video study of teaching
practices and student achievement (Report No. 54936–ID). Jakarta: Author.
Zimmerman, B. J. (1990). Self-regulated learning and academic achievement: An overview.
Educational Psychologist, 25(1), 3–17. doi:10.1207/s15326985ep2501_2
Appendix 1. Alignment of four exemplar observation systems to framework dimensions of teaching

Observation systems compared: ICALT, CLASS K-3, CLASS UE, TIMSS (see Note 1), PLATO.

Safe and stimulating learning environment
  ICALT: Shows respect; maintains relaxed atmosphere; promotes learners’ self-confidence; fosters mutual respect; stimulates the building of self-confidence in weaker learners
  CLASS K-3: Behavior Management; Positive Climate; Negative Climate; Teacher Sensitivity
  CLASS UE: Positive Climate; Negative Climate; Teacher Sensitivity; Behavior Management
  TIMSS: –
  PLATO: –

Classroom management
  ICALT: Ensures the lesson proceeds in an orderly manner; monitors to ensure learners carry out activities; provides effective classroom management; uses the time for learning efficiently; gives a clear explanation of how to use didactic aids
  CLASS K-3: Behavior Management; Productivity
  CLASS UE: Behavior Management; Productivity
  TIMSS: Time of lesson; patterns of public/private classroom interaction; non-mathematics/off topic; break; outside interruption
  PLATO: Time Management; Behavior Management

Involvement/motivation of students
  ICALT: Engages all learners in the lesson; encourages learners to do their best; offers activities and work forms that stimulate learners to take an active approach; learners are fully engaged in the lesson; learners show that they are interested; learners take an active approach to learning; gives interactive instructions
  CLASS K-3: Regard for Student Perspectives; Instructional Learning Formats
  CLASS UE: Regard for Student Perspectives; Student Engagement
  TIMSS: How many students; required or optional; length of working-on; facilitating exploration
  PLATO: Connections to personal and/or cultural experiences

Explanation of subject matter
  ICALT: Presents and explains the subject matter in a clear manner; teaches in a well-structured manner; clearly specifies the lesson aims at the start of the lesson
  CLASS K-3: –
  CLASS UE: Content Understanding; Instructional Learning Formats
  TIMSS: Independent problem*; answered only problem*; concurrent problem set-up*; concurrent problem seat work*
  PLATO: Text-Based Instruction; Purpose

Cognitive activation
  ICALT: Asks questions which stimulate learners to reflect; encourages learners to think critically; stimulates the application of what has been learned; lets learners think aloud
  CLASS K-3: Concept Development
  CLASS UE: Analysis and Inquiry; Instructional Dialogue
  TIMSS: Resources used*; multiple solution methods*; problem summary*; types of information or activity in non-problem*; contextual information (mathematical concept/theory/idea; activity)*; private work assignment*; private work segments (organization of students; display information; administrative activity; type of public announcements)*; purpose*; mathematical generalizations*; labels and symbols*; links*
  PLATO: Intellectual Challenge; Classroom Discourse

Assessment for learning
  ICALT: Gives feedback to learners; during the presentation stage, checks whether learners have understood; evaluates whether the lesson aims have been reached
  CLASS K-3: Quality of Feedback
  CLASS UE: Quality of Feedback
  TIMSS: –
  PLATO: Guided Practice

Differentiated instruction
  ICALT: Stimulates the building of self-confidence in weaker learners; offers weaker learners extra study and instruction time; adjusts instructions to relevant inter-learner differences; adjusts the processing of subject matter to relevant inter-learner differences
  CLASS K-3: Teacher Sensitivity; Quality of Feedback
  CLASS UE: Teacher Sensitivity; Instructional Learning Formats
  TIMSS: Required or optional; degree of student choice
  PLATO: Accommodations for Language Learning

Teaching learning and student self-regulation
  ICALT: Stimulates learners to think about solutions; lets learners think aloud; encourages students to think critically; teaches learners how to simplify complex problems
  CLASS K-3: –
  CLASS UE: Analysis and Inquiry
  TIMSS: –
  PLATO: Explicit Strategy Instruction

Note 1: As noted in the TIMSS description, the TIMSS codes do not directly convey decisions about quality; therefore, they are difficult to map onto the framework’s dimensions of teaching, which specify particular and explicit values about how instruction proceeds. In particular, we found that the TIMSS codes aligned to explanation of subject matter, quality of subject-matter representation, and cognitive activation align to all three of those dimensions. For example, using two representations – a graph and a table – might support high-quality subject-matter explanations; they might also be a quality representation; and they might support cognitive activation. All of the codes marked with a * are aligned to all three aforementioned dimensions of the framework.