
SCHOOL EFFECTIVENESS AND SCHOOL IMPROVEMENT
2019, VOL. 30, NO. 1, 3–29
https://doi.org/10.1080/09243453.2018.1539014

ARTICLE

Qualities of classroom observation systems


Courtney A. Bell^a, Marjoleine J. Dobbelaer^b, Kirsti Klette^c and Adrie Visscher^b

^a Center for Global Assessment, Educational Testing Service, Princeton, NJ, USA; ^b ELAN, Department of Teacher Development, Faculty of Behavioural, Management and Social Sciences, University of Twente, Enschede, The Netherlands; ^c Department of Teacher Education and School Research, Faculty of Educational Sciences, University of Oslo, Oslo, Norway

ABSTRACT
Observation systems are increasingly used around the world for a variety of purposes; 2 critical purposes are to understand and to improve teaching. As observation systems differ considerably, individuals must decide what observation system to use. But the field does not have a common specification of an observation system, nor does it have systematic ways of thinking about how observation systems are similar and different. Given this reality and the renewed global interest in observation systems, this article first defines the observation system concept and then presents a framework through which to understand, categorize, and compare observation systems. We apply the framework to 4 well-known observation systems that vary in important ways. The article concludes with a discussion of the results of the application of the framework and some important implications of those findings.

KEYWORDS
Teacher evaluation; teaching evaluation; teaching quality; classroom observation; observation systems

Introduction
Observation systems are used around the world for a variety of purposes. Two critical
purposes are to understand and improve teaching. Scholars often seek to understand
teaching by identifying dimensions of teaching and investigating how those dimensions
contribute to valued outcomes such as student learning or students’ motivation (e.g.,
Decristan et al., 2015). They also seek to use observation systems to improve teaching. In
order to improve teaching, one must first measure it and understand it. This means that
the scores from observation systems can be used to provide feedback and coaching to
teachers as well as to evaluate interventions hypothesized to affect teaching (e.g., Kraft
& Blazar, 2017). But when individuals set out to understand and/or improve teaching,
they face many choices. For example, should they use a system that can be used across
school subjects, a so-called “generic” system, or one that is subject specific? Should they
select a system that produces more narrow and detailed information or one that
produces more global, summary information? To what degree do existing systems
serve the specific purposes the individual has in mind?

CONTACT Courtney A. Bell cbell@ets.org


Supplemental data for this article can be accessed online.
© 2018 Informa UK Limited, trading as Taylor & Francis Group

Table 1. Framework for evaluating an observation system.

Categories              Aspects
Content                 (1) Dimensions of teaching
                        (2) Community's view of teaching and learning
                        (3) Subject specificity
                        (4) Grain size
                        (5) Focus on students' actions
Guidelines              (6) Scoring procedures
Empirical support       (7) Empirical evidence
Implementation scale    (8) Developmental continuum

Renewed interest in observation systems around the world, spurred in part by
seminal large-scale research such as the Trends in International Mathematics and
Science Study (TIMSS) video study (International Association for the Evaluation of
Educational Achievement, 2018) and the Measures of Effective Teaching (MET) project
(Bill & Melinda Gates Foundation [BMGF], 2018), has generated significant research and
development work on observation systems. This research and development has not yet
been synthesized in ways that allow the field to take stock of what we have learned
about observation systems.
As a contributing step toward understanding recent work on observation systems,
this article first describes what we mean by the term “observation system”. After
clarifying this, we present one framework that can be used to understand how observa-
tion systems vary. In this framework, we develop eight observation system aspects (see
Table 1) we hope will be useful to better understand different observation systems. To
illustrate the eight aspects of the framework, we apply the framework to four rather
well-known observation systems. Finally, the article concludes with a discussion of the
results of the application of the framework and what they imply for using observation
systems for a specific purpose.

What is an observation system?


Observation protocols are often thought of as a sheet of paper with categories or rubrics
which a rater uses to judge the quality of teaching in a lesson. The dimensions of teaching
judged are rated, and the ratings are aggregated into a score (e.g., averaging of ratings, or
through item response theory [IRT]). In schools, these scores are often used to provide
teachers with improvement feedback or to evaluate individual teachers. In research con-
texts, scores are often analyzed to determine how they relate to valued outcomes such as
student learning, professional development efforts, nurturing learning environments, and
many others. Although the sheets of paper with scales are very important, there is much
more to observation protocols than constructs and scales.
When measuring teaching through observations, one must measure selected aspects
of teaching by sampling lessons or parts of lessons and ensuring that the ratings of
those lessons are of reasonable quality (Joe, McClellan, & Holtzman, 2014). To accom-
plish these tasks in valid and reliable ways, observation systems can be conceptualized
as being comprised of scoring tools, rating quality procedures, and sampling specifica-
tions (Hill, Charalambous, & Kraft, 2012; Liu, Bell, & Jones, 2017).
SCHOOL EFFECTIVENESS AND SCHOOL IMPROVEMENT 5

The scoring tools in an observation system specify which dimensions of teaching will
be measured. These tools include the scales themselves – both the teaching practices
being assessed and the number and definition of the score points (e.g., present/not
present, a 3-point criterion-referenced scale or rubric). Because observation scales are
designed to measure complex human interactions, raters come to understand the scales
through videos (and/or text-based descriptions) of teaching that have been rated by
someone who understands the scales and score point distinctions. These video- and
text-based descriptions show raters how the words of the scoring scales are embodied
in teachers’ and students’ words and actions.
As has been documented in some observation systems, human rating of teaching is
prone to being unreliable and inaccurate, especially when coding certain aspects of
teaching practice such as intellectual challenge or cognitive activation (e.g., T.J. Kane &
Staiger, 2012; Decristan et al., 2015; Praetorius, Pauli, Reusser, Rakoczy, & Klieme, 2014).
Therefore, it is very important for observation systems to have rating quality procedures
(Park, Chen, & Holtzman, 2014). These procedures are used to ensure that raters are well
trained and are able to use the rating scales accurately and reliably over time. A
common quality procedure is the formal training and certification of raters.
Certification tests often mimic the work raters will do in studies or in practice. For
example, raters might be required to take and pass a certification test in which they
rate a lesson and their ratings must agree exactly with master ratings on 80% of the
rating scales. Another common procedure is double scoring, the practice of having two
raters independently assign ratings to the same lesson in order to compute inter-rater
agreement metrics.
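To make these procedures concrete, the following minimal sketch (our illustration, not code from any published system; the 1–4 rubric, the function name, and the example ratings are all hypothetical) computes the two checks just described: exact agreement with master ratings for certification, and an inter-rater agreement metric for a double-scored lesson.

```python
def exact_agreement(ratings_a, ratings_b):
    """Share of scoring scales on which two sets of ratings for the
    same lesson agree exactly."""
    assert len(ratings_a) == len(ratings_b)
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)

# Hypothetical ratings on ten scoring scales of a 1-4 rubric.
master_ratings = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]  # assigned by a master rater
candidate      = [3, 2, 4, 2, 1, 2, 3, 4, 2, 3]  # assigned by a trainee
second_rater   = [3, 3, 4, 3, 1, 2, 2, 4, 2, 3]  # independent double scoring

# Certification rule from the text: exact agreement with master
# ratings on at least 80% of the rating scales.
print(exact_agreement(candidate, master_ratings) >= 0.80)  # True (90%)

# Inter-rater agreement metric for a double-scored lesson.
print(exact_agreement(master_ratings, second_rater))       # 0.8
```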
Finally, sampling specifications are the details around how the observations sample
from the larger domain to which the ratings are intended to generalize (Joe et al., 2014).
These specifications include, but are not limited to, the number of observations con-
ducted for a reliable estimate of teaching quality, the length of time of those observa-
tions, the frequency with which raters assign ratings (e.g., every 10 min, every 30 min),
and how lessons are sampled from the unit of analysis. For example, for a primary
teacher, how does a four-lesson sample used by researchers vary across the subjects
that teacher might teach? Are there only language and mathematics lessons? Are all
lessons from April and May, or are they sampled from the entire school year? These and
other similar questions are addressed in the sampling specifications of an observation
system.
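To illustrate what such sampling specifications might look like when written down explicitly, here is a small sketch; every field name, value, and helper function below is invented for illustration and is not drawn from any of the systems discussed in this article.

```python
import random

# A hypothetical sampling specification for a primary-school study.
sampling_spec = {
    "observations_per_teacher": 4,            # lessons per reliable estimate
    "observation_length_min": 60,             # whole lessons
    "rating_interval_min": 15,                # raters rate every 15 min
    "subjects": ["language", "mathematics"],  # not only one subject
    "spread_over_school_year": True,          # avoid clustering in April-May
}

def sample_lessons(lessons, spec, seed=0):
    """Draw two lessons per subject (matching observations_per_teacher = 4
    with two subjects), taking one from each half of the school year when
    the spec asks for spread across the year."""
    rng = random.Random(seed)
    sample = []
    for subject in spec["subjects"]:
        pool = sorted((l for l in lessons if l["subject"] == subject),
                      key=lambda l: l["week"])
        half = len(pool) // 2
        if spec["spread_over_school_year"]:
            sample += [rng.choice(pool[:half]), rng.choice(pool[half:])]
        else:
            sample += rng.sample(pool, 2)
    return sample

# Forty hypothetical lessons per subject, one per school week.
lessons = [{"subject": s, "week": w}
           for s in ("language", "mathematics") for w in range(1, 41)]
print(sample_lessons(lessons, sampling_spec))  # 4 lessons, spread over the year
```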
Given this description of an observation system, in what follows, we propose a
framework to guide considerations of existing observation protocols, hereafter referred
to as observation systems. Our framework hypothesizes eight aspects of observation
systems, which might be used to better understand the affordances and constraints of
any such observation system (see Table 1). We then use these eight aspects of observa-
tion systems to consider four different observation systems. In doing so, we hope to
show how observation systems can be considered side by side, thereby contributing to
the field’s meta-knowledge of observation systems.
In order to improve teaching, one must have a theory of improvement (Van Veen,
Zwart, & Meirink, 2012). Observation systems rarely have such theories embedded within
them; however, all observation systems parse teaching in specific ways, valuing one
grouping of teaching practices over an alternative grouping (Praetorius & Charalambous,
2018). It is therefore important to fully consider the nature of an observation system
when embarking on the improvement of teaching.

Framework for analyzing observation systems


A framework for evaluating and describing the nature of observation systems can
include many different aspects of such systems. We do not presume to cover all possible
aspects, and subsequent scholarship may productively expand or revise this initial set.
Given our collective international experience of developing and using observation
systems, we selected aspects that we believe are essential for categorizing classroom
observation systems and which vary between systems. The eight aspects included in our
framework relate to the following broader categories: the content of classroom observa-
tion systems (e.g., which dimensions of teaching are evaluated, is the system for general
use, or is it subject specific?), whether the system includes guidelines for proper use of
the system, whether there is empirical evidence for the content and the use of the
system, and the scale of the implementation of the system (e.g., only used by its
developers, or also by others). Below, we elaborate these categories into eight more fine-grained aspects.

Dimensions of teaching
Observation systems include dimensions of teaching that are considered to be indicators
of teaching quality. The assumption generally is that the better a teacher scores on these
indicators, the better the teaching, and, therefore, the more his/her students will learn.
Some frequently used indicators in observation systems originate in the process-product
studies of teaching (e.g., classroom management, clear explanation of subject matter;
Brophy & Good, 1986). Others come from other strands of research, for example, the
TIMSS studies (e.g., cognitive activation; Baumert et al., 2010; Hiebert & Grouws, 2007),
research on assessment for learning (Black & Wiliam, 1998), self-regulation (Zimmerman,
1990), and instructional differentiation (e.g., Tomlinson, Brimijoin, & Navaez, 2008).
We present dimensions of teaching quality frequently included in classroom obser-
vation systems based on two reviews of classroom observation systems. In the first, a
comprehensive review of classroom observation systems was conducted by means of a
5-step search strategy based on Littell, Corcoran, and Pillai (2008), including a sys-
tematic literature review and contacting experts in the field. The main inclusion
criteria concerned whether the systems were developed for measuring teaching
quality in primary education and were published after 1990 in English or in Dutch.
Also, research into reliability and validity had to be conducted in primary education,
and the systems had to provide useful data for practitioners in the field. The 27
classroom observation systems that met the criteria were reviewed by two reviewers
(Dobbelaer & Visscher, 2018).
The other, less systematic reviews (Charalambous & Praetorius, 2018; Klette &
Blikstad-Balas, 2018), summarize and organize existing frameworks based on the distinc-
tion between generic versus subject-specific frameworks (Charalambous & Praetorius,
2018), conceptual framings and vocabulary used (Praetorius, Klieme, Herbert, & Pinger,
2018), and/or review system characteristics along the aspects in our framework,
dimensions of teaching captured, and scoring/rating requirements in four observation
manuals (Klette & Blikstad-Balas, 2018).
Across these reviews, it is clear that teaching can be conceptualized in many different
ways; therefore, these dimensions are just one way to delineate teaching. Dimensions of
teaching include:

● Safe and stimulating classroom climate: This dimension refers to the degree to
which teachers and students respect one another, communicate with each other in a
supportive way, and together create a safe and positive classroom climate in which
student learning is promoted (e.g., Danielson, 2013; Saginor, 2008).
● Classroom management: Classroom management reflects the degree to which
teachers and students manage their behavior and time in such a way that learning
can be productive. In a well-managed class, little time and energy are lost on
activities that are not learning oriented (Marzano, Marzano, & Pickering, 2003;
Wang, Haertel, & Walberg, 1993).
● Involvement and motivation of students: This dimension is about the extent to
which teachers involve all students actively in classroom learning activities, and
how much students participate in classroom learning activities (Rosenshine, 1980;
Schacter & Thum, 2004).
● Explanation of subject matter: How clearly teachers explain the subject matter to
be learned to their students is crucial for how much students learn. Clear explana-
tions include clear specification of lesson objectives to students, reviewing previous
learning, the use of clear language, presenting information in an orderly manner,
presenting vivid and appealing examples, checking for understanding, and the
frequent restatement of essential principles (Schacter & Thum, 2004; Van de Grift,
2007).
● Quality of subject-matter representation: Quality is influenced here by the richness
(e.g., multiple representations of subject matter), precision, and accuracy of the
subject matter. Strong representations provide opportunities to learn the subject-
matter practices (e.g., problem solving, argumentation) as well as the significant
organizing ideas and procedures of that subject matter (Hill et al., 2008).
● Cognitive activation: A deep understanding of how the various parts of subject
matter are related to and connected with each other requires that teachers can
activate students’ deep thinking by means of questions, appropriate assignments,
classroom discussions, and other pedagogical strategies (Baumert et al., 2010;
Osborne et al., 2015).
● Assessment for learning: Assessment for learning is characterized by a cycle of
communicating explicit assessment criteria, collecting evidence of student under-
standing of subject matter, and providing feedback to students that moves their
learning forward (Black & Wiliam, 1998, 2010).
● Differentiated instruction: Teachers differentiate their teaching to the degree they
adapt subject matter, the explanation of subject matter, students’ learning time,
and the assignments to the differences between students (Keuning et al., 2017;
Tomlinson, 2004).
● Teaching learning strategies and student self-regulation: This dimension is about
teachers (a) explicitly modeling, scaffolding, and explaining learning strategies to
8 C. A. BELL ET AL.

students, which students can use to perform higher level operations (e.g., teaching
heuristics, thinking aloud when solving problems, using checklists) (Carnine, Dixon,
& Silbert, 1998; Slavin, 1996), and (b) encouraging students to self-regulate and
monitor their own learning process in light of the learning goals (Boekaerts,
Pintrich, & Zeidner, 2000; Muijs et al., 2014; Zimmerman, 1990). Teachers who
explicitly model, scaffold, explain strategies, give corrective feedback, and ensure
that children master the material taught contribute greatly to the academic success
of their pupils.

While all of these dimensions of teaching are fundamental to students’ learning and
development, each dimension can be operationalized differently across observation
systems. Further, observation systems vary in the degree to which they capture all
dimensions or target specific dimensions.

Community’s view of teaching and learning


Observation systems also embody a community of practice’s view of high-quality
teaching and learning. A community of practice could be a country, with the view
embodied in that country’s national teaching standards and then operationalized in a
national teacher evaluation system. It could be a group of reform-minded mathematics
and science educators (Sawada et al., 2002), a group of school district administrators, or
a group of researchers who study teaching and educational effectiveness.
Of course, communities’ views vary, emphasizing different aspects of teaching and
learning. Perhaps all communities value cognitive activation, for example, but the
degree to which teachers facilitate classroom discourse and student participation
might vary depending on the country’s cultural views of teaching and learning (Clarke,
Emanuelsson, Jablonka, & Mok, 2006). Communities’ views necessarily reflect cultural
differences in valued practices around the world. In Japan, for example, an observation
system might privilege how effectively a collaboratively developed lesson plan was
implemented while a United States system might privilege how well a lesson plan
supported differentiated instruction.
Communities’ perspectives of teaching quality can be located along a continuum that
moves from a behaviorist view of teaching and learning, to more of a cognitive view, to
more of a sociocultural or situated view. Communities’ perspectives often blur the
boundaries across this continuum, and, depending on how thoroughly a system is
documented, it can be difficult to determine what view(s) underlie a specific system.
Further, it is not helpful to dichotomize or oversimplify views of instruction (Grossman &
McDonald, 2008; Oser & Baeriswyl, 2001) as it can lead to a focus on differences in how
communities define and label teaching rather than a focus on how teaching and
learning are related.

Subject specificity
There is widespread agreement about the importance of the subject-matter specificity
of teaching quality (Seidel & Shavelson, 2007); however, there is less agreement about
how to measure this aspect of teaching practice. Several observation systems have
SCHOOL EFFECTIVENESS AND SCHOOL IMPROVEMENT 9

been designed to evaluate teachers’ subject-specific practices such as the


Mathematical Quality of Instruction (MQI), the Protocol for Language Arts Teaching
Observation (PLATO), the Quality of Science Teaching (QST), and PISA+ in science
education. The MQI system (Hill et al., 2008), for example, focuses on elements such
as the richness of the mathematics, student participation in mathematical reasoning
and meaning making, and the clarity and correctness of the mathematics covered in
class.
On the other hand, there are several systems that are generic, designed to capture key
elements of classroom teaching that are critical for students’ learning across subjects
and classes (e.g., instructional support, representation of subject matter, classroom
climate, and classroom management). Examples of such generic systems are the
Classroom Assessment Scoring System (CLASS) (Pianta, LaParo, & Hamre, 2008), the
Framework for Teaching (FFT) (Danielson, 2013), and the International Comparative
Analysis of Learning and Teaching (ICALT) (Van de Grift, 2007).

Grain size
Related to a subject-specific or more generic focus, there is also the issue of grain
size: how discrete/targeted practices are to be coded (Hill & Grossman, 2013). This
issue has been addressed in observation studies for decades (Brophy & Good, 1986;
Flanders, 1970). In some newer systems (e.g., CLASS and PLATO), consensus has been
reached on a set of core activities (12 for both CLASS and PLATO). This stands in
contrast to earlier systems that included a long list of activities to score (Scheerens,
2014). Thus, the number of domains and elements to be scored is a feature that varies
across systems.
A system’s grain size may be related to the number of scale points (e.g., when
measuring a practice such as the presence of a lesson objective, this might be rated
on a dichotomous scale – present or absent). However, the number of scale points
should not be assumed to be an indicator of score quality (e.g., reliability, variation).
Matters of score quality are best addressed through a compelling argument that relies
on multiple sources of validation evidence (M. Kane, 2006).
Whether to score the whole lesson or segments of the lesson is a related aspect of
grain size. One might imagine observation systems that seek to code smaller grain
sizes, that is, narrower teaching practices, might segment the lesson many times so
that narrow behaviors can be accurately documented throughout a lesson (e.g., MQI).
Alternatively, observation systems using more holistic codes requiring the rater to
judge multiple interrelated practices might segment at larger intervals (e.g., 20 min
or a whole lesson) so that the ratings reflect all of the interrelated practices (e.g.,
ICALT).
The decisions about what grain size to capture are further shaped by the rhythm and
pace of instruction. Activities are not always equally probable in every segment of a
lesson. For example, while instructional purpose may be central to the beginning of a
lesson, it may be less central towards the end of the lesson. The degree of lesson
segmentation necessary for a specific grain size of practice being scored is a decision
made by system designers (Klette, Blikstad-Balas, & Roe, 2017) and is often
undocumented.

Focus on students’ actions


Depending on the focus of the scoring scales, procedures, and exemplars, observation
systems might require raters to pay attention to teachers’ or to students’ words and
actions, or some combination thereof. In some observation systems, there was an almost
exclusive focus on the teachers’ actions; for example, was the objective of the lesson
clearly specified (Brophy & Good, 1986)? Other systems required the rater to scan the
room, focusing on only the students’ actions (Abadzi, 2009; Stallings, 1973).
In systems that focus on student actions, the particular actions that are privileged
range from affective, to cognitive, to behavioral. For example, in the domain measuring
the classroom environment, the FFT asks raters to judge the degree to which students
take pride in their work and show caring and warmth. These are more affective aspects
of the learning environment. In contrast, MQI raters attend to cognitive actions, for
example, students’ provision of mathematical explanation (Hill, 2018). An example of a
behavioral aspect is the indicator in the Diagnostic Classroom Observation system where
raters observe whether students show respect for and value each other's ideas, ques-
tions, and contributions to the lesson (Saginor, 2008).

Scoring procedures
Classroom observation systems differ in their scoring procedures, sampling procedures,
and preparation of raters. The choices made by developers for these three aspects
influence the reliability and validity of the observation scores. We describe each in
turn.

Sampling procedures
Classroom observation systems are developed for one or more of the following pur-
poses: promoting teacher learning, teacher evaluation, or developing research insights.
Given these purposes, lessons are sampled in different ways. The lesson’s subject matter
and type (e.g., an introductory or a practice lesson) may be specified by the system. The
observations can be conducted live or on video, be announced or unannounced, and
they can vary in length.
Sampling of the lesson can be specified even further: for example, whether the
observer should walk around, talk with students or not during an observation, which
part of the lesson should be observed, how many observation cycles should be con-
ducted, and when the observation should be conducted across days, weeks, or the
school year.

Scoring procedures
Observation systems differ in how rating procedures and scoring rules are carried out.
The number of observations and of segments to be scored, the degree to which
lessons are double rated, and whether ratings are checked systematically by master
raters for accuracy are just some of the rating procedures that are relevant to the validity
of the system. Scoring rules concern how ratings are aggregated across units (e.g.,
segments, lessons, teachers) and across raters (e.g., averaging discrepant ratings, taking
the highest rating), as well as rounding rules, various scoring models (e.g., averaging
ratings across segments and lessons to the teacher level, using IRT models to create
teacher scores), and rules regarding dropping ratings.
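As an illustration of such scoring rules, the sketch below (our own; the rule names, rubric, and numbers are hypothetical, and no specific system prescribes exactly these choices) aggregates segment ratings to lesson scores, reconciles a double-scored lesson, and applies a rounding rule at the teacher level.

```python
from statistics import mean

def lesson_score(segment_ratings):
    """Aggregate a lesson's segment ratings into one lesson score."""
    return mean(segment_ratings)

def reconcile(score_a, score_b, rule="average"):
    """Combine two raters' scores for a double-scored lesson."""
    if rule == "average":                 # average discrepant ratings
        return (score_a + score_b) / 2
    if rule == "highest":                 # or take the highest rating
        return max(score_a, score_b)
    raise ValueError(f"unknown rule: {rule}")

def teacher_score(lesson_scores, ndigits=1):
    """Average lesson scores to the teacher level, with a rounding rule."""
    return round(mean(lesson_scores), ndigits)

# Hypothetical ratings on a 1-4 rubric: two lessons of three segments each,
# the first lesson independently double scored.
lesson_1 = reconcile(lesson_score([2, 3, 3]), lesson_score([3, 3, 4]))
lesson_2 = lesson_score([3, 4, 4])
print(teacher_score([lesson_1, lesson_2]))  # 3.3
```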

Preparation of observers
Raters are usually trained using manuals that provide insight into the theoretical basis of
the system, the meaning of the items and scales, and the scoring rules. Training can also
provide raters opportunities to practice, by observing videos and scoring them during
the training. Certification of raters could be required, as well as recertification after a
specific time period. It is also critical that raters are able to create accurate and unbiased
scores across teachers so teachers can improve.

Empirical evidence
The validity of the content of an observation system varies from system to system. As was stated in
the dimensions of teaching section, the assumption is that the dimensions of teaching
included in observation systems reflect teaching quality. A critical criterion for teaching
quality is how much students learn and develop. Thus, it is important to understand the
extent to which the assumed relation between the teaching quality indicators and
student learning has been confirmed empirically. In other words, what is the nature
and quality of the research upon which the indicators are based? This is often consid-
ered empirically by testing the degree to which scores from a particular observation
system, which includes specific dimensions of teaching, are associated with student
outcomes (e.g., Decristan et al., 2015) or statistically derived measures of teaching
quality such as value-added models (e.g., Bell et al., 2012; T.J. Kane & Staiger, 2012).
Despite the desire to use predictive validation studies as the gold standard of empirical
evidence, such studies face many problems such as confounds to causal mechanisms,
inadequate accounting for prior learning and other school factors that shape teaching
and learning (e.g., curriculum), and inappropriate outcome measures, just to name a few.
While predictive evidence is important, M. Kane (2006) argues that we must consider
the validity of any system in the form of a clear validity argument. Such an argument
specifies the inferences necessary to move from observation ratings to inferences about
the sample studied (often the quality of teaching in a given timeframe with a specific
group of students), all the way to the inferences at the domain level (all of a teacher’s
teaching in a given year with all the students they taught). In one application of M. Kane's
framework, US researchers specified empirical evidence that ranges from the quality of the scoring
inference to predictive validity (Bell et al., 2012). Evidence might include details regard-
ing the training and monitoring of raters, inter-rater reliability, specification of sources of
variance, factor analyses, convergent validity evidence, and correlations to measures of
student learning (e.g., value-added models) or development.
There are many types of empirical evidence that can be brought together into a
validity argument. When judging the quality of validation evidence, we often must take
account of the specific score use associated with the argument. For example, if scores
are used to better understand the relationship between teaching and learning, perhaps
evidence from scores created through IRT models would be more precise and compel-
ling, whereas when providing teachers with feedback quickly, we might prefer evidence
from scores created through averaging up to the lesson level because that type of score
will be used in schools. The most compelling empirical evidence will vary with the
specific inferences and score uses under consideration (M. Kane, 2006).

Developmental continuum
Related to the quality of the empirical evidence available for an observation system,
observation systems can be placed on a developmental continuum. It takes time to
develop a strong system and gather information about valid and reliable uses of the
system. Indicators of the stage of development of the system are the year of develop-
ment, whether the system was pilot tested, the number of versions, the last published
version, whether research was done into the valid and reliable use of the system by the
developers, and whether people outside the development team have used or
researched the system.

Reviewing observation systems – four illustrative examples


In order to see how this framework could be used to understand observation
systems, we have selected two general and two subject-specific systems. CLASS and
ICALT, two general systems, are popular in the United States and Europe, respec-
tively. PLATO and TIMSS were developed for different subject matters. Both have
been used internationally, and the latter was developed specifically to apply across
countries. We consider subject-specific systems as well as generic systems because of
the important role subject matter plays in the improvement of teaching. These
systems were also selected, in part, because they vary across the framework’s aspects.
We describe each system using the aspects of the framework to demonstrate how
the framework can be used to investigate any observation system. We then summar-
ize the systems briefly in Table 2. It is important to note that we do not provide a
full-length treatment of the empirical evidence for the four systems. Any fair treat-
ment of a system’s evidence is necessarily lengthy and detailed and, therefore,
beyond the present scope. Instead, we point the reader to representative articles
for each system.
The summaries in Table 2 were created in two stages. The first two authors indepen-
dently reviewed and aligned each observation system to the framework’s dimensions of
teaching. They then discussed and resolved differences in their alignments to produce
the first horizontal panel of Table 2. In order to develop the description of each
observation system, a search for research articles in the global literature using that
system was carried out, and then the summary was written and revised by the author
team.

International Comparative Analysis of Learning and Teaching (ICALT)


ICALT was developed by European inspectorates for the inspection of
primary schools (Van de Grift, 2007). The University of Groningen (RUG) in The
Netherlands continued its development to capture teaching quality. It was developed
from a more cognitive and behavioristic view of teaching and learning and captures a
teacher-centered classroom in which knowledge is transmitted through direct
instruction. ICALT is used for research purposes and as a system for teacher professional development, across a variety of subjects as well as grades.

Table 2. Exemplar observation systems compared by selected framework aspects.

1. Dimensions of teaching*
ICALT: Covers all of the nine dimensions except (e) quality of subject-matter representation.
CLASS K-3: Covers seven of the nine dimensions; CLASS UE covers all nine.
TIMSS: Does not address (a) safe and stimulating classroom climate, (g) assessment for learning, (h) differentiated instruction, or (i) teaching learning strategies and student self-regulation.
PLATO: Covers eight of the nine dimensions.

2. Community's view of teaching and learning
ICALT: Varied country inspectorates' communities and their constituencies' views. Aligned with cognitive and behavioral views.
CLASS (K-3 and UE): US developmental psychological research community developed precursor pre-K instruments. Aligned with sociocultural and cognitive views.
TIMSS: US NCTM view of mathematics teaching and learning. Aligned with sociocultural, cognitive, and behavioral views.
PLATO: US ELA research community views. Aligned with sociocultural and cognitive views.

3. Subject specificity
ICALT: Generic. CLASS: Generic. TIMSS: Secondary mathematics. PLATO: Secondary language arts.

4. Grain size
ICALT: 4-point scale; 7 scales with 35 items.
CLASS: 7-point scale; 3 domains with 10 (K-3) or 11 (UE) dimensions.
TIMSS: 2 types of codes: coverage codes (21) and occurrence codes (48).
PLATO: 4-point scale; 4 domains with 12 indicators.

5. Focus on students' actions
ICALT: Strong focus on teachers' actions. Most ICALT versions also include 3 student engagement indicators.
CLASS: Both students' and teachers' actions. CLASS Upper Elementary also captures a scale for student engagement.
TIMSS: Both students' and teachers' actions.
PLATO: Both students' and teachers' actions.

6. Scoring procedures
ICALT: Scoring based on the observation of a full lesson. Rater training available, but no manual or general guidelines regarding the number of observations or observers for specific use.
CLASS: Raters observe in cycles: 15–20 min of observation and 10 min of rating the dimensions. The number of observation cycles is dependent on the use. Raters must prove inter-rater reliability and obtain certification. A rater manual is available.
TIMSS: Each video reviewed 7 times by coders. Raters certified and monitored.
PLATO: Raters observe in cycles: 15 min of observation and 8–10 min of rating the elements. Raters must prove reliability and obtain certification. Online training facility available.

7. Empirical evidence
The empirical evidence for each system has not been summarized here given space considerations. Please refer to the text for citations that begin to show the nature of the empirical evidence for each system.

8. Developmental continuum
ICALT: Moderate use outside of original developers and researchers.
CLASS: Extensive use outside of original developers and researchers.
TIMSS: Limited use outside of original developers and researchers.
PLATO: Limited to moderate use outside of original developers and researchers.

*For more information on exactly how each observation system was aligned by the authors to the nine teaching dimensions, see Appendix 1.
Of the nine teaching dimensions presented earlier, only the dimension about subject-
matter representation is not covered in ICALT. The 32 items focused on teacher behavior
are divided over six scales: safe and stimulating learning climate, efficient classroom
management, quality of instruction, teaching learning strategies, stimulating learning
environment, and adaptation of teaching to diverse student needs. One additional scale
called student engagement contains three items that focus on student behavior.
Overall, ICALT measures both teacher and student behavior, with an emphasis on
the former. The indicators were derived from reviews of research on the relationship
between teaching characteristics and the academic achievements of pupils.
ICALT is a high-inference system, and scores are based on a whole lesson. All quality
indicators are scored on a 4-point scale ranging from predominantly weak to predomi-
nantly strong. In the system, examples of good practices are provided for each quality
indicator to assist observers in making the judgments. Observers can indicate whether
these good practices were present or not during the lesson and, based on this informa-
tion, they make a quality judgment about the relatively small grain-sized indicators at
the end of the lesson. There are no required scoring rules for computing a score.
Methods for analyzing the data range from computing a standardized scale score to
using IRT.
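To illustrate the simpler end of that range, a standardized scale score could be computed as below. This is a generic sketch under our own assumptions, not ICALT's prescribed procedure, and the norm-group values are invented.

```python
from statistics import mean

def standardized_scale_score(item_ratings, norm_mean, norm_sd):
    """Average the scale's item ratings (here on ICALT's 4-point scale)
    and express the result in standard-deviation units relative to a
    norm group."""
    return (mean(item_ratings) - norm_mean) / norm_sd

# Hypothetical ratings on a 7-item scale and hypothetical norm-group values.
print(round(standardized_scale_score([3, 3, 4, 2, 3, 3, 4], 2.9, 0.4), 2))  # 0.61
```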

Observers can become certified if they are able to rate a lesson in a way similar to
master observers. There is no general manual available for the use of ICALT, and
training opportunities are not offered on a regular basis. However, training is avail-
able upon request, and the system authors can be contacted for information about
the system.
Research into ICALT has mainly occurred in secondary education. Confirmatory factor
analysis supported the six scales (Van de Grift, Van der Wal, & Torenbeek, 2011). Rasch
analyses have been conducted to place all quality indicators on a Rasch scale such that
teachers can be trained in their zone of proximal development (e.g., Van der Lans, Van
de Grift, & Van Veen, 2018). Multilevel analyses showed a relation between ICALT and
students’ academic engagement (Maulana, Helms-Lorenz, & Van de Grift, 2017). The RUG
also still conducts research into reliability aspects of ICALT (e.g., Van der Lans, Van de
Grift, Van Veen, & Fokkens-Bruinsma, 2016) and has just started a new project on
teaching quality from an international perspective, ICALT3.
ICALT has been developed since 2002. The first academic paper was published in
2007, and ICALT thereafter has been further developed into the current version. ICALT
has been used by researchers outside the development team in The Netherlands and
abroad. It is used by practitioners, and the previous observation instrument used by the
Dutch inspectorate was also based on ICALT.

Classroom Assessment Scoring System (CLASS, K-3 and UE)


The first version of CLASS was developed in the US for a study into the quality of pre-
school programs by the National Center for Early Development and Learning and was an
adaptation of the Classroom Observation System (COS) (Teachstone, 2015). Today,
Teachstone offers different versions of CLASS for different age groups: infants, toddlers,
pre-K, K-3, upper elementary (UE), and secondary education. We focus on the K-3 (Pianta,
La Paro, & Hamre, 2008) and UE (Pianta, Hamre, & Mintz, 2012) versions. CLASS captures
interactions between teachers and students, which are seen as the primary mechanism
of student development and learning (Pianta, La Paro, & Hamre, 2008). The focus of both
CLASS systems is on interactions between and among teachers and students blending
socioconstructivist and cognitive viewpoints, with a heavy emphasis on the develop-
ment of students. Both observation systems have been used for research, teacher
development, and evaluation purposes, and both are general tools designed to be used
across subjects.
CLASS decomposes classroom interactions into three major domains: emotional
support, classroom organization, and instructional support. Within each of the domains,
there are multiple dimensions (11 in the UE version and 9 in the K-3 version). Of the nine
teaching dimensions presented earlier, the CLASS UE dimensions cover them all while
the CLASS K-3 dimensions are aligned with seven of the nine (see Table 2). The
dimensions focus on both students’ and teachers’ behaviors. The UE version also
provides a global measure for student engagement. CLASS was based on a review of
constructs assessed in classroom observation systems used in child care and in elemen-
tary school research, literature on effective teaching practices, focus groups, and exten-
sive piloting (Pianta, LaParo, & Hamre, 2008).
CLASS is a high-inference observation system. All CLASS dimensions are scored on a 7-
point scale. For each dimension, several indicators are described that include descriptions
of low (1, 2), mid (3, 4, 5), and high (6, 7) range behavior. Raters observe in cycles: 15–20 min
of observation and 10 min for rating the dimensions. Each cycle is independent of the
others. The number of observation cycles depends on the goals for which CLASS is used.
Raters must obtain CLASS certification before they can conduct observations. To become
certified, observers attend a 2-day CLASS training and have to take a reliability test. Every
subsequent year, raters must recertify. CLASS manuals are available only through Teachstone,
a company that offers both training and certification opportunities upon request.
Both CLASS K-3 and CLASS UE have been used by many researchers in the USA and
abroad. In the manual of the K-3 version, the results of six studies are presented.
However, the research sample in these studies is often broader than K-3, and it is not
always clear whether CLASS K-3 or an earlier version was used. More recent studies used
CLASS K-3 in a K-3 setting and provide significant validity evidence about the system: an
evaluation of the factor-analytic validity (Sandilos, Shervey, DiPerna, Lei, & Cheng, 2016),
measures of internal consistency, and evidence for the reliability of the individual
domains (e.g., Abry, Rimm-Kaufman, Larsen, & Brewer, 2013). Henry (2010) found that
children exposed to high-quality teacher–child interactions as measured with CLASS K-3
scored significantly higher on an assessment of their language and literacy skills. Other
research also provides evidence for an association between CLASS scores and children’s
performance at the end of preschool (Pianta, La Paro, & Hamre, 2008).
CLASS UE was used in the MET study (T.J. Kane, Kerr, & Pianta, 2014), which provides
evidence for the factor structure and the internal consistency of the scale (Pianta, Hamre, &
Mintz, 2012). Correlations between CLASS scores and value-added measures based on Math
and English Language Arts (ELA) state tests were modest in size (T.J. Kane & Staiger, 2012).
CLASS K-3’s development dates back to 2002, while the UE version is newer, first
published in 2012. While versions of the instruments are not denoted with a version
number, the current systems being used do reflect revisions over time. Through exten-
sive research carried out by both developers and non-developer researchers, the K-3 and
UE systems are well along their respective developmental continua.

Third International Mathematics and Science Study (TIMSS) video mathematics system
The “TIMSS Video Study”, which was linked to the Third International Mathematics and
Science Study (TIMSS), produced a mathematics and science observation system whose
goal was “to describe and investigate teaching practices in eighth-grade mathematics in
a variety of countries” (Jacobs et al., 2003, p. 1). The development was led by US
researchers in collaboration with seven countries’ experts and was therefore designed
to be used internationally. It has generally been used for research and improvement
purposes (e.g., Givvin, Hiebert, Jacobs, Hollingsworth, & Gallimore, 2005; Leung, 2005;
TIMSS Video Mathematics Research Group, 2003; World Bank, 2010). We focus on the
mathematics system.
TIMSS Video was originally designed to address the US National Council of Teachers
of Mathematics student mathematics standards, which privilege socioconstructivist
approaches to learning. However, the system integrates views of teaching from more
behavioral and cognitive viewpoints as well. TIMSS follows both students’ and teachers’
actions and discourse and tracks the degree to which these are public (i.e., shared with
the entire classroom) or private (i.e., between a small number of students). The system is
subject specific and targeted toward secondary grades.
The TIMSS codes capture six of the nine teaching dimensions in the framework: safe
and stimulating learning environment, assessment for learning, differentiated instruc-
tion, and teaching learning and student self-regulation are not addressed. TIMSS does
not have scales, but instead codes that are grouped conceptually and describe the
subject matter of mathematics lessons by documenting the lesson’s specific mathema-
tical subject matter, the organization of the lesson, and the instructional processes. Each
lesson is segmented into problem-based interactions of variable length and mutually
exclusive categories, called coverage codes. Twenty-one coverage codes define the
organization of the lesson including whether mathematics is being taught, in what
problem format, and whether or not problems overlap. There are also occurrence
codes that describe the types of activities engaged in by students as well as how
those activities unfold, the resources being used, and the nature of mathematical
practices and interactions emphasized. Codes were developed based on mathematics
education research, and a collaborative process of viewing videos from seven countries
and attempting to capture both similarities and differences across countries (Stigler,
Gallimore, & Hiebert, 2000).
TIMSS is a low-inference system. TIMSS is scored using both a video and a
standardized transcript of the lesson. Using transcripts and the 110-page system,
general and specialized raters make a total of seven passes through a video and its
associated transcript in order to assign categorical codes to the entire lesson. TIMSS
parses teaching into very small pieces, for example, whether there was a mathema-
tical generalization present, how many there were, or how many graphs were drawn
publicly. And yet, alone, the codes do not make judgments about teaching quality.
Analysts bring a teaching quality analytic framework to the codes in order to aggre-
gate the codes in ways that allow judgments about teaching quality to be made (e.g.,
Leung, 2005).
All raters are required to pass a certification test, and lessons are double scored. All
codes are aggregated to the lesson level, and, to our knowledge, no one has attempted
to make systematic claims about teachers, instead focusing on descriptions of teaching
within and across countries. There is no training offered by the developers; however, the
codes are available for free download in technical documentation for the study (Jacobs
et al., 2003).
The original reports of the coding schemes detail the lesson-level reliability of coding
as well as the standard errors for each code; additional reports describe the develop-
ment and application of the codes (e.g., Givvin et al., 2005). Our review did not identify a
published factor analysis. We also found no validation evidence that considered various
scoring models or investigated the quality of ratings beyond general rater agreement
metrics. Ratings on TIMSS have been linked to student achievement at least once (World
Bank, 2010).
Initial codes were developed in 1994 and revised for use in the 1999 TIMSS Video
study. The only other study our search located that used the full codes appears to use
the 1999 codes (World Bank, 2010). While we were able to locate studies that used
TIMSS video capture methodology – for example, two cameras, medium angle camera
shot, or reanalyzed videos (e.g., Hugener et al., 2009; Kunter & Baumert, 2006) – we found
only a single English-language study in which a non-developer researcher used the
TIMSS codes as described in the technical manual (World Bank, 2010). To our knowledge,
there are no additional studies that modify and report on those modified codes, thus
indicating little progression on a developmental continuum.

Protocol for Language Arts Teaching Observation (PLATO)


The PLATO classroom observation system was developed by Grossman, Loeb, Cohen,
and Wyckoff (2013) at Stanford University to capture the quality of English/Language
Arts (ELA) instruction. PLATO privileges socioconstructivist approaches to learning but
combines this with more cognitive approaches as well. PLATO is used both for
research purposes and as a system for teacher professional development. It is a
subject-specific system; however, since its early stages, researchers have tested its
applicability in other disciplines such as mathematics (Cohen, 2015) and science
education (Kloser, 2014).
PLATO measures eight of our nine teaching dimensions, but with a slightly different
framing and indexing. The system is organized around four key instructional domains:
Instructional Scaffolding, Disciplinary Demand, Representing and Use of Content, and
Classroom Environment. Each domain is divided into between two and four elements
and includes a total of 12 elements. In addition to the 12 elements, PLATO captures the
subject matter of instruction (e.g., writing, literature, and/or grammar) as well as the
overall activity structures (whole group, small group, independent work, etc.). While
mainly following teachers’ actions, PLATO also pays attention to student engagement;
for example, when rating strategy use, the rater must attend to whether both the
teacher prompts students to use strategies and students are using those strategies.
PLATO builds on research into practices proven critical for high-quality ELA education
(Grossman et al., 2009).
PLATO is a high-inference system designed for interval coding, using 15-min intervals
for coding all 12 elements, and can be used for real-time observations as well as for
observing classroom videos. Each of the 12 elements is scored on a scale from 1 to 4
based on the evidence for a given element during a 15-min cycle. At the low end, there
is almost no evidence, or little evidence, of instructional practice related to the element
in question, whereas the higher end is characterized by evidence with some weaknesses,
or strong and consistent evidence.
Raters must be certified prior to entering the field. Certification requires that raters
agree with master ratings on 80% of ratings over five lessons. Certification is granted by
the developers after raters have attended a face-to-face or online training, offered upon
request. The coding manual is available for free download at the project’s website
(Grossman, 2018).
Confirmatory factor analysis showed empirical evidence for the theorized scales (Grossman
et al., 2013). Kor (2011) performed a generalizability study to analyze the sources of
variation in PLATO scores from one study. Using those data, Kor concludes that, in order to
achieve an overall reliability greater than .80, one must observe at least five segments per
teacher, as illustrated below. Multilevel analyses indicate a relationship between PLATO dimensions and students’
academic engagement (Cohen & Grossman, 2016; T.J. Kane & Staiger, 2012). Developers have
also documented that relationships between PLATO scores and student achievement are
sensitive to the student achievement test used (Grossman, Cohen, Ronfeldt, & Brown, 2014) as
well as sensitive to the grade level, topic, and student demographic characteristics (Grossman,
Cohen, & Brown, 2014).
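Generalizability findings like Kor's can be read through the Spearman-Brown prophecy formula, which projects the reliability of a score averaged over k parallel segments from the reliability of a single segment. The sketch below is our own illustration; the single-segment reliability of .45 is invented and does not reproduce Kor's variance components.

```python
def spearman_brown(single_segment_rel, k):
    """Projected reliability of a score averaged over k parallel segments."""
    r = single_segment_rel
    return k * r / (1 + (k - 1) * r)

# With an illustrative single-segment reliability of .45, five segments
# are the first point at which the projected reliability exceeds .80,
# consistent in spirit with the five-segments-per-teacher finding above.
for k in range(1, 7):
    print(k, round(spearman_brown(0.45, k), 2))
# 1 0.45, 2 0.62, 3 0.71, 4 0.77, 5 0.8, 6 0.83
```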
First shared publicly in 2009, PLATO has iterated through multiple versions, including
PLATO Prime that was used in the MET study (T.J. Kane & Staiger, 2012). The current
version is 5.0. Research has been carried out by developers and non-developer research-
ers (Dalland, Klette, & Svenkerud, 2018; Grossman et al., 2013; Klette & Blikstad-Balas,
2018). Together, this suggests the system has begun to make progress along its devel-
opmental continuum.
Table 2 summarizes the aspects of the four observation systems evaluated.

Discussion
After defining the observation system concept, we presented a framework for analyzing
observation systems and then applied the framework to four well-known systems. The
framework’s aspects seem to have value as they point to relevant differences between
the four observation systems. If practitioners or researchers plan to use an observation
system, it is important to be aware of how observation systems can differ, and make
informed choices regarding the observation system that will best suit their purposes.
Applying the framework reveals that all but one dimension of teaching (i.e., teaching
learning strategies and student self-regulation) is addressed by at least three observation
systems. All four observation systems address a core group of dimensions, but they do not
all measure the same dimensions of teaching. Only the dimensions involvement/motivation
and cognitive activation were measured by all four instruments. Unmeasured dimensions
are also fundamental aspects of teaching quality; however, there may be defensible
reasons for not including these dimensions in an observation system, depending on
one’s purpose (e.g., Park et al., 2014). The framework’s contribution is not to endorse a
particular system, but, rather, its application can support more deliberate selection and
use of observation systems.
This also applies to the view of teaching and learning category, which forms the basis for
a specific observation system, because there is no “one best” observation system.
Definitions of teaching quality are informed by empirical matters, but they are also
influenced by preferences and values regarding good teaching.
The general or subject-specific nature of a system did not seem to follow any pattern
across the instruments selected. It was clear, though, that both types of systems
produced empirical evidence of a relationship between scores and student achievement, as
well as documented movement along a developmental continuum. This may suggest that
either type can be useful from a predictive perspective (Praetorius et al., 2018) and
that users can learn to work with both types of systems. If this suggestion holds up to
a systematic and rigorous analysis across more than four systems, decisions about the
general or subject-specific nature of a system may need to be driven by users' specific
needs for the system (Hill & Grossman, 2013). For example, subject-specific systems
might be particularly useful when researchers are studying the impact of a professional
development program or providing feedback to teachers in that professional development
program, whereas school-system-wide observations might be better supported using a
general protocol that all administrators can be taught to use relatively inexpensively.
A rigorous review of different types of systems (i.e., general or subject specific; see,
e.g., Charalambous & Praetorius, 2018) would be useful but is beyond the scope of this
article.
Our analysis showed that systems also vary on the grain-size and focus-on-student-actions
aspects: refined 7-point scoring scales versus more restricted scales, varying numbers of
teaching quality indicators (10–35), and evaluations that focus only on the teacher
versus approaches in which students' behavior and input are also measured. When selecting
an observation system from this perspective, one chooses to what extent one aims to
measure the full complexity of teaching quality: how many aspects of teaching, how many
perspectives, and which scoring distinctions? More inclusive and extensive definitions of
teaching quality may increase the cognitive demand on raters and/or require specific
rater background knowledge and training, because raters must then attend to many quality
aspects at once. Even well-trained raters are an important source of variation in
teachers' teaching quality scores (T.J. Kane & Staiger, 2012), and the more complex the
quality definitions, the more likely raters will be a large source of score variation.
Archer et al. (2017) argue that when an observation system includes many teacher
competencies, the feedback can be very fine-grained and improvement efforts well
targeted; however, the quality of the feedback may suffer because teachers are overtaxed.
A balance between the two is needed.
An application of the sixth framework aspect, scoring procedures, suggests there is
wide variation in how developers support valid scores. The choices developers must
make also have trade-offs. For example, working with external expert raters can have the
advantage of a tightly controlled and monitored scoring setting where staff are focused
narrowly on providing accurate and reliable scores. This might be helpful for accurately
identifying the specific dimension of teaching that needs remediation; however, if the
scores will be discussed with the teacher who is trying to improve, a conversation with
an expert rater with whom the teacher does not share a trusting relationship may not
maximize what a teacher can learn from the scores or feedback. Conversely, if the
observation system uses administrators or peers to create the ratings, the existing
relationship these professionals share with the teacher may lead to inaccurate ratings
or ratings that the teacher perceives as less than objective. There are no right or wrong
choices of raters; observation systems must specify the rater (e.g., principal or expert)
and then adjust the procedures and processes pertaining to scoring quality to account
for whatever decision is made.
A final trade-off concerns the empirical evidence needed when selecting an observation
system. As even the four systems we review here demonstrate, the amount and type of
validity evidence for an observation system vary. Certainly, for any purpose, one should
expect evidence that raters can be trained to produce accurate and reliable scores.
Irrespective of the system's purpose, it is unethical, for example, to tell teachers that
they show low levels of formative assessment when they do not. But there may be
trade-offs around the validity of a system that should be accounted for, depending on how
scores will be used. A specific scoring decision, say using item response theory (IRT) to
create scores when a school system needs a low-cost system for formative feedback, may
not be an appropriate choice; however, in a research study designed to evaluate an
intervention, IRT might be an excellent scoring decision.
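As a concrete illustration of what “using IRT to create scores” can involve, ordered observation ratings are often scaled with a model such as the partial credit model; the formulation below is a generic sketch, not a description of any of the four systems reviewed here:

    P(X_{td} = k \mid \theta_t) = \frac{\exp\left(\sum_{j=1}^{k} (\theta_t - \delta_{dj})\right)}{\sum_{m=0}^{K} \exp\left(\sum_{j=1}^{m} (\theta_t - \delta_{dj})\right)}

where \theta_t is teacher t's latent teaching quality, \delta_{dj} is the difficulty of the step from score j − 1 to score j on dimension d, K is the highest score category, and the empty sum for m = 0 is defined as 0. Under such a model, a teacher's score is an estimate of \theta_t rather than a simple average of ratings, which is attractive for research uses but adds cost and analytic complexity for routine formative feedback.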
As another example, perhaps we can tolerate somewhat less accuracy and reliability for
improvement purposes than for research and accountability purposes (e.g., precise
rankings of teachers from the best to the poorest). If a researcher needs a precise estimate of
the impact of a new curriculum, the highest levels of accuracy and reliability are likely
necessary for detecting such impacts. That said, even for improvement purposes, one
should care about the relative differences between score points, which suggests reason-
able levels of attention to raters and rating. Researchers and practitioners should think
carefully about the purposes they have for the scores and consider what validity evidence
can be fashioned into an argument for an appropriate level of score quality for that
purpose. It bears noting, however, that we should not lose sight of the importance of
practitioners and researchers developing a common language and body of evidence about how
to measure teaching and how teaching is related to other valuable outcomes. Such goals
can be pursued with varied levels of validity evidence.
In addition to these trade-offs that should be considered when selecting an observation
system, the framework points to important issues around the research knowledge base.
Writing this article underscored the fact that observation system developers make different
choices across framework aspects, and it is clear that these choices shape the ultimate
nature and character of a system. But it is not clear (or likely) that there is one best set
of decisions. Further, developers generally do not share the reasoning behind their choices.
In some cases, it is challenging to locate all of the framework details in published
documentation, especially regarding rater training, certification, calibration, and the
monitoring of accuracy and reliability. This limits the field's understanding of how particular observation
system aspects shape empirical evidence. To further develop the field’s knowledge of how
to measure and improve teaching, researchers would do well to make these types of
decisions more transparent and more a part of the research enterprise. There are examples
of this (e.g., Seidel, Prenzel, & Kobarg, 2005), but they are rare and they do not yet
constitute a body of scholarship that guides the development of new systems and uses
of protocols within observation systems. Such knowledge would be valuable for the
efficiency of observation systems and for the improvement of teaching.

Disclosure statement
No potential conflict of interest was reported by the authors.

ORCID
Courtney A. Bell http://orcid.org/0000-0001-8743-5573
Adrie Visscher http://orcid.org/0000-0001-8443-9878

References
Abadzi, H. (2009). Instructional time loss in developing countries: Concepts, measurement, and
implications. The World Bank Research Observer, 24(2), 267–290. doi:10.1093/wbro/lkp008
Abry, T., Rimm-Kaufman, S. E., Larsen, R. A., & Brewer, A. J. (2013). The influence of fidelity of
implementation on teacher–student interaction quality in the context of a randomized con-
trolled trial of the Responsive Classroom approach. Journal of School Psychology, 51(4), 437–453.
doi:10.1016/j.jsp.2013.03.001
Archer, J., Cantrell, S., Holtzman, S. L., Joe, J. N., Tocci, C. M., & Wood, J. (2017). Better feedback for
better teaching: A practical guide to improving classroom observations. Retrieved from http://
k12education.gatesfoundation.org/resource/better-feedback-for-better-teaching-a-practical-
guide-to-improving-classroom-observations/
Baumert, J., Kunter, M., Blum, W., Brunner, M., Voss, T., Jordan, A., . . . Tsai, Y.-M. (2010). Teachers’
mathematical knowledge, cognitive activation in the classroom, and student progress. American
Educational Research Journal, 47(1), 133–180. doi:10.3102/0002831209345157
Bell, C. A., Gitomer, D. H., McCaffrey, D. F., Hamre, B. K., Pianta, R. C., & Qi, Y. (2012). An argument
approach to observation protocol validity. Educational Assessment, 17(2–3), 62–87. doi:10.1080/
10627197.2012.715014
Bill & Melinda Gates Foundation. (2018). Measures of effective teaching project: Frequently asked
questions. Retrieved from http://k12education.gatesfoundation.org/blog/measures-of-effective-
teaching-project-faqs/
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education:
Principles, Policy & Practice, 5(1), 7–74. doi:10.1080/0969595980050102
Black, P., & Wiliam, D. (2010). Inside the black box: Raising standards through classroom assess-
ment. Phi Delta Kappan, 92(1), 81–90. doi:10.1177/003172171009200119
Boekaerts, M., Pintrich, P. R., & Zeidner, M. (Eds.). (2000). Handbook of self-regulation. San Diego, CA:
Academic Press.
Brophy, J. E., & Good, T. L. (1986). Teacher behavior and student achievement. In M. C. Wittrock
(Ed.), The handbook of research on teaching (3rd ed., pp. 328–375). New York, NY: Macmillan.
Carnine, D. W., Dixon, R. C., & Silbert, J. (1998). Effective strategies for teaching mathematics. In E. J.
Kame’enui & D. W. Carnine (Eds.), Effective teaching strategies that accommodate diverse learners
(pp. 93–112). Columbus, OH: Merrill.
Charalambous, C. Y., & Praetorius, A.-K. (2018). Studying mathematics instruction through different
lenses: Setting the ground for understanding instructional quality more comprehensively. ZDM
Mathematics Education, 50(3), 355–366. doi:10.1007/s11858-018-0914-8
Clarke, D., Emanuelsson, J., Jablonka, E., & Mok, I. A. C. (Eds.). (2006). Making connections:
Comparing mathematics classrooms around the world. Rotterdam: Sense.
Cohen, J. (2015). Challenges in identifying high leverage practices. Teachers College Record, 117(7), 1–41.
Cohen, J., & Grossman, P. (2016). Respecting complexity and measures of teaching: Keeping
students and schools in focus. Teaching and Teacher Education, 55, 308–317. doi:10.1016/j.
tate.2016.01.017
Dalland, C. P., Klette, K., & Svenkerud, S. (2018). Video studies and the challenge of selecting time
scales. Manuscript submitted for publication.
Danielson, C. (2013). The Framework for Teaching Evaluation Instrument. Princeton, NJ: Danielson
Group.
Decristan, J., Klieme, E., Kunter, M., Hochweber, J., Büttner, G., Fauth, B., . . . Hardy, I. (2015).
Embedded formative assessment and classroom process quality: How do they interact in
promoting science understanding? American Educational Research Journal, 52(6), 1133–1159.
doi:10.3102/0002831215596412
Dobbelaer, M. J., & Visscher, A .J. (2018). The quality of classroom observation systems for measuring
teaching quality in primary education – A systematic review. Manuscript submitted for
publication.
Flanders, N. A. (1970). Analyzing teaching behavior. Boston, MA: Addison Wesley.
Givvin, K. B., Hiebert, J., Jacobs, J. K., Hollingsworth, H., & Gallimore, R. (2005). Are there national
patterns of teaching? Evidence from the TIMSS 1999 video study. Comparative Education Review,
49(3), 311–343. doi:10.1086/430260
Grossman, P. L. (2018). The Protocol for Language Arts Teaching Observation (PLATO). Retrieved from
http://platorubric.stanford.edu/index.html
Grossman, P., Cohen, J., & Brown, L. (2014). Understanding instructional quality in English
Language Arts: Variations in PLATO scores by content and context. In T. J. Kane, K. A. Kerr, &
R. C. Pianta (Eds.), Designing teacher evaluation systems: New guidance from the Measures of
Effective Teaching Project (pp. 303–331). San Francisco, CA: Jossey-Bass.
Grossman, P., Cohen, J., Ronfeldt, M., & Brown, L. (2014). The test matters: The relationship
between classroom observation scores and teacher value added on multiple types of assess-
ment. Educational Researcher, 43(6), 293–303. doi:10.3102/0013189X14544542
Grossman, P., Greenberg, S., Hammerness, K., Cohen, J., Alston, C., & Brown, M. (2009, April).
Development of the Protocol for Language Arts Teaching Observation (PLATO). Paper presented
at the Annual Meeting of the American Educational Research Association, San Diego, CA.
Grossman, P., Loeb, S., Cohen, J., & Wyckoff, J. (2013). Measure for measure: The relationship
between measures of instructional practice in middle school English language arts and
teachers’ value-added scores. American Journal of Education 119(3), 445–470. doi:10.1086/
669901
Grossman, P., & McDonald, M. (2008). Back to the future: Directions for research in teaching and
teacher education. American Educational Research Journal, 45(1), 184–205. doi:10.3102/
0002831207312906
Henry, A. E. (2010). Advantages to and challenges of using ratings of observed teacher-child interac-
tions (Unpublished doctoral dissertation). University of Virginia, Charlottesville, VA.
Hiebert, J., & Grouws, D. A. (2007). The effects of classroom mathematics teaching on students’
learning. In F. K. Lester (Ed.), Second handbook of research on mathematics teaching and learning
(pp. 371–404). Charlotte, NC: Information Age.
Hill, H. C. (2018). Mathematical Quality of Instruction (MQI) domains. Retrieved from https://cepr.
harvard.edu/mqi-domains
Hill, H. C., Blunk, M. L., Charalambous, C. Y., Lewis, J. M., Phelps, G. C., Sleep, L., & Ball, D. L. (2008).
Mathematical knowledge for teaching and the mathematical quality of instruction: An explora-
tory study. Cognition and Instruction, 26(4), 430–511. doi:10.1080/07370000802177235
Hill, H. C., Charalambous, C. Y., & Kraft, M. A. (2012). When rater reliability is not enough: Teacher
observation systems and a case for the generalizability study. Educational Researcher, 41(2), 56–
64. doi:10.3102/0013189X12437203
Hill, H. C., & Grossman, P. (2013). Learning from teacher observations: Challenges and opportu-
nities posed by new teacher evaluation systems. Harvard Educational Review, 83(2), 371–384.
Hugener, I., Pauli, C., Reusser, K., Lipowsky, F., Rakoczy, K., & Klieme, E. (2009). Teaching patterns
and learning quality in Swiss and German mathematics lessons. Learning and Instruction, 19(1),
66–78. doi:10.1016/j.learninstruc.2008.02.001
International Association for the Evaluation of Educational Achievement. (2018). The TIMSS Video
study. Retrieved from http://www.timssvideo.com/the-study/
Jacobs, J., Garnier, H., Gallimore, R., Hollingsworth, H., Givvin, K. B., Rust, K., . . . Stigler, J. W. (2003).
Third International Mathematics and Science Study 1999 Video Study Technical Report: Volume 1:
Mathematics (NCES 2003012). Washington, DC: National Center for Education Statistics.
Joe, J. N., McClellan, C. A., & Holtzman, S. L. (2014). Scoring design decisions: Reliability and the
length and focus of classroom observations. In T. J. Kane, K. A. Kerr, & R. C. Pianta (Eds.),
Designing teacher evaluation systems: New guidance from the Measures of Effective Teaching
Project (pp. 415–443). San Francisco, CA: Jossey-Bass.
Kane, M. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64).
Westport, CT: American Council on Education and Praeger.
Kane, T. J., Kerr, K. A., & Pianta, R. C. (Eds.). (2014). Designing teacher evaluation systems: New
guidance from the Measures of Effective Teaching Project. San Francisco, CA: Jossey-Bass.
Kane, T. J., & Staiger, D. O. (2012). Gathering feedback for teaching: Combining high-quality
observations with student surveys and achievement gains. Retrieved from https://files.eric.ed.
gov/fulltext/ED540962.pdf
Keuning, T., Van Geel, M., Frèrejean, J., Van Merriënboer, J., Dolmans, D., & Visscher, A. J. (2017).
Differentiëren bij rekenen: Een cognitieve taakanalyse van het denken en handelen van
basisschoolleerkrachten [Differentiated instruction for mathematics: A cognitive task analysis of
primary school teachers' reasoning and acting]. Pedagogische Studiën, 94(3), 160–181.
Klette, K., & Blikstad-Balas, M. (2018). Observation manuals as lenses to classroom teaching: Pitfalls
and possibilities. European Educational Research Journal, 17(1), 129–146. doi:10.1177/
1474904117703228
Klette, K., Blikstad-Balas, M., & Roe, A. (2017). Linking instruction and student achievement:
Research design for a new generation of classroom studies. Acta Didactica Norge, 11(3):
Art. 10. doi:10.5617/adno.4729
Kloser, M. (2014). Identifying a core set of science teaching practices: A Delphi expert panel
approach. Journal of Research in Science Teaching, 51(9), 1185–1217. doi:10.1002/tea.21171
Kor, K. (2011, April). The measurement properties of the PLATO rubric. Paper presented at the Annual
Meeting of the American Educational Research Association, New Orleans, LA.
Kraft, M. A., & Blazar, D. (2017). Individualized coaching to improve teacher practice across grades
and subjects: New experimental evidence. Educational Policy, 31(7), 1033–1068. doi:10.1177/
0895904816631099
Kunter, M., & Baumert, J. (2006). Linking TIMSS to research on learning and instruction: A re-
analysis of the German TIMSS and TIMSS video data. In S. J. Howie & T. Plomp (Eds.), Contexts
of learning mathematics and science: Lessons learned from TIMSS (pp. 335–351). London:
Routledge.
Leung, F. K. S. (2005). Some characteristics of East Asian mathematics classrooms based on data
from the TIMSS 1999 video study. Educational Studies in Mathematics, 60(2), 199–215.
doi:10.1007/s10649-005-3835-8
Littell, J. H., Corcoran, J., & Pillai, V. (2008). Systematic reviews and meta-analysis. New York, NY:
Oxford University Press.
Liu, S., Bell, C. A., & Jones, N. D. (2017, March). The validity of classroom observation systems in
research and applied contexts. Paper presented at the annual spring meeting of the Society for
Research on Educational Effectiveness (SREE), Washington, DC.
Marzano, R. J., Marzano, J. S., & Pickering, D. J. (2003). Classroom management that works: Research-
based strategies for every teacher. Alexandria, VA: ASCD.
Maulana, R., Helms-Lorenz, M., & Van de Grift, W. (2017). Validating a model of effective teaching
behaviour of pre-service teachers. Teachers and Teaching: Theory and Practice, 23(4), 471–493.
doi:10.1080/13540602.2016.1211102
Muijs, D., Kyriakides, L., Van der Werf, G., Creemers, B., Timperley, H., & Earl, L. (2014). State of the
art – Teacher effectiveness and professional learning. School Effectiveness and School
Improvement, 25(2), 231–256. doi:10.1080/09243453.2014.885451
Osborne, J., Borko, H., Busch, K. C., Fishman, E., Millon, S., & Tseng, A. (2015, August). Assessing the
quality of classroom discourse in science classrooms. Paper presented at the biennial conference
of the European Association for Research on Learning and Instruction, Limassol, Cyprus.
Oser, F. K., & Baeriswyl, F. J. (2001). Choreographies of teaching: Bridging instruction to learning. In
V. Richardson (Ed.), Handbook on research on teaching (4th ed., pp. 1031–1065). Washington, DC:
American Educational Research Association.
Park, Y. S., Chen, J., & Holtzman, S. L. (2014). Evaluating efforts to minimize rater bias in scoring
classroom observations. In T. J. Kane, K. A. Kerr, & R. C. Pianta (Eds.), Designing teacher evaluation
systems: New guidance from the Measures of Effective Teaching Project (pp. 383–414). San
Francisco, CA: Jossey-Bass.
Pianta, R. C., Hamre, B. K., & Mintz, S. (2012). Classroom Assessment Scoring System (CLASS) manual
upper elementary. Baltimore, MD: Paul H. Brookes.
Pianta, R. C., La Paro, K. M., & Hamre, B. K. (2008). Classroom Assessment Scoring System: Manual K-3.
Baltimore, MD: Paul H. Brookes.
Praetorius, A.-K., & Charalambous, C. Y. (2018). Classroom observation frameworks for studying
instructional quality: Looking back and looking forward. ZDM Mathematics Education, 50(3),
535–553. doi:10.1007/s11858-018-0946-0
Praetorius, A.-K., Klieme, E., Herbert, B., & Pinger, P. (2018). Generic dimensions of teaching quality:
The German framework of Three Basic Dimensions. ZDM Mathematics Education, 50(3), 407–426.
doi:10.1007/s11858-018-0918-4
Praetorius, A.-K., Pauli, C., Reusser, K., Rakoczy, K., & Klieme, E. (2014). One lesson is all you need?
Stability of instructional quality across lessons. Learning and Instruction, 31, 2–12.
doi:10.1016/j.learninstruc.2013.12.002
Rosenshine, B. (1980). How time is spent in elementary classrooms. In C. Denham & A. Lieberman
(Eds.), Time to learn (pp. 107–126). Washington, DC: National Institute of Education.
Saginor, N. (2008). Diagnostic classroom observation: Moving beyond best practice. Thousand Oaks,
CA: Corwin Press.
Sandilos, L. E., Shervey, S. W., DiPerna, J. C., Lei, P., & Cheng, W. (2016). Structural validity of CLASS
K-3 in primary grades: Testing alternative models. School Psychology Quarterly, 32(2), 226–239.
doi:10.1037/spq0000155
Sawada, D., Piburn, M. D., Judson, E., Turley, J., Falconer, K., Benford, R., & Bloom, I. (2002).
Measuring reform practices in science and mathematics classrooms: The Reformed Teaching
Observation Protocol. School Science and Mathematics, 102(6), 245–253. doi:10.1111/j.1949-
8594.2002.tb17883.x
Schacter, J., & Thum, Y. M. (2004). Paying for high- and low-quality teaching. Economics of Education
Review, 23(4), 411–430. doi:10.1016/j.econedurev.2003.08.002
Scheerens, J. (2014). School, teaching, and system effectiveness: Some comments on three state-
of-the-art reviews. School Effectiveness and School Improvement, 25(2), 282–290. doi:10.1080/
09243453.2014.885453
Seidel, T., Prenzel, M., & Kobarg, M. (Eds.). (2005). How to run a video study: Technical report of the
IPN video study. Münster: Waxmann.
Seidel, T., & Shavelson, R. J. (2007). Teaching effectiveness research in the past decade: The role of
theory and research design in disentangling meta-analysis results. Review of Educational
Research, 77(4), 454–499. doi:10.3102/0034654307310317
Slavin, R. E. (1996). Education for all. Lisse: Swets & Zeitlinger.
Stallings, J. A. (1973). Follow through program classroom observation evaluation 1971–72. (Report
No. SRI-URU-7370). Menlo Park, CA: Stanford Research Institute.
Stigler, J. W., Gallimore, R., & Hiebert, J. (2000). Using video surveys to compare classrooms and
teaching across cultures: Examples and lessons from the TIMSS video studies. Educational
Psychologist, 35(2), 87–100. doi:10.1207/S15326985EP3502_3
Teachstone. (2015). Why CLASS? Exploring the promise of the Classroom Assessment Scoring System
(CLASS). Retrieved from http://cdn2.hubspot.net/hubfs/336169/What_Is_CLASS_ebook_Final.
pdf?t=1446
TIMSS Video Mathematics Research Group. (2003). Understanding and improving mathematics
teaching: Highlights from the TIMSS 1999 Video Study. Phi Delta Kappan, 84(10), 768–775.
doi:10.1177/003172170308401011
Tomlinson, C. A. (2004). The Möbius effect: Addressing learner variance in schools. Journal of
Learning Disabilities, 37(6), 516–524. doi:10.1177/00222194040370060601
Tomlinson, C. A., Brimijoin, K., & Narvaez, L. (2008). The differentiated school: Making revolutionary
changes in teaching and learning. Alexandria, VA: ASCD.
Van de Grift, W. (2007). Quality of teaching in four European countries: A review of the literature
and application of an assessment instrument. Educational Research, 49(2), 127–152. doi:10.1080/
00131880701369651
Van de Grift, W., Van der Wal, M., & Torenbeek, M. (2011). Ontwikkeling in de pedagogisch
didactische vaardigheid van leraren in het basisonderwijs [The development of primary school
teachers’ pedagogical and didactical skill]. Pedagogische Studiën, 88(6), 416–432.
Van der Lans, R. M., Van de Grift, W. J. C. M., & Van Veen, K. (2018). Developing an instrument for
teacher feedback: Using the Rasch model to explore teachers’ development of effective teach-
ing strategies and behaviors. The Journal of Experimental Education, 86(2), 247–264. doi:10.1080/
00220973.2016.1268086
Van der Lans, R. M., Van de Grift, W. J. C. M., Van Veen, K., & Fokkens-Bruinsma, M. (2016). Once is
not enough: Establishing reliability criteria for feedback and evaluation decisions based on
classroom observations. Studies in Educational Evaluation, 50, 88–95. doi:10.1016/j.
stueduc.2016.08.001
Van Veen, K., Zwart, R., & Meirink, J. (2012). What makes teacher professional development
effective? A literature review. In M. Kooy & K. van Veen (Eds.), Teacher learning that matters:
International perspectives (pp. 3–21). Abingdon: Routledge.
Wang, M. C., Haertel, G. D., & Walberg, H. J. (1993). Toward a knowledge base for school learning.
Review of Educational Research, 63(3), 249–294. doi:10.3102/00346543063003249
World Bank. (2010). Inside Indonesia’s mathematics classrooms: A TIMSS video study of teaching
practices and student achievement (Report No. 54936–ID). Jakarta: Author.
Zimmerman, B. J. (1990). Self-regulated learning and academic achievement: An overview.
Educational Psychologist, 25(1), 3–17. doi:10.1207/s15326985ep2501_2
Appendix 1. Alignment of four exemplar observation systems to framework dimensions of teaching

For each dimension of teaching, the aligned elements of each observation system (ICALT, CLASS K-3, CLASS UE, TIMSS, PLATO) are listed; a dash (–) indicates that the system contains no element aligned to that dimension.

Safe and stimulating learning environment
ICALT: Shows respect; Maintains relaxed atmosphere; Promotes learners' self-confidence; Fosters mutual respect; Stimulates the building of self-confidence in weaker learners
CLASS K-3: Behavior Management; Positive Climate; Negative Climate; Teacher Sensitivity
CLASS UE: Positive Climate; Negative Climate; Teacher Sensitivity; Behavior Management
TIMSS: –
PLATO: –

Classroom management
ICALT: Ensures the lesson proceeds in an orderly manner; Monitors to ensure learners carry out activities; Provides effective classroom management; Uses the time for learning efficiently; Gives a clear explanation of how to use didactic aids
CLASS K-3: Behavior Management; Productivity
CLASS UE: Behavior Management; Productivity
TIMSS: Time of lesson; Patterns of public/private classroom interaction; Non-mathematics/off topic; Break; Outside interruption
PLATO: Time Management; Behavior Management

Involvement/motivation of students
ICALT: Engages all learners in the lesson; Encourages learners to do their best; Offers activities and work forms that stimulate learners to take an active approach; Learners are fully engaged in the lesson; Learners show that they are interested; Learners take an active approach to learning; Gives interactive instructions
CLASS K-3: Regard for Student Perspectives; Instructional Learning Formats
CLASS UE: Regard for Student Perspectives; Student Engagement
TIMSS: How many students; Required or optional; Length of working-on; Facilitating exploration
PLATO: Connections to personal and/or cultural experiences

Explanation of subject matter
ICALT: Presents and explains the subject matter in a clear manner; Teaches in a well-structured manner; Clearly specifies the lesson aims at the start of the lesson
CLASS K-3: –
CLASS UE: Content Understanding; Instructional Learning Formats
TIMSS: Independent problem*; Answered only problem*; Concurrent problem set-up*; Concurrent problem seat work*; Concurrent problem class work*; Concurrent problem mixed activity*; Interruption type: independent problem*; Interruption type: problem piece*; Non-problem*; Number of concurrent problems*; Goal statement*; Historical background*; Summary of lesson*; Non-math within problems*; Real-life connection/application*
PLATO: Text-Based Instruction; Purpose

Quality of subject-matter representation
ICALT: –
CLASS K-3: Concept Development; Language Modeling
CLASS UE: Content Understanding
TIMSS: Homework*; How many students*; Required or optional*; Problem content*; Real-life connection*; Graphs*; Tables*; Drawings or diagrams*; Physical materials*; Degree of student choice*; Proof/verification/derivation*; Number of different numerical or geometric target results*; Number of different forms of the target results*; Length of working-on*; Facilitating exploration*
PLATO: Text-Based Instruction; Representation of Content; Connections to Prior Academic Knowledge; Models/modeling; Guided Practice; Accommodations for Language Learning

Cognitive activation
ICALT: Asks questions which stimulate learners to reflect; Encourages learners to think critically; Stimulates the application of what has been learned; Lets learners think aloud
CLASS K-3: Concept Development
CLASS UE: Analysis and Inquiry; Instructional Dialogue
TIMSS: Resources used*; Multiple solution methods*; Problem summary*; Types of information or activity in non-problem*; Contextual information* (mathematical concept/theory/idea; activity); Private work assignment*; Private work segments* (organization of students; display information; administrative activity; type of public announcements); Purpose*; Mathematical generalizations*; Labels and symbols*; Links*
PLATO: Intellectual Challenge; Classroom Discourse

Assessment for learning
ICALT: Gives feedback to learners; During the presentation stage, checks whether learners have understood; Evaluates whether the lesson aims have been reached
CLASS K-3: Quality of Feedback
CLASS UE: Quality of Feedback
TIMSS: –
PLATO: Guided Practice

Differentiated instruction
ICALT: Stimulates the building of self-confidence in weaker learners; Offers weaker learners extra study and instruction time; Adjusts instructions to relevant inter-learner differences; Adjusts the processing of subject matter to relevant inter-learner differences
CLASS K-3: Teacher Sensitivity; Quality of Feedback
CLASS UE: Teacher Sensitivity; Instructional Learning Formats
TIMSS: Required or optional; Degree of student choice
PLATO: Accommodations for Language Learning

Teaching learning and student self-regulation
ICALT: Stimulates learners to think about solutions; Lets learners think aloud; Encourages students to think critically; Teaches learners how to simplify complex problems; Stimulates the use of control activities; Teaches learners to check solutions; Asks learners to reflect on practical strategies
CLASS K-3: –
CLASS UE: Analysis and Inquiry
TIMSS: –
PLATO: Explicit Strategy Instruction

Note: As noted in the TIMSS description, the TIMSS codes do not directly convey decisions about quality; therefore, they are difficult to map onto the framework's dimensions of teaching, which specify particular and explicit values about how instruction proceeds. In particular, we found that the TIMSS codes aligned to explanation of subject matter, quality of subject-matter representation, and cognitive activation align to all three of those dimensions. For example, using two representations – a graph and a table – might support high-quality subject-matter explanations; they might also be a quality representation; and they might support cognitive activation. All of the codes marked with an asterisk (*) are aligned to all three aforementioned dimensions of the framework.
