Mixednotes 2024
These lecture notes are made for the course Statistics A where roughly three weeks are about mixed
models. There are many books about mixed models, but some are very thick and unfocused and
not suitable for a shorter course, some are of a very applied nature without giving much mathematical
background (rather, they are guides on how to do analyses in R), and yet others are very mathemati-
cal/technical in spirit. Here, I hope to give a relatively brief presentation, with a concise mathematical
description of the models and the principles of analysis, yet also with an emphasis on applications.
Except for the very brief Chapter 9 on Bayesian analyses, we treat mixed models from a frequentist
point of view, but we get back to mixed models in the final part of the course on Bayesian data analysis.
Data examples play a major role in the first two chapters where I give a motivation for using mixed
models and introduce the formalism for linear mixed models. Chapters 3–5 are more technical
in nature, in particular Chapter 3 on estimation, but the data examples are used for illustration. I am
probably going to use other data examples in the lectures, such that you get introduced to more applica-
tions. I encourage you to focus on the ideas and interpretations, and for that purpose the corresponding
exercises will be indispensable. Actually, many important points will only be discussed in exercises!
We are going to use R for the analyses. Functions in the package lme4 will be the main tools, but other
packages will be introduced, too. The notes include information about fundamental R commands, but
no detailed description of R usage and no output copied from R. The reason for this is that the notes
would otherwise blow up. Instead, there will be an accompanying Markdown document with selected
code and output. Likewise, R material from lectures will be made available. I encourage you to run
the analyses yourself while reading.
I now and then give references to the lecture notes by Hansen and Tolver (2023), used for the 2022/23
version of the course Mathematical Statistics. They may be useful for you, in particular for the (mathe-
matical) treatment of categorical variables, and you can find them on Absalon. Various research papers
are also mentioned along the way. You are not expected, but of course welcome, to dig into them to
get deeper insight.
Although I chose not to use a textbook as curriculum for the mixed model part of Statistics A, let me
mention a few books which you may consult if you would like to read more. The book by Jiang and
Nguyen (2021) is one of the more theoretical books, but it also has nice applications and notes on
software. The books by Galecki and Burzykowski (2013) and West et al. (2015), on the other hand,
put more emphasis on computations and less on theory. They are all available electronically via the Royal
Danish Library. Finally, there are many books on longitudinal data analysis, e.g., Molenberghs and
Verbeke (2005) on discrete data and Fitzmaurice et al. (2013).
This is the fourth version of the notes. There are only small changes compared to the 2023 version,
most notably in Section 8.3. There are most likely still a number of typos left, and I would appreciate
your help with finding them as well as other ”inconveniences”.
Helle Sørensen
November 2024
1 Motivation and data examples
Classical linear normal models and generalized linear models assume independence between observa-
tions and describe how the expected values of the outcome are affected by explanatory variables (cat-
egorical and/or continuous). More specifically, the effect of explanatory variables is modeled through a
linear predictor Xβ where X is the model matrix containing covariates, and β is a vector of unknown
parameters. For linear normal models, the linear predictor describes the vector of expected outcomes,
EY , itself, whereas for generalized linear models the association between EY and Xβ is described
through a link function.
The assumption of independence is convenient, yet often not appropriate. If several measurements are
taken on the same subjects, for example over time or space, or observations come from clusters of
related units, then it is often reasonable to assume that subjects or clusters are independent, but not that
observations from the same subject or cluster are independent. Loosely speaking, we would typically
expect observations from the same subject/cluster to be ”more alike” than observations from different
subjects/clusters.
Incorporation of such dependence is one of the arguments for using mixed models. Mixed models
include both fixed and random effects. Fixed effects are those that are assumed to be the same across
all observations, whereas random effects are effects that are allowed to vary between subjects, and
where this subject-to-subject variation is described in terms of a probability distribution rather than
with deterministic (but unknown) parameters.
To be a bit more specific, consider a setting with several subjects and ni measurements for subject
i, let EYi be the ni -vector of expected values and Xi be the model matrix corresponding to certain
explanatory variables for subject i. The linear (not generalized linear) model — with all explanatory
variables entering as fixed effects — would assume that EYi = Xi β with the same β for all subjects.
On the other hand, a model with all explanatory variables as random effects would assume EYi =
Xi βi with the extra assumption that βi comes from some distribution, e.g., a (multivariate) normal
distribution. This allows the parameter to differ between subjects in a specified stochastic way. In
most cases, some parameters are assumed to be common across subjects, whereas others are allowed
to differ between subjects. This leaves us with models with both fixed and random effects, i.e., mixed
models.
You may ask yourself various questions at this point: First, why do we have to care about dependence?
The short answer is this: To make correct inference! If we treat all observations as independent, without
them being so, then we may overestimate or underestimate (both things can happen) the amount of
uncertainty in our estimates and thus get invalid standard errors, confidence intervals, and p-values.
We will get back to this later, in particular in exercises.
Second, why do we treat the subject-specific parameters as random/stochastic instead of as ordinary
fixed effects, which would also be possible? For example, the above βi could be unknown parameters
in the usual sense if the entries in Xi differ between observations from the same subject. The answer
here is at least three-fold: (a) It gives parsimonious models, i.e., models with fewer parameters to
be estimated, namely variance parameters rather than parameters for each subject; (b) the random
effects approach makes it possible to test certain hypotheses that we would not be able to test in the
purely fixed-effects models; (c) the random-effects approach fits with an interpretation of subjects
being random representatives from some population. We will get back to all three reasons in relation
to the data examples.
Another motivation for using random effects is the wish to describe different sources of variation, i.e.,
distinguish subject-to-subject variation from measurement-to-measurement variation and be able to
quantify their contributions to the total variation. This is sometimes, albeit far from always, a specific
aim of the analysis.
Just as the fixed-effects models, the mixed models come in several variants corresponding to the data
and model types. In particular, linear mixed models (LMMs) assume that EY = Xβ and are usually
also based on the normal distribution (but without the overall independence assumption). Generalized
linear mixed models (GLMMs) are used in situations where a link function is needed to describe
the association between Xβ and EY , in particular for discrete data. Furthermore, non-linear mixed
models (NLMMs) are used in situations where the association between X and EY cannot be described
in terms of a linear predictor.
As is common in material on mixed models, we start with a thorough treatment of LMMs in Chap-
ters 2–7 and later turn to GLMMs in Chapter 8. NLMMs are only briefly mentioned (Section 7.5).
Before we go into the more precise and technical specification of models and the machinery for esti-
mation and inference, let us consider some data examples that will hopefully clarify some of the points
above.
Example 1.1. (Infection in spring barley) This example is taken from Bibby et al. (2006) and the
data come from an experiment carried out at The Royal Veterinary and Agricultural University in
Copenhagen, now part of the University of Copenhagen. One aim of the experiment was to study to which
extent infection of barley seeds with a certain pathogen carries over to plants; another was to study the
variability between varieties in this respect.
The following experiment was carried out: Nine varieties of spring barley were selected as represen-
tatives of varieties. A total of 150 seeds of each variety were inoculated with the pathogen Drechslera
graminea and distributed to three petri dishes with 50 seeds each. After germination the proportion of
infected plants was determined in each dish. Hence, there is a total of 27 observed infection rates. The
data are available in the file barley.txt at Absalon.
The data are illustrated in Figure 1.1, with a stripchart to the left and boxplots to the right. It is perhaps
a bit farfetched to make boxplots of just three observations, but patterns are a bit easier to see, so we
show the plot anyway — just keep in mind that there are only three observations behind each boxplot.
Not surprisingly, we see that some varieties tend to have large infection rates whereas others have
smaller rates.
[Figure 1.1 appears here: infection proportions for the nine varieties, shown as a stripchart (left) and boxplots (right); horizontal axis Variety, vertical axis Proportion.]
Although the data are discrete by nature, we will first treat the proportions as continuous. This is fairly
unproblematic (why?) when the observations are not close to the boundaries (zero and one). For this
dataset there are actually three zero observations, but at least they occur for three different varieties.
Now, you are encouraged to think about the following questions:
– Say that we were interested in a comparison of the nine specific varieties, for example because
we wanted to choose between them. What kind of analysis would you carry out; in particular,
what kind of model would you use?
– With the original aim and the data collection process in mind, why is the situation different than
the one described in the first bullet point?
– If we are just interested in the expected infection rate across varieties, we could perhaps consider
the 27 observations as one sample and simply compute the confidence interval with the standard
formula from the one-sample situation. Why would this not be appropriate?
We return to the example later (and hopefully answer the questions at some point!). As you may have
guessed, we are going to model Variety as a random effect for the analysis. The fixed part of the
model will consist of an intercept only (and people who are stricter than me would probably say
that the model is therefore not a mixed model but a random-effects model). This is a bit unusual in
applications, but makes the distinction between the fixed-effects and the random-effects approach more
clear, so it is a good starting example.
Example 1.2. (Yield of oats) The data for this example date back to 1935 and come from the literature
on design of experiments, which evolved rapidly in the 1920s and 1930s, in particular for agricultural
applications. The experiment is a classical field experiment where the yield of oats is measured under
different conditions. There are data for three varieties, for now denoted V1–V3, and furthermore
nitrogen fertilization has been applied at four different levels. Below, levels are denoted N0–N3, but
they actually correspond to linearly increasing amounts (0.0, 0.2, 0.4 and 0.6 in a certain unit). The
data are available as the dataset oats in the MASS package.
The question under study is how yield varies with variety and nitrogen application.
All twelve combinations of variety and nitrogen level were tested, with six measurements each, so
there is a total of 72 observations in the dataset. Figure 1.2 shows boxplots for each combination. It
seems quite clear that yield increases with increasing nitrogen level, perhaps even linearly, whereas it
is less clear if there are any differences between varieties.
[Figure 1.2 appears here: boxplots of yield for each of the twelve combinations of variety (V1–V3) and nitrogen level (N0–N3); horizontal axis Variety : Nitrogen, vertical axis Yield.]
Now, the description above does not tell the whole story about the experiment, and neither does the
figure. What is missing are the details of how the treatments were laid out in the field. The precise soil
and drainage conditions, say, will inherently vary between plots in the field, and it is important that
treatment effects are not confounded with such non-controllable conditions. Hence, careful planning
of the layout of the experiment is necessary; this is what is referred to as design of experiments.
In the current study, the six replications correspond to six different blocks, which are particular areas
in the experimental field. Each of the 12 combinations of variety and nitrogen were applied in each
block, avoiding any confounding between treatment and block. Each block was furthermore divided
into three plots, which were again divided into four subplots corresponding to the observation level.
It is practically difficult to have different varieties in smaller areas, so variety was allocated at plot
level, such that all subplots in a plot were tested with the same variety. Nitrogen, on the other hand, was
varied on subplot-level, meaning that all nitrogen levels were tested at each plot (random allocation,
of course). The design is probably more easily understood with an illustration, and the picture below shows
what a block could look like:
[Illustration of a single block appears here: three plots side by side (one per variety), each divided into four subplots (one per nitrogen level).]
It is natural to consider blocks and plots as random representatives of (a population of) field areas,
and thus include Block and Plot as random effects in the model. This allows for measurements from
the same block to be (positively) correlated, and measurements from the same plot to be even more
correlated. Or put differently: Each block and each plot are allowed to have their own (random) level.
Variety and Nitrogen should be included as fixed effects since they are chosen with great care and
cannot be considered random, and since the aim of the analysis is to compare varieties and compare
fertilization levels.
Now, think about these two questions:
– Look at the boxplots again, and notice that there is a very large observation for each of the
treatment groups corresponding to variety V3. Do you think they happened independently from
each other or could there be an explanation in the design of the experiment (you may check the
dataset)?
– Consider for a moment an analysis-of-variance (ANOVA) type of model with Variety and Nitrogen
as well as Block and Plot as fixed effects in the usual way. Explain why you cannot readily test
for a variety effect in this model. It turns out that we can indeed test for a variety effect when we
use a suitable mixed-effects model.
Example 1.3. (Language and verbal IQ) The data for this example come from Snijders and Bosker
(2012, Example 4.1 and later). They consist of data on 3758 students in 8th grade, approximately 11
years old, from 211 different schools in the Netherlands. There are 4–34 students per school. We can
think about the data collection process as a multilevel sampling process: First, a bunch of schools were
selected for inclusion in the study; next a bunch of students were selected from each school. The data
are available in the file mlbook2_r.txt at Absalon.
Among the information for each student are the result from a language test (mean 41.4 and sd 8.9) and
the result of a verbal IQ test (mean 0.04 and sd 2.0). We are going to study the association between
these two variables, and the upper left plot in Figure 1.3 shows a scatterplot of the data, with a small
jitter added to every point so the points do not fall on top of each other. There appears to be a linear
trend, and we will indeed use a model describing (the mean of) the language test result as a linear
function of the verbal IQ test result.
The picture is, however, slightly misleading, or it at least does not tell the full story, since the grouping
of students into schools is not visible from the graph. The bottom plot shows data from 16 randomly
selected schools. We see that some schools generally have poor test results, while others generally
have good test results. Compare for example schools 47 and 147. This indicates that schools have
”their own level”, or put differently, that results from students from the same school tend to be correlated.
We also see that estimated slopes differ from school to school, suggesting that the association between
the two variables is stronger for some schools than for others. This is also illustrated in the top right
graph, where linear regression lines have been estimated for all schools, but separately for each school.
Both intercept and slope were allowed to differ between schools. The regression lines for eight random
schools are highlighted. From a (possibly too) simplistic/naive point of view, one could interpret the
slope as the school’s ability to utilize the kids’ potential. This ability varies between schools, and one
topic of interest in the study was exactly this variation between schools.
Take a moment to think about the following:
– Consider the linear normal model with interaction between school and verbal IQ, corresponding
to the top right graph in Figure 1.3. What is the dimension of the corresponding β?
– With the (limited) information you have about the data, would you believe that the researchers
behind the study would be interested in comparing slopes for specific schools? Do you have any
suggestions as to how to proceed?
– Say that there were also covariates at school level (i.e., constant for all students from the same
school), for example socio-economic variables describing the student population at the school,
and say that we would like to test for the effect of such variables on the slope. Explain why that
would not be possible in the linear normal model with fixed school-specific slopes.
[Figure 1.3 appears here; axes: Verbal IQ score (horizontal) and Language test (vertical).]
Figure 1.3: The language data. Top left: Scatter plot with jitter for all 3758 students. Top right:
Regression lines fitted separately for each school with eight schools highlighted (schools 47, 52, 65,
79, 86, 167, 250, 254). Bottom: Data from 16 schools with fitted regression lines.
2 The linear mixed model
In this chapter we describe the linear mixed model in detail, more specifically a version assuming
Gaussian distributions for random effects and residuals, and independent residuals. Generalizations
will be briefly mentioned in Chapter 7. Consider the situation where we have observed outcomes
y = (y1 , . . . , yn ) ∈ Rn and a set of associated explanatory variables (numerical and/or categorical).
Let Y = (Y1 , . . . , Yn ) denote the corresponding random variable with values in Rn , such that y is a
realization of Y .
parameter τ of dimension d. Whenever we want to stress this dependence, we write Στ instead of Σ.
With the extra assumptions, the model becomes
Y | B = b ∼ Nn(Xβ + Zb, σ²In),    B ∼ Nq(0, Στ),
which describes how the conditional mean (but not the conditional variance) changes according to b.
Together, the marginal distribution of B and the conditional distribution of Y given B determine the
simultaneous distribution of (B, Y ). Since the marginal distribution of B as well as the conditional
distribution of Y given B = b are assumed to be Gaussian, and the expectation and variance in the
conditional distribution are assumed to be linear in b and constant in b, respectively, then it follows
from Sørensen (2023, Example 2.5) that the joint distribution of (Y, B) is Gaussian, too, of dimension
n + q. More specifically (verify this yourself!),

\[ \begin{pmatrix} Y \\ B \end{pmatrix} \sim N_{n+q}\left( \begin{pmatrix} X\beta \\ 0 \end{pmatrix}, \begin{pmatrix} Z\Sigma_\tau Z^T + \sigma^2 I_n & Z\Sigma_\tau \\ \Sigma_\tau Z^T & \Sigma_\tau \end{pmatrix} \right). \tag{2.3} \]
We could have derived that (more easily) directly from (2.2), but we need the joint distribution later.
In particular EY = Xβ, exactly as in the linear normal model, and the extension with random effects
changes the distribution of Y only through the variance. On the other hand, the conditional distribution
of Y given B depends on the random effects through the conditional mean only.
Recall that we only observe Y , not B, so what matters for statistical inference is the marginal distribu-
tion of Y , i.e., (2.4). In that sense the construction with random effects can be considered a ”detour”
merely used as a way to define the structure of Var Y via the Z matrix. But this is important in itself!
Furthermore, it is often natural to think about the sampling process in two steps corresponding to the
random effects and the outcomes, respectively. For the language data in Example 1.3, say, we can
naturally think of a sampling process where random schools are sampled first, and random students are
then sampled from these schools — and it is natural that the (conditional) distribution of test results
differs from school to school.
[Footnote: The phrase hierarchical models is used in Bayesian terminology for mixed models and also includes a specification of prior distributions for the parameters. We will talk more about that in the final part of the course.]
Sometimes one is also interested in the predictions of the random effects,
see Section 5.1.
In principle we are done with the model framework now: our interest in the coming sections is on
linear mixed models of the form (2.2), and model specification thus consists of specification of the
model matrices X and Z, and the variance matrix Στ for B. However, we are aware that it is probably
not clear at this point how to set up models in practice, and how to interpret the random-effects part of
the model. Therefore, let us consider the simplest possible example, with a single random factor and
no fixed effects (apart from the intercept) in order to better understand the model construction.
Example 2.1. (Infection in spring barley, continued) Recall Example 1.1 with infection rates for nine
spring barley varieties. For a moment, use double-index notation and let yij and Yij denote the observed
infection rate and the associated random variable for replication j from variety i (i = 1, . . . , 9 and
j = 1, 2, 3).
The usual one-way ANOVA model with fixed effect of Variety would assume that
Yij = βi + εij
where β1 , . . . , β9 are the expected infection rates for the varieties, and εij s are iid. N (0, σ 2 ).
In matrix form, the model is written

\[ Y = X\beta + \varepsilon = \begin{pmatrix} 1_3 & & & \\ & 1_3 & & \\ & & \ddots & \\ & & & 1_3 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_9 \end{pmatrix} + \varepsilon \]
where Y and ε are 27-vectors of random variables and 1_3 is the 3-column of ones, i.e., 1_3 = (1, 1, 1)^T,
such that X is a 27 × 9 matrix. Notice the difference between 1_3 and I_3, say, describing a column of
ones and an identity matrix, respectively. All entries in X outside the one-vectors are zero.
With this model we would be able to compare the varieties with formal t- or F -tests. However, the va-
rieties in this example are representatives from a population of spring barley varieties (a population
which is much larger than you would think!), and we would like to model the variety-to-variety vari-
ation on top of the measurement-to-measurement variation. Therefore, we consider the nine expected
infection rates as random variables, written as β + Bi where β is the population mean and Bi s are the
deviations from this population mean. In other words,

Yij = β + Bi + εij ,    (2.5)

or
\[ Y = 1_{27}\,\beta + \begin{pmatrix} 1_3 & & & \\ & 1_3 & & \\ & & \ddots & \\ & & & 1_3 \end{pmatrix} \begin{pmatrix} B_1 \\ B_2 \\ \vdots \\ B_9 \end{pmatrix} + \varepsilon. \tag{2.6} \]
If we furthermore assume that Bi s are iid. N (0, τ 2 ) such that B ∼ N9 (0, τ 2 I9 ) and that εij are iid.,
ε ∼ N27(0, σ²I27) with B and ε independent, then the model is exactly of the form (2.2). The model
would typically be referred to as the one-way ANOVA model with random variety effect, or the linear
mixed model with constant intercept and random effect of Variety.
It is easy to see that the variance-covariance matrix for Y is the 27×27 block diagonal matrix composed
of nine 3 × 3 blocks, all equal to
\[ \begin{pmatrix} \tau^2 + \sigma^2 & \tau^2 & \tau^2 \\ \tau^2 & \tau^2 + \sigma^2 & \tau^2 \\ \tau^2 & \tau^2 & \tau^2 + \sigma^2 \end{pmatrix}. \tag{2.7} \]

This shows that the marginal variance of each Yij is τ² + σ² and that the correlation between two
observations from the same variety is τ²/(τ² + σ²). In particular, the model only allows for non-negative
correlations. Observations from different varieties are assumed independent.
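A small numerical illustration of (2.7), with made-up values of τ² and σ² (just a sketch, not part of any analysis):

```r
# Build one 3 x 3 block of Var Y and check the implied correlation (values are made up):
tau2 <- 0.04; sigma2 <- 0.025
block <- tau2 * matrix(1, 3, 3) + sigma2 * diag(3)  # tau^2 + sigma^2 on the diagonal, tau^2 off it
block
cov2cor(block)              # off-diagonal entries equal tau2 / (tau2 + sigma2)
tau2 / (tau2 + sigma2)
```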
Now, let us talk about how to set up models more generally. Mathematically speaking, the question
is about defining appropriate model matrices X and Z for the fixed and random effects, respectively.
Luckily we do not have to set up these matrices ourselves, since the statistical software we use — at
least any software that I know of — constructs the model matrices from a high-level specification like
model formulas in R.
Our task is to decide which variables to include in the model, and how. Most of these considera-
tions are exactly the same as for (generalized) linear models, but with mixed models there is the extra
consideration whether terms should be included as fixed or random. Model building is typically an
iterative process where different models are fitted and validated until we are happy with the model fit.
For example, extra covariates may be included and variables may be transformed during the process.
Initially the following questions must be thought through:
– Which variables are available and relevant to use as explanatory variables? Are the variables
categorical (factors) or numerical (covariates)? For the latter question, we can sometimes even
choose if a variable should be treated as the one or the other.
– Is it relevant to include any interaction terms (between a factor on the one side and a factor or
numerical variable on the other side)? Any higher-order interactions?
– For parameters varying according to a factor (possibly a product factor): Should parameters be
fixed parameters or random variables? The question also applies to the intercept parameter.
– Does the model take into account the correlations between measurements in an appropriate way?
Regarding the third question, whether parameters should be considered fixed or random: Factors cor-
responding to things like Person, Block (a particular area in the field), Plant, etc. are most often (but
not always) considered random since those entities are thought of as random representatives, whereas
factors describing things like Treatment, Age group, Socio-economic group, etc. are most often con-
sidered as fixed. A factor describing Variety would often be used as a fixed factor, but in Example 2.1
it was natural to use it as a random factor.
Notice the word relevant in the second and third point. By this we mean relevant from a subject-area
(e.g., biological or social science) point of view, not from a statistical point of view, so it is crucial to get
input from people with field-specific knowledge, often the people who collected the data or carried out
the experiment. In particular, we do not include terms in the model just because it is mathematically
possible, and we do not consider the statistical significance of the variables at this point.
Suppose we have decided on the explanatory variables in the model; main effects as well as interactions,
and whether they should enter into the fixed or random part of the model. Each explanatory variable
— numerical or categorical or an interaction — generates a subspace of Rn , and is thus associated to a
(non-unique) full-rank matrix that spans the subspace. These matrices can be combined/manipulated in
an appropriate way to form model matrices X and Z. For the fixed part (that is, model matrix X) this
is discussed in detail in Hansen and Tolver (2023), and we refer to these lecture notes for an elaborate
discussion of model matrices for sum spaces. The situation is somewhat different for Z as will be
discussed below.
Let us remind ourselves what subspaces and associated matrices look like for the individual model
terms. First, a numerical explanatory variable, also called a covariate, x ∈ Rn gives rise to the subspace
Lx = {βx | β ∈ R} of dimension 1, and the associated matrix is just x itself, interpreted as an n × 1
matrix.
Second, consider a categorical variable, or a factor as it is often called, and recall that we can think of
it as a map f : {1, . . . , n} → F where F is the set of levels of the variable. Assume for simplicity
that F only includes factor levels which in fact occur in the data (such that f is surjective). In other
words, f assigns a group level to each observation and therefore defines a partition of the set of units
{1, . . . , n} into |F | groups. In the matrix specifications below we implicitly assume that groups are
named {1, . . . , |F |}, but for practical work it is much better to keep the original and meaningful names
for the groups. The factor f gives rise to the subspace
Lf = {y ∈ Rn | yi = yj whenever f(i) = f(j)},
which has dimension |F|, and the associated matrix can be taken to be the n × |F| indicator matrix of group membership.
If f1 and f2 are factors, and f1 is finer than f2 (equivalent: f2 is coarser than f1 ), meaning that f1
corresponds to a finer grouping of {1, . . . , n} than f2 , then Lf2 ⊆ Lf1 .
There are two ”extreme” factors. One is the degenerate or constant factor with f (i) = 1 for all
i = 1, . . . , n. It corresponds to the very boring grouping where all observations are allocated to the
same group, and it is coarser than all other factors. The other one is the identity factor with f (i) = i
for all i. It is finer than all other factors and corresponds to each observation being its own group. We
will sometimes use the notation 1 and I for the constant and the identity factor, respectively. Then,
the above nestedness considerations imply that L1 ⊆ Lf ⊆ LI for any factor f .
An interaction effect between two factors f1 and f2 is described through the product factor f1 × f2, defined
by (f1 × f2)(i) = (f1(i), f2(i)). Notice that it is not uncommon that some combinations of factor levels
for f1 and f2 are absent from the dataset, implying that (f1 × f2)({1, . . . , n}) ≠ F1 × F2. This may cause
a bit of trouble with the parameterization, but is otherwise not problematic.
For interaction effects between covariates and factors it is sometimes convenient to think in terms of ef-
fect pairs. An effect pair (x, f ) consists of a covariate x and a factor f . The effect matrix corresponding
to the effect pair (x, f) is the n × |F| matrix C with entries

\[ C_{ij} = \begin{cases} x_i, & f(i) = j, \\ 0, & f(i) \neq j. \end{cases} \]
If f is the degenerate/constant factor, then the effect pair corresponds to a constant effect of x over all
i, i.e., a common intercept or slope. On the other hand, (x, f ) introduces potential group differences
if |F | > 1. For x = (1, 1, . . . , 1) and a non-degenerate factor f , the effect pair (x, f ) allows for the
intercept to vary between groups, while for a non-constant x the effect pair (x, f ) describes that the
slope for x can differ between groups.
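As a small illustration of effect matrices, the following sketch uses model.matrix on a made-up factor and covariate; the toy data are mine, not from any of the examples:

```r
# Toy factor f with 3 levels and covariate x, n = 5 observations:
f <- factor(c("a", "a", "b", "b", "c"))
x <- c(1.2, 0.7, 2.1, 1.5, 0.3)
model.matrix(~ 0 + f)     # effect matrix for the pair (1_n, f): group indicators
model.matrix(~ 0 + f:x)   # effect matrix for (x, f): x_i in column f(i), zeroes elsewhere
```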
As described above, each term included in the model defines a matrix that spans the associated subspace
of Rn , and those matrices must be glued together in order to form X and Z. Matrices corresponding to
effect pairs with the constant factor are part of the X matrix, while matrices corresponding to group-
varying parameters enter into X or Z depending on whether the parameters should be considered fixed
or random.
There are some important differences between X and Z. We usually insist that X has full rank, so it is
often necessary to remove columns after we have pasted the individual matrices together. Our software
takes care of that; this is exactly the same problem as for the usual linear normal models and GLMs, so
we do not go into details, but you can consult Hansen and Tolver (2023) if needed. As opposed to this,
all columns are retained in Z. To be specific, assume that
Y = Xβ + Z1 B1 + Z2 B2 + ε with two random-effects terms. From an interpretational point of view
it is natural to think of the model in this way, separating the contributions, but the same model can be
written in the usual form if we define a new random variable B and a new model matrix as

\[ B = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}, \qquad Z = \begin{pmatrix} Z_1 & Z_2 \end{pmatrix}. \]
Notice that the two random terms could stem from the same factor, but with different associated co-
variates (for example the intercept and a non-constant numerical variable) or from different factors,
nested or not. As an example, consider the oats data from Example 1.2. Here, it is natural to have
random intercepts varying both at the block level (six levels) and plot level (18 levels). If the block and
plot factors were included as fixed effects rather than random effects, then the block factor would be
redundant since differences between blocks would also be incorporated through the plot factor. This is
not the case for random effects.
Another important difference between X and Z is that X can be replaced by any other full-rank matrix
X 0 that spans the same subspace of Rn without changing the model, only the interpretation of the
parameters. The same is generally not true for the Z matrix, since that would change Σ, the variance-
covariance matrix of B, and thereby change the independence properties in the model.
As mentioned, and luckily, we do not have to do the matrix manipulations ourselves. Instead we specify
terms or effect pairs in the software’s syntax, including a specification of whether the effect is fixed or
random. In R, the syntax for the fixed-effects specification of (x, f ) is f:x or x:f (does not matter),
while the random-effects specification is (x|f), including the parentheses. Notice that (x|f) also
includes a random intercept; if we do not want that, we must write (x-1|f). A random intercept
in itself is specified as (1|f). If f is the constant factor, implying parameters that are constant for
all observations, then we just write x as usual. As always: if you are the least bit in doubt whether you
managed to specify the model you were aiming at, then make sure to check that the estimates correspond
to the model you intended, or check that the model matrices built by R are as you expect them to be.
As an example, the one-way ANOVA from Example 2.1 with intercept varying according to variety
can be fitted with the command
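A sketch of what that command could look like; the column names prop and variety are assumptions about how the data in barley.txt are organized, so adapt them as needed:

```r
library(lme4)
barley <- read.table("barley.txt", header = TRUE)
# Fixed intercept plus a random intercept for each variety, cf. (2.5) and (2.6):
fit <- lmer(prop ~ 1 + (1 | variety), data = barley)
summary(fit)
```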
The term 1 is optional since an intercept parameter, corresponding to β in (2.5) and (2.6), is automat-
ically included as usual. For the model with β = 0 (implying EYi = 0), we should replace 1 by 0 in
the specification of fixed effects. Such a model is sometimes — but only rarely, and definitely not in
this example — of interest.
One thing is still missing in the model specification: the structure of Σ. Elements in B corresponding to
different factors are almost always assumed to be independent, whereas random effects corresponding
to the same factor, but different covariates, are often allowed to be dependent. For example, in a model
with both random intercept and random slope for different subjects, the subject-specific intercepts and
slopes are usually allowed to be correlated, see Example 2.2 right below.
Example 2.2. (Language and verbal IQ, continued) Recall the language test data from Example 1.3.
We are aiming at a model with school-specific intercepts and slopes, and we want to treat schools as
random. If we use double-index notation, we assume that

Yij = (β0 + Bi,0) + (β1 + Bi,1)xij + εij ,

where xij and Yij are the verbal IQ score and the language test result, respectively, for the jth student
from school i, β0 + Bi,0 is the intercept for school i, and β1 + Bi,1 is the slope for school i. The
random effects (Bi,0 , Bi,1 )T are iid. N2 (0, Σ), where Σ is a 2 × 2 matrix. Notice that iid. here means
that the pairs for different schools are independent and identically distributed, not that Bi,0 and Bi,1 are
independent with the same distribution. The most general model for Σ allows for correlation between
Bi,0 and Bi,1 , such that Σ is a general positive semi-definite 2 × 2 matrix (with three parameters). It
may also be assumed that Bi,0 and Bi,1 are independent; then Σ is a 2 × 2 diagonal matrix (with two
parameters). Since EBi,0 = EBi,1 = 0, the parameters β0 and β1 are interpreted as the expected
intercept and expected slope for a random school.
In terms of effect pairs, we have included (1n , School) and (Verbal IQ, School), to allow for random
school-specific intercepts and slopes, respectively. Here, 1n means the numerical variable constantly
equal to 1. Furthermore, the non-zero average slope β1 formally corresponds to the effect pair (Verbal
IQ, 1) where 1 is the constant factor. Finally, the model allows for a non-zero β0 (the average
intercept).
Intercepts (both fixed and random) are included by default, so the lmer code for the model is
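A sketch of that call; schoolnr and IQ_verb are the names used below, whereas langPOST as the name of the language test variable is an assumption (check the columns of mlbook2_r.txt):

```r
library(lme4)
mlbook <- read.table("mlbook2_r.txt", header = TRUE)
# Random intercept and random slope per school, allowed to be correlated:
fit_lang <- lmer(langPOST ~ IQ_verb + (IQ_verb | schoolnr), data = mlbook)
summary(fit_lang)
```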
This specification allows for dependence between intercept and slope, and (IQ_verb|schoolnr)
must be replaced by (IQ_verb||schoolnr), with two vertical bars, under the assumption of
independence.
The model matrices X and Z are composed of model matrices for each school. For school i with ni
students, we write
\[ \begin{pmatrix} Y_{i,1} \\ \vdots \\ Y_{i,n_i} \end{pmatrix} = \begin{pmatrix} 1 & x_{i,1} \\ \vdots & \vdots \\ 1 & x_{i,n_i} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \begin{pmatrix} 1 & x_{i,1} \\ \vdots & \vdots \\ 1 & x_{i,n_i} \end{pmatrix} \begin{pmatrix} B_{i,0} \\ B_{i,1} \end{pmatrix} + \begin{pmatrix} \varepsilon_{i,1} \\ \vdots \\ \varepsilon_{i,n_i} \end{pmatrix} = X_i \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + Z_i \begin{pmatrix} B_{i,0} \\ B_{i,1} \end{pmatrix} + \begin{pmatrix} \varepsilon_{i,1} \\ \vdots \\ \varepsilon_{i,n_i} \end{pmatrix}. \]
For the model matrices for the complete dataset, the Xi matrices are simply stacked in order to get the
n × 2 matrix X, where n = Σi ni is the total number of observations. If I is the number of schools,
then Z is the n × 2I block matrix with Zi s as blocks and zeroes outside the blocks. You are encouraged
to examine the matrices yourself! They can be extracted with the following commands:
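For example (a sketch, assuming fit_lang is the fitted model object from the call above):

```r
X <- getME(fit_lang, "X")    # fixed-effects model matrix, n x 2
Z <- getME(fit_lang, "Z")    # random-effects model matrix, sparse, n x 2I
dim(X); dim(Z)
head(model.matrix(fit_lang)) # another way to inspect X
```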
The model will often be referred to as the linear mixed model with random intercepts and slopes
(implicitly, for each school). Notice the difference from a linear normal model (not mixed) with school-
specific intercept and slope. In that model, the intercepts and slopes are deterministic and allowed to
vary completely freely of each other and be arbitrarily large, while the distributional assumptions for
B in the linear mixed model impose some degree of shrinkage or regularization. We sometimes use the
phrase that we borrow strength between schools.
For complex datasets with several factors in play, and in particular if some of them are nested, then
we recommend to make a factor diagram in order to get an overview of the data structure. A factor
diagram is a graph with the relevant factors, including relevant product factors, as vertices/nodes and
an arrow from factor f1 to factor f2 (which could be product factors) if f1 is a finer factor than f2 . If
f1 is finer than f2 and f2 is finer than f3 , then f1 is also finer than f3 . In that case we do not make
an arrow from f1 to f3 , but let the arrow ”go through” f2 . The identity factor and the constant factor
are often included in the diagram. Recall that the identity factor and the constant factor are finer and
coarser than all other factors, respectively, so there are always arrows ”out of” the identity factor and
arrows ”into” the constant factor. It is common to put dim(Lf ), i.e., the number of factor levels for f,
as a superscript for factor f, and random factors are sometimes put in brackets, such that the model can
be identified from the diagram. Numerical explanatory variables may be included as well.

If a set of factors comprises a so-called orthogonal design, then the factor diagram also allows for easy
computation of certain degrees of freedom and other quantities. We are not going to dwell on this
since R, or whatever software we use, will make those computations for us. We see the factor diagram
merely as a way to get an overview of the data structure, which is useful in more complicated settings
with nested factors. The R package LabApplStat (Markussen, 2022) has functionalities for making
factor diagrams, but I highly recommend that you (first) make it by hand since it forces you to think
carefully about and understand the data structure.
Example 2.3. (Infection in spring barley, continued) Recall the data from Example 1.1, in particular
that there are three petri dishes for each of nine barley varieties, with 50 seeds per dish. Observations
are available at dish level: there is one observation per dish, such that the identity factor corresponds
to Dish. The factor diagram is really boring since there is only one variable of interest:
[Factor diagram: I^27 → Variety^9 → 1^1.]
For the purpose of illustration, consider the following hypothetical situation. Assume that there was a
measurement for each seed, for example that all seeds germinated and that the length of each seedling
was measured after a couple of days (not realistic). Then, there would be a total of 1350 measurements.
Furthermore, assume that there were two different types of varieties (five varieties of one type, four of
the other one), and that the three dishes corresponding to each variety were treated with three different
light conditions. In this situation the relevant factors would be Variety, Dish, Type, Light, and possibly
also the product between Light and Type, and the factor diagram would be this one:
[Factor diagram with nodes I^1350, Dish^27, Variety^9, Type^2, the Light and Light×Type factors, and the constant factor 1^1; the arrows are described below.]
There is an arrow from Dish to Variety because only one variety is present at each dish, and an arrow
from Dish to Light×Type because light treatment and variety type are also constant at each dish.
Popularly phrased: If we know which dish an observation comes from, then we also know the variety,
the variety type and the light treatment for the observation.
The set-up with Type and Light and 1350 observations was hypothetical and we will not return to it
later.
As a second example, you are encouraged to make a factor diagram for the oats data from Example 1.2.
The diagram should consist of the factors corresponding to Variety, Nitrogen level, and their interaction,
Plot, Block, the identity factor and the constant factor. Remember to think carefully about nestedness
between factors.
3 Estimation in the linear mixed model
3.1 ML estimation
In principle the estimation problem is easy: Just write up the log-likelihood and maximize it! The first
part of that ”program” is easy: Except for a constant, the likelihood function, i.e., the density of Y at
the observed data y if the parameters are (β, θ, σ 2 ), is
\[ L(\beta, \sigma^2, \theta) \propto \frac{1}{(\sigma^2)^{n/2}\,|V_\theta|^{1/2}} \exp\left( -\frac{1}{2\sigma^2} (y - X\beta)^T V_\theta^{-1} (y - X\beta) \right). \tag{3.1} \]
This is not a particularly nice expression in (β, θ, σ 2 ), in particular because θ enters in a highly non-
linear way, even after taking the logarithm. However, we can take advantage of the machinery from
linear normal models and break up the large maximization problem into easier ones.
The nice thing about the new parameterization is that the model for Y now has the form of a linear
normal model: Y ∼ Nn (Xβ, σ 2 Vθ ) — not with iid. residuals as usual, but with a dependence structure
given by Vθ . For fixed θ there is a closed-form expression for the MLE for (β, σ 2 ), and since θ is
typically low-dimensional, it is therefore natural to find the MLE for the complete parameter (β, σ 2 , θ)
with a profile algorithm. It goes like this:
1. Consider θ as fixed, and compute the MLE for (β, σ²) in the corresponding linear normal model:

\[ \hat\beta_\theta = (X^T V_\theta^{-1} X)^{-1} X^T V_\theta^{-1} Y \tag{3.2} \]

and

\[ \hat\sigma^2_\theta = \frac{1}{n} (Y - X\hat\beta_\theta)^T V_\theta^{-1} (Y - X\hat\beta_\theta). \tag{3.3} \]

The pair (β̂θ , σ̂θ²) would be the MLE for (β, σ²) if θ were known.
2. Insert the estimates into the likelihood in order to get the profile likelihood for θ, i.e., consider
the function

\[ \theta \mapsto L(\hat\beta_\theta, \hat\sigma^2_\theta, \theta), \]

and maximize it wrt. θ. Denote the solution θ̂, which is then the MLE for θ.

3. Insert θ̂ into the expressions for β̂θ and σ̂θ² and get the MLE for (β, σ²):

\[ \hat\beta = \hat\beta_{\hat\theta}, \qquad \hat\sigma^2 = \hat\sigma^2_{\hat\theta}. \]
Regarding step 1: You may only know the formulas for the situation with iid. errors where Vθ = In
such that β̂ = (X^T X)^{-1} X^T Y. In the general case, the transformation

\[ Y \mapsto X\hat\beta_\theta = X (X^T V_\theta^{-1} X)^{-1} X^T V_\theta^{-1} Y \]

is the orthogonal projection onto the space spanned by X when the inner product is specified by Vθ^{-1};
see Lauritzen (2023, end of Section 4.2.2), and you may also consult the Wikipedia pages for generalized
least squares and Projection (linear algebra).
Regarding step 2: Except for simple cases, there is no closed-form expression for θ̂ so the optimization
must be carried out with numerical methods. Luckily, the dimension of θ is typically relatively small,
so this is most often not too difficult. As usual, the logarithm of the likelihood has better numerical
properties, so the log-likelihood is used instead of the likelihood itself for the numerics.
Normally, we will obviously let R, or more specifically lmer, do the computations for us, but let us
do it ”manually” for the spring barley data for illustration.
Example 3.1. (Infection in spring barley, continued) In Example 2.1 we wrote a few different repre-
sentations of the linear mixed model with random effect of variety (and constant mean), e.g.,
Yij = β + Bi + εij ,
using double-index notation. Here, Bi s are iid. N (0, τ 2 ) and εij s are iid. N (0, σ 2 ). The parameters
consist of β (the constant mean), τ or τ² representing the variety-to-variety variation, and σ or σ²
representing the within-variety variation.
[Figure 3.1 appears here; panel titles: profile log-likelihood (left), "ML, mean = 0.187" (middle), "REML, mean = 0.201" (right).]
Figure 3.1: Profile log-likelihood for the barley data with a dashed line located at the MLE (left);
ML and REML estimates of τ for 2000 simulated datasets with a dashed line at the true value 0.2104
(middle and right).
With the new parameterization, define θ = τ²/σ², and
consider parameters (β, σ 2 , θ).
The left graph in Figure 3.1 shows the profile-log-likelihood for θ. Numerical optimization of this
function yields the MLE of θ, which turns out to be θ̂ = 1.796, and the dashed vertical line is placed
at θ̂. The associated estimates for β and σ², i.e., the ML estimates, turn out to be β̂ = 0.334 and
σ̂ 2 = 0.0246. Finally, we get the ML estimates τ̂ 2 = 0.0443 and τ̂ = 0.210 by back-transformation.
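For concreteness, here is a sketch of what such a manual computation could look like. The column names prop and variety are assumptions (as in the sketch in Chapter 2), and additive constants in the log-likelihood are dropped:

```r
y <- barley$prop
n <- length(y)
X <- matrix(1, n, 1)                                   # intercept-only fixed part
J <- outer(barley$variety, barley$variety, "==") * 1   # 1 if same variety, 0 otherwise

# Profile log-likelihood for theta = tau^2/sigma^2, cf. (3.1)-(3.3):
profloglik <- function(theta) {
  V <- diag(n) + theta * J
  Vinv <- solve(V)
  beta <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y)   # (3.2)
  r <- y - X %*% beta
  sigma2 <- as.numeric(t(r) %*% Vinv %*% r) / n             # (3.3)
  -0.5 * (n * log(sigma2) + as.numeric(determinant(V)$modulus) + n)
}
opt <- optimize(profloglik, c(0.01, 10), maximum = TRUE)
opt$maximum                                            # MLE of theta, close to 1.796
```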
Normally, we would not make the computations this way, but just use lmer directly. The ML estimates
are obtained with the following command (try it yourself):
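A sketch, reusing the assumed data frame and variable names from the Chapter 2 sketch:

```r
fit_ML <- lmer(prop ~ 1 + (1 | variety), data = barley, REML = FALSE)
summary(fit_ML)
```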
As will be explained in the next section, it is necessary to include the option REML=FALSE to get the
ML estimates. Otherwise, the REML principle is used instead of ML.
Notice that β̂ = ȳ, i.e., the average of the 27 observed values. Actually, β̂θ = ȳ for all θ. This is not
so surprising: No matter the degree of correlation between observations from the same variety, how
could we estimate the overall mean by anything other than the average for such simple and balanced
data? In more complicated models or for unbalanced data, this would no longer be true.
There are generally no closed-form distributional results for the MLE in linear mixed models, even
in the Gaussian case. This is because θ may enter into the model (and likelihood) in a relatively
unpleasant, non-linear way. Under fairly general assumptions (and for fixed p) however, ML estimators
in the Gaussian LMM are consistent and asymptotically normal with the inverse Fisher information
as asymptotic variance, so we can rely on asymptotic distributions for inference. This is one of the
advantages of using the ML principle for estimation.
It is important to realize, however, that a ”large sample size” not only means that n is ”large”. For
entries in τ determining Στ , the variance-covariance matrix of the random effects, it is necessary that
there are ”many” different levels of the factors associated with the random effects. In the example
with spring barley, for example, it is the number of varieties that controls the amount of information
regarding the variety-to-variety standard deviation τ , and there must be ”many” different varieties in
order to trust the asymptotic results regarding that parameter. Just taking more and more observations
from each variety gives more information about β and σ, but not about τ . As usual, it is impossible to
say when a sample is large enough for the asymptotic results to be reliable, but we can use simulated
data to examine the properties of the estimators.
So, ML estimators in linear mixed models are in general asymptotically well-behaved, but what hap-
pens for smaller samples? Let us consider the spring barley data again. We simulated 2000 datasets
from the linear mixed model with parameters equal to the ML estimates from Example 3.1 just above,
still nine groups and three observations per group. For each dataset we computed the ML estimates,
and thus got 2000 simulated values of τ̂ . A histogram of the estimates is shown in the middle graph in
Figure 3.1. The dashed line is at 0.2104, which was the true value of τ in the data generating model.
It appears that the distribution is shifted slightly to the left compared to the true value, and this is sup-
ported by computation of the mean, which is 0.187. In other words, the ML estimator underestimates
τ a bit on average. The histogram to the right looks slightly better; we will get back to it later. Notice
that τ was estimated to zero (at the boundary of the parameter space) for 13 of the 2000 simulated
datasets. This is also an indication that the asymptotics has not completely ”set in”.
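Such a simulation study can be set up with simulate() and refit() from lme4; the following is a sketch, assuming fit_ML is the ML fit of the barley model from above:

```r
set.seed(1)
sims <- simulate(fit_ML, nsim = 2000)        # 2000 datasets from the fitted model
tau_ML <- sapply(sims, function(ynew)
  as.data.frame(VarCorr(refit(fit_ML, newresp = ynew)))$sdcor[1])  # estimated tau
mean(tau_ML)          # should be close to the 0.187 reported above (the seed differs)
hist(tau_ML)          # compare with the middle panel of Figure 3.1
mean(tau_ML < 1e-6)   # fraction of (near-)boundary estimates
```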
The bias of the ML estimator for τ that we just saw for the spring barley data is not exceptional. On
the contrary, actually. Experience for small and moderate-sized datasets is that ML estimates for θ
and σ 2 tend to be biased in such a way that variability is underestimated. This has the consequence
that the estimated variance-covariance matrix of β̂ is biased, in particular the standard error of each β̂j
underestimates the true sampling variability of the estimator, and confidence intervals have lower actual
coverage than their nominal level. Hence, even in situations where we are not particularly interested in
the correlation structure of the data, it is problematic that it is not well estimated.
The phenomenon is not new, though. Recall the linear normal model Y ∼ Nn (Xβ, σ 2 In ). The ML
estimator for σ² is σ̂² = (1/n)(Y − Xβ̂)^T(Y − Xβ̂), where β̂ = (X^TX)^{-1}X^TY, and it has expectation
Eσ̂² = ((n − p)/n)σ² < σ². The easy fix in that situation is to divide by n − p instead of n; then we
get an unbiased estimator. The situation is not as simple for linear mixed models since there are no
general small-sample distribution results about θ̂ and σ̂ 2 . Nevertheless, there is a method, called REML
estimation, which works well in many situations — and for the linear normal model coincides with the
simple fix.
There are several ways to think about REML estimation — and even several opinions about what RE
stands for, namely restricted, residual, and reduced. First, consider the integrated likelihood

\[ (\sigma^2, \theta) \mapsto L^*(\sigma^2, \theta) = \int L(\beta, \sigma^2, \theta)\, d\beta \tag{3.4} \]
which is a function of (σ², θ) since β has been integrated away.
[Footnote: From a Bayesian point of view, the integrated likelihood can be interpreted as the posterior distribution of (σ², θ) when (β, σ², θ) is equipped with uniform priors. The meaning of this will be clear in the Bayes part of the course.]
The REML estimates for (σ², θ) are the values that maximize L∗, and an estimate for β is then obtained
in a second step. More specifically, REML estimation amounts to the following:

(a) Maximize the integrated likelihood L∗ wrt. (σ², θ), and denote the maximizer (σ̂², θ̂).

(b) Insert σ̂² and θ̂ into the original likelihood and maximize wrt. β in order to get an estimate β̂. This
amounts to using (3.2) with the new estimate θ̂ inserted for θ.
It turns out (and you will be asked to prove this in an exercise) that
\[ L^*(\sigma^2, \theta) \propto \frac{1}{(\sigma^2)^{(n-p)/2}\, |V_\theta|^{1/2}\, |X^T V_\theta^{-1} X|^{1/2}} \exp\left( -\frac{1}{2\sigma^2} (y - X\hat\beta_\theta)^T V_\theta^{-1} (y - X\hat\beta_\theta) \right) \tag{3.5} \]
where β̂θ = (X T Vθ−1 X)−1 X T Vθ−1 y. Step (a) above is about maximization of this function. Notice
that σ 2 is now raised to the power of (n − p)/2 in the first term, and that the integrated likelihood
includes an extra determinant compared to the original likelihood. Yet another profile argument, this
time first considering θ fixed and maximizing for σ 2 , therefore yields
\[ \sigma^2 = \sigma^2(\theta) = \frac{1}{n-p} (y - X\hat\beta_\theta)^T V_\theta^{-1} (y - X\hat\beta_\theta), \]
which is plugged into L∗ and the resulting function is maximized (numerically) wrt. θ. Notice that the
expression for σ 2 (θ) only deviates from (3.3) in the denominator, n − p vs. n.
In the linear normal model, where θ is known (or not at all present in the model), the expression for
σ 2 (θ) is exactly the same as the usual unbiased estimator for σ 2 . Hence, that estimator can be thought
of as the REML estimator.
For a second, and very different, derivation of the REML estimates, let C be an n × (n − p) matrix
with full rank and with C T X = 0, meaning that the columns of C are orthogonal to the columns in
X. Define W = C T Y which has dimension n − p. The elements in W are sometimes called error
contrasts because EW = 0. The distribution of W does not depend on β, and the idea is now to use W
as the data instead of Y for estimation of (σ 2 , θ). The likelihood based on W — depending on σ 2 and
θ only — turns out to be proportional to L∗ . As a consequence, we can think of the MLE for (σ 2 , θ)
based on W as the REML estimator for (σ 2 , θ), using the original data Y , and it also follows that the
solution does not depend on the specific choice of C. The computations are left for an exercise, where
it is also argued how a matrix C with the above-mentioned properties can be constructed.
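One possible numerical construction is sketched below, using the barley fit from the earlier sketches (so X is a 27 × 1 matrix); the QR decomposition is just one of several ways to obtain such a C:

```r
X <- getME(fit_ML, "X")                       # fixed-effects model matrix
y <- getME(fit_ML, "y")                       # response vector
n <- nrow(X); p <- ncol(X)
C <- qr.Q(qr(X), complete = TRUE)[, (p + 1):n, drop = FALSE]
max(abs(crossprod(C, X)))                     # numerically zero: columns of C are orthogonal to X
W <- crossprod(C, y)                          # the n - p error contrasts
```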
As explained, one idea for REML is to eliminate β from the likelihood, either by integrating it out
or by constructing error contrasts. There is even a third way of thinking about REML. Consider the
adjusted likelihood
\[ (\beta, \sigma^2, \theta) \mapsto H(\beta, \sigma^2, \theta) = \frac{L^*(\sigma^2, \theta)}{\sup_\beta L(\beta, \sigma^2, \theta)}\, L(\beta, \sigma^2, \theta). \]
where L is the original likelihood given by (3.1) and L∗ is the integrated likelihood given by (3.4).
Now, maximize it wrt. all parameters. A profile algorithm takes this form:
1. Consider (σ 2 , θ) fixed, consider H as a function of β, and notice that the ratio in the definition
of H does not depend on β. Hence, as a function of β for known (σ 2 , θ),
H(β, σ 2 , θ) ∝ L(β, σ 2 , θ).
Maximize this function wrt. β and denote the solution β̄(σ 2 , θ). Notice that β̄(σ 2 , θ) maximizes
H as well as L (for fixed values of σ 2 and θ).
2. Insert β̄(σ 2 , θ) into H to get the adjusted profile likelihood,
\[ (\sigma^2, \theta) \mapsto H\bigl(\bar\beta(\sigma^2, \theta), \sigma^2, \theta\bigr) = \frac{L^*(\sigma^2, \theta)}{\sup_\beta L(\beta, \sigma^2, \theta)}\, L\bigl(\bar\beta(\sigma^2, \theta), \sigma^2, \theta\bigr) = L^*(\sigma^2, \theta), \]

where the last equation holds because β̄(σ², θ) also maximizes L for fixed σ² and θ. Hence,
maximizing (σ², θ) ↦ H(β̄(σ², θ), σ², θ) and maximizing L∗ are equivalent. Denote the solution (σ̄², θ̄).
3. Insert (σ̄ 2 , θ̄) into the ”expression” for β and get the estimated β, i.e., β̄ = β̄(σ̄ 2 , θ̄). Recall from
point 1. that this value is defined as the value that maximizes
β ↦ H(β, σ̄², θ̄) ∝ L(β, σ̄², θ̄).
The algorithm revealed that maximization of the adjusted likelihood H (wrt. all parameters) is equiv-
alent to the two-step procedure given by steps (a) and (b) above. So, REML estimation of (σ 2 , θ) can
indeed be thought of in three different ways: Maximization of the integrated likelihood, estimation via
contrasts, or maximization of the adjusted likelihood. Strictly speaking, REML is a method for estimation
of the variance parameters only. However, we will refer to the associated estimates of β, obtained by
plugging the REML estimate of θ into (3.2) as REML estimates, too. Notice that the adjusted likeli-
hood approach is the only one where the complete estimator of (β, σ 2 , θ) is obtained via maximization
of a likelihood-ish function, but also that the adjusted likelihood is not actually the likelihood for a
statistical model.
A highly desirable property for the ML principle is the invariance to parameterizations: No matter the
parameterization, we get the same fitted model. This is also true for REML estimation as long as we
consider linear reparameterizations of β — which is what we would typically do. In order to see this,
consider two different full-rank model matrices, X and X̃, for the same model. In other words, X and
X̃ span the same linear subspace. The columns of X̃ are linear combinations of the columns in X,
so we can find a non-singular matrix D of dimension p × p such that X̃ = XD. If the fixed-effects
parameters are denoted β and γ in the two parameterizations, then X β̂θ = X̃ γ̂θ for any fixed θ since
the fitted values are independent of the parameterization in a linear normal model. Moreover,
    |X̃ᵀVθ⁻¹X̃| = |DᵀXᵀVθ⁻¹XD| = |D|² |XᵀVθ⁻¹X|.
It follows from (3.5) that the integrated likelihoods corresponding to the two parameterizations are
proportional and thus lead to the same estimates for (σ 2 , θ).
One still has to be a bit careful, though. It is indeed true that the REML estimates are the same for
two such parameterizations of the same model, but the maximum value of the REML (log-)likelihood
typically changes, so one cannot use the REML log-likelihoods for likelihood ratio tests and compar-
ison of AIC values, say, unless the parameterization of the mean structure is the same in the models
to be compared. For non-linear reparameterizations (which are atypical for linear mixed models), the
estimates do not necessarily carry over with REML; Danish-speaking students may read more details
about this in Hansen (2012, Section 15.6).
In certain simple models there are closed-form expressions for the REML estimators, see Section 3.3
on ANOVA estimation below, and in these cases the distribution of the REML estimators is known,
too, and the estimators are unbiased. This does not hold in general, though, but REML estimators
share the same asymptotic properties as the ML estimators, and have better small-sample properties.
Therefore we use the REML principle as default. The lmer function does that as well, and you must
use the option REML=FALSE in order to get the ML estimators.
Now, let us return to the spring barley data for illustration.
Example 3.2. (Infection in spring barley, continued) In Example 3.1 we found the ML estimates
β̂ = ȳ = 0.334, τ̂ = 0.210, σ̂ = 0.157,
while the REML estimates turn out to be
β̃ = ȳ = 0.334, τ̃ = 0.225, σ̃ = 0.157.
Notice that the estimates of β and σ are unchanged; this is due to the particularly simple nature of the
model and the balancedness of the data and does not carry over to all linear mixed models. The REML
estimates are computed with one of the following commands:
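(A sketch; the data frame and variable names are illustrative, since they depend on how the data were read in.)

library(lme4)
fit <- lmer(y ~ 1 + (1 | plot), data = barley)               # REML is the default
fit <- lmer(y ~ 1 + (1 | plot), data = barley, REML = TRUE)   # equivalent, with the option stated explicitly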
The right graph in Figure 3.1 shows a histogram of the REML estimates for the same 2000 datasets as
in the middle graph. We see that the distribution of the REML estimates is shifted slightly to the right
compared to the distribution of the ML estimator, in particular the mean is closer to the true value.
Nine of the REML estimates are zero in this case.
ML and REML estimation applies to all normal LMMs, i.e., all models of the form (2.2) — although
there may be numerical difficulties. As explained, there are generally no explicit formulas for the
estimators and no results on the exact distributions of the estimators. However, if the data structure is
”nice” in a certain sense, then one can carry out estimation in a quite different way, and with explicit
formulas. This is the case if there are only categorical covariates (factors) and the set of factors used
in the model (both fixed and random) and the set of random factors both satisfy two conditions about the
“design” of the experiment or data collection process: The sets should be minimum-stable, and factors
should be pairwise orthogonal. Moreover, random factors should be balanced. We return to these
concepts in Section 6.4, see also Hansen and Tolver (2023, Section 5.3) or Tjur (1984).
When the assumptions are fulfilled, the complete LMM can be interpreted as a set of independent linear
normal models (not mixed), one for each random factor in the model. The fixed-effects parameters and
random effect variances in the LMM are linear functions of the mean and variance parameters in the
associated linear normal models, and estimation of the LMM can be conducted by fitting each of the
linear normal models separately and then transforming the estimates. The estimation procedure is
sometimes referred to as ANOVA estimation, because it is based on the same decomposition of Rn as
for ANOVAs in minimum-stable and geometrically orthogonal designs.
Since the “intermediate” models are linear normal models, without random effects, there are explicit
solutions and exact distributional results. In particular, estimators for mean parameters and variances in
the linear normal models are unbiased, and this property is inherited by the fixed-effects parameters
and the random effects variances in the associated LMM due to linearity. ANOVA estimates of random
effect variances may be negative (because they are computed as linear combinations of variances
estimated from the intermediate models, possibly with negative coefficients). This is of course unpleasant,
but when they are positive they are equal to the REML estimates. This gives a (heuristic) argument for
generally using REML rather than ML estimates: If the design is “not too far from” being orthogonal,
then the REML estimators are likely to be “almost” unbiased, as in the orthogonal case.
ANOVA estimation was important in the old days, where numerical optimization was a difficult task.
Nowadays, this is much less of an issue, and most people therefore use ML/REML estimation with-
out bothering much about whether the design is orthogonal or not. In my opinion, the theory about
orthogonal designs plays a more important role for hypothesis testing than for estimation, and we will
therefore return to it in Section 6.4. Interested students can look for details in Tjur (1984) where the
theory is presented formally in a general framework or Jiang and Nguyen (2021, section 1.5.1) for a
less general treatment. The aov function in R can be used for ANOVA estimation in LMMs, but I
never use it myself.
We now return to ML and REML estimation, but with emphasis on computations. We used profile
arguments many times in the description above about ML and REML estimation. The trick about
profiling is that it breaks down one high-dimensional optimization problem into several optimization
problems of lower dimension. This is obviously smart if each of the smaller problems is easy. So
we could ask: Are we left with ”easy” problems after profiling? Well, not in all situations. The main
challenge is that the procedures involve inversion of Vθ and computation of |Vθ | = det(Vθ ). The matrix
Vθ is of size n × n, hence computations are unproblematic for smaller datasets, but not necessarily so
for large datasets if Vθ does not have the form of a block diagonal matrix with smaller non-zero blocks.
Recall also that the log-likelihood must be evaluated many times for optimization, so speed of each
evaluation matters in practice.
The lme4 package has a quite efficient implementation, and it actually does not follow the procedures
described above. We are not going to dive into all the details, but we describe the overall idea and the
main ingredients. The aim is to give you a feeling for the computations and careful considerations that
take place ”behind the scenes”, not to enable you to implement the procedure yourself!
For more details, see Bates et al. (2015) and Bates (2010). As a side remark, the paper from 2015 has
68172 citations (October 26, 2023), and I am quite sure that it is not cited in anywhere near all the papers
where the package is used.
The main achievement is to write the deviance, i.e., minus two times the log-likelihood (except for a
constant), as
    d(β, σ², θ) = −2 log L(β, σ², θ) = n log(σ²) + log(|Lθ|²) + r²β,θ/σ²   (3.6)

where Lθ is a lower triangular q × q matrix and

    r²β,θ = min_u { ||y − Xβ − ZΛθu||² + ||u||² }.   (3.7)
There are two properties about this new formula, which are important for the computations: First, the
determinant of a triangular matrix is just the product of the diagonal elements, so that is easy! Second,
the minimization problem in (3.7) has an explicit solution. Altogether, this implies that ML estimation
can be carried out efficiently as follows (and with a similar procedure for REML estimation):
1. For fixed θ, the fixed-effects parameter β enters into d(β, σ², θ) only through r²β,θ. Despite being
   written as a minimum, r²β,θ has an explicit expression. Minimization of r²β,θ wrt. β yields β̂θ as well
   as the smallest value, which we denote r²θ. Moreover, σ̂²θ = r²θ/n since d, as a function of σ², has the
   usual form.

2. Insert β̂θ and σ̂²θ into d; this gives the profiled deviance, a function of θ alone, which must be
   minimized numerically wrt. θ. Denote the solution θ̂.
3. Finally, plug θ̂ into the expressions for β̂θ and σ̂θ2 in order to get the final estimates for β and σ 2 :
β̂ = β̂θ̂ and σ̂ 2 = σ̂θ̂2 .
Let us briefly describe the ingredients. The first ”trick” is to replace the random effects B with iid.
random variables U . Recall that B ∼ Nq (0, Σ) and that Y |B = b ∼ Nn (Xβ + Zb, σ 2 I). Since Σ
is a positive semidefinite q × q matrix, we can write Σ = σ 2 Λθ ΛTθ for a q × q matrix Λθ . Notice that
we still focus on how θ determines a scaled version of Σ rather than Σ itself. If U ∼ Nq(0, σ²Iq) then
ΛθU ∼ N(0, Σ), so we can rewrite the observation part of the model as

    Y | U = u ∼ Nn(Xβ + ZΛθu, σ²In).   (3.8)
Second, ΛTθ Z T ZΛθ + Iq is positive definite, so a Cholesky decomposition gives a lower triangular
q × q matrix Lθ such that
Lθ LTθ = ΛTθ Z T ZΛθ + Iq .
This is the Lθ entering into expression (3.6). A particularly clever choice of Lθ is based on permuta-
tions of the elements of u and gives a sparse solution, i.e., a solution with many zeroes. This permutation is determined
once and for all, not for every θ.
Third, if we are sloppy with the notation and denote all densities f , no matter if they are marginal
or conditional (and let the arguments tell which density each f refers to), then the usual rules for
conditional densities give us the following expression for the conditional density of U given Y = y:
    f(u | y) ∝ f(u) f(y | u)
             ∝ exp( −||u||²/(2σ²) ) · exp( −||y − Xβ − ZΛθu||²/(2σ²) )
             = exp( −(||u||² + ||y − Xβ − ZΛθu||²)/(2σ²) ).
Here, ”∝” means proportionality wrt. u, and we have used (3.8) and U ∼ N (0, σ 2 Iq ). Since the
marginal distribution of U as well as the conditional distribution of Y given U = u are both normal
(with the conditional mean depending linearly on u and constant variance), the conditional distribution
of U given Y = y is also normal. In particular, f (u | y) attains its maximum at the conditional mean.
Hence,
    arg minu { ||y − Xβ − ZΛθu||² + ||u||² } = arg maxu f(u | y) = E(U | Y = y),
and as we know from Sørensen (2023, Example 2.5), there is an explicit expression for E(U | Y = y).
This shows that the minimization problem in (3.7) indeed has a formula solution. The expression itself
is not so important for now, but will be given later, see equation (5.2).
Fourth, notice that we only need to make computations for y = yobs, the observed vector, which can
therefore be kept fixed. If we set h(u) = f(u) f(yobs | u), then (with sloppy notation) the likelihood
function is

    L(β, σ², θ) = f(yobs) = ∫ f(u, yobs) du = ∫ h(u) du.
In the end, things can be put together in order to obtain (3.6), see Bates (2010, Section 5.4.2).
4 Standard errors and confidence intervals
Chapter 3 described the methods to obtain point estimates with the ML and REML principles, but as
always, we are also interested in the uncertainty of the estimators, in particular for β, but sometimes
also for the variance and correlation parameters, i.e., the residual variance or standard deviation and
parameters determining Σ. There are different methods for constructing confidence intervals/regions,
the most prominent ones being Wald, profile likelihood, and bootstrap. Although there is not much
news compared to general theory, we comment on all three methods. We are mainly going to talk
about confidence intervals for one parameter at a time, although it is also possible to make confidence
regions for pairs, say, of parameters. In the following, we use the abbreviation CI for confidence
interval.
Consider first the estimator for the fixed-effects parameter β, obtained by plugging the estimated θ̂ into (3.2),

    β̂ = (XᵀVθ̂⁻¹X)⁻¹ XᵀVθ̂⁻¹ Y,

with variance (for fixed θ)

    Var(β̂) = (XᵀVθ⁻¹X)⁻¹ XᵀVθ⁻¹ · σ²Vθ · Vθ⁻¹X (XᵀVθ⁻¹X)⁻¹
           = σ² (XᵀVθ⁻¹X)⁻¹ (XᵀVθ⁻¹X) (XᵀVθ⁻¹X)⁻¹
           = σ² (XᵀVθ⁻¹X)⁻¹.

The estimated variance-covariance matrix for β̂ is calculated simply by plugging in the estimates for
σ² and θ, be it the ML or the REML ones:

    Vâr(β̂) = σ̂² (XᵀVθ̂⁻¹X)⁻¹.

In particular, the standard error for the kth element βk is SE(β̂k) = √Vâr(β̂)kk. The method ignores that
θ is estimated and may thus underestimate the uncertainty.
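In lme4 the estimated variance-covariance matrix and the standard errors can be extracted along these lines (a sketch; modelfit denotes the fitted model object):

vcov(modelfit)                # estimated variance-covariance matrix of the fixed-effects estimates
sqrt(diag(vcov(modelfit)))    # standard errors of the fixed-effects estimates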
4.2 Wald confidence intervals
Consider the kth coordinate of β. We already have the ingredients for the Wald confidence interval for
βk . For example, the 95% CI is
β̂k ± 1.96 · SE(β̂k ).
It can be computed with ML or REML estimates (and the corresponding SE), but we usually prefer the
REML version since, as mentioned, ML tends to underestimate the standard error and thus give lower
actual coverage than the nominal 95% for smaller datasets. Moreover, the validity of the Wald CI relies
on β̂ being normally distributed. This is true asymptotically, but not necessarily for smaller samples.
For those reasons, the Wald method is not the default in the lme4 package, but can be obtained with a
command like
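(A sketch; modelfit is the fitted model object, as elsewhere in these notes.)

confint(modelfit, method = "Wald")   # Wald CIs for the fixed-effects parameters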
It is possible, although not so common in practice, to make Wald confidence regions for pairs (or larger
subsets) of coordinates of β. On the other hand, Wald confidence intervals are not readily available for
σ or components in Σ, since standard errors for their estimates are not easily computed.
The preferred way to make confidence intervals for LMMs is to use the profile likelihood method. For
ease of notation, consider β1 , the first coordinate of β as the parameter of interest, and proceed as
follows:
1. Compute the maximum log-likelihood, i.e., the log-likelihood function evaluated at the ML esti-
mator:
    log Lmax = sup(β,σ²,θ) log L(β, σ², θ) = log L(β̂, σ̂², θ̂);
2. Consider a fixed value β1⁰ of β1, and maximize the log-likelihood wrt. all other parameters:

    log L0max = sup { log L(β, σ², θ) : β1 = β1⁰ };
3. Consider the likelihood ratio test statistic for the hypothesis H0 : β1 = β1⁰,

    LRT(β1⁰) = −2 log(L0max/Lmax) = −2 (log L0max − log Lmax),

and include the value β1⁰ in the 95% CI if the hypothesis is not rejected at the 5% level. If
β1 = β1⁰, then LRT(β1⁰) ∼ χ²₁ asymptotically, with large values being critical, so β1⁰ is
included in the CI if LRT(β1⁰) < 3.84, where 3.84 = 1.96² is the 95% quantile of the χ²₁
distribution.
Figure 4.1: Zeta plot for the spring barley data. The three panels correspond to τ, σ and β, respectively
(despite the names at the top).
If another coverage rate than 0.95 is wanted, then the limit 3.84 is changed accordingly. Notice that —
as opposed to Wald CIs — profile likelihood CIs are not symmetric around the estimate by construction.
As is clear from the construction, the validity of the profile likelihood CIs relies on LRT (β10 ) being
approximately χ21 distributed under the null. According to Bates et al. (2015) this is a less restrictive
assumption than that of β̂1 being approximately normally distributed with the computed SE as standard
deviation. Therefore, profile likelihood CIs are generally more reliable than Wald CIs, and it is also
what you get as default, i.e., with the command
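(A sketch; modelfit is the fitted model object.)

confint(modelfit)   # method = "profile" is the default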
Some important comments: First, profile likelihood CIs are always based on the true (not adjusted)
log-likelihood, see point 1. above. As a consequence, if modelfit is a REML fit, then the model
must be re-fitted with ML as part of the CI computation. This is implemented in R so the model is re-
fitted automatically. Second, as explained by the algorithm, models have to be fitted for a whole range
of β10 values, so computations may take a while for larger datasets and/or complicated models. Details
about the computations can be found in Bates et al. (2015). Third, the profile likelihood method works
not only for the coordinates of β, but also for the variance and correlation parameters, and it would
also work for pairs (or triples) of parameters.
There is a feature in the lme4 package, called zeta plots, for illustration of the profile likelihood con-
fidence intervals and the sensitivity of the model towards changes of the parameters (one at a time).
Figure 4.1 shows this plot for the barley data. There is one panel per parameter (τ, σ, β, in that order),
and the y-axis is a transformation of LRT. More specifically, the y-axis shows ζ = ±√LRT, with
minus and plus to the left and right of the MLE, respectively. Then, parameter values with |ζ| < 1.96
are included in the 95% CI. The vertical bars correspond to 50%, 80%, 90%, 95%, 99% coverage rates,
respectively.
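In code, the zeta plot can be produced along these lines (a sketch; modelfit is the fitted model, and the xyplot method for profile objects is provided by lme4):

library(lattice)
pr <- profile(modelfit)   # profile the parameters (may take a while)
xyplot(pr)                # zeta plot, one panel per parameter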
Approximate linearity of the ζ function corresponds to the log-likelihood function being approximately
quadratic which again implies that the distribution of the estimator is approximately normal. Therefore,
the more curved the ζ function, the larger the difference between the profile likelihood and the Wald CIs
(and typically the less valid the Wald CIs).
4.4 Bootstrap confidence intervals
As a final method for construction of confidence intervals, we consider (parametric) bootstrap. The
idea is to learn the distribution of the estimator by means of simulation. We simulate a large number
(M , say) of datasets from the fitted model, for each of these simulated datasets compute the estimate,
and finally use properties of the ”bootstrap distribution” for computation of confidence intervals. More
specifically:
1. Use the fitted model to simulate values of the outcome variable, and denote these values y ∗ . The
data consisting of y ∗ and the original explanatory variables will be referred to as the bootstrap
data;
2. Fit the model to the bootstrap data with the same method as you used for the real data; denote
the bootstrap estimates β̂ ∗ , σ̂ 2,∗ , θ̂∗ and Σ̂∗ for the estimated variance of B;
3. Repeat steps 1 and 2 many times such that suitable summaries, like relevant quantiles, can be
determined with appropriate precision.
Consider for example βk, the kth coordinate in β, and the associated bootstrap sample consisting of the
M values of β̂k*. Let β̂*k,2.5% and β̂*k,97.5% be the 2.5% and 97.5% quantiles in the bootstrap distribution.
The percentile bootstrap CI for βk is computed as (β̂*k,2.5%, β̂*k,97.5%), whereas the basic bootstrap CI
is computed as (2·β̂k − β̂*k,97.5%, 2·β̂k − β̂*k,2.5%). As an example, the basic bootstrap with 2000
bootstrap datasets can be computed like this:
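(A sketch; modelfit is the fitted model, and the boot.type argument selects the type of bootstrap interval.)

confint(modelfit, method = "boot", nsim = 2000, boot.type = "basic")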
There are other constructions of bootstrap confidence intervals, too. A standard error for β̂k can be
computed as the standard deviation of β̂k∗ . Consult the Wikipedia page for Bootstrapping (statistics) if
you are not familiar with bootstrap methods.
A few comments about step 1: The lme4 package has a function called simulate, which makes
simulated values of y, so it is quite easy to carry out bootstrap ”manually”. Notice that only values of y
are simulated, whereas the model matrix X is kept fixed at the observed values. As an alternative, one
could use resampling techniques in order to make bootstrap data. One has to be quite careful for mixed
models, though, because it is important that the bootstrap data has the same dependence structure as
the observed data. Therefore, resampling must be carried out at the level of independent units, and we
must be extra careful if there are several random effects involved.
Example 4.1. (Infection in spring barley, continued) Table 4.1 shows confidence intervals for the
spring barley data, with five different methods. The bootstrap based CIs are based on 5000 simulated
datasets. The differences for β are not large for these data, but notice how the Wald CI for β based
on ML is more narrow than the Wald CI based on REML which is again more narrow than the profile
likelihood CI. Differences are the largest for τ .
Remember that the results in the table do not tell us which method to prefer. Indeed, we are happy
with narrow CIs — but only if the coverage is correct! We cannot check coverage rates when we only
examine one dataset; we need to compute CIs for many simulated datasets and check how often they
contain the simulation value.
Table 4.1: 95% confidence intervals for the spring barley data, computed with five different methods.

                            β              τ              σ
Wald, based on ML           0.184–0.483    —              —
Wald, based on REML         0.175–0.492    —              —
Profile likelihood          0.167–0.501    0.119–0.385    0.117–0.226
Bootstrap, percentile       0.173–0.496    0.082–0.349    0.105–0.207
Bootstrap, basic            0.172–0.494    0.102–0.368    0.107–0.209
Now, we already simulated M datasets, and we can re-use them to check the coverage for the Wald
and profile likelihood CIs. More specifically: For each bootstrap dataset we computed not only the
estimates, but also the Wald and profile likelihood CIs, and finally computed the relative frequency of
bootstrap datasets for which the ”true” parameter (the one used for simulation) was contained in the
CIs. For β, the coverage rates turned out to be 0.89 (Wald, based on ML), 0.91 (Wald, based on REML),
and 0.92 (profile likelihood), while coverage was 0.92 for τ (profile likelihood) and 0.94 for σ (also
profile likelihood). It is a bit surprising that even the profile likelihood CIs have a coverage somewhat
below the nominal 95% for β. It would be much more time consuming to check the coverage of the
bootstrap CIs since it would require one more simulation loop: For each of many simulated datasets,
we should make a bunch of bootstrap datasets, which cannot be re-used between datasets (why?).
Example 4.2. (Language and verbal IQ, continued) Recall the model for the language data, with random
school-specific intercepts and slopes. The REML estimates (SEs, 95% profile likelihood CIs) for the
fixed-effects parameters are
In particular, the expected difference in test result between two random students from a random school,
and with a one unit difference in verbal IQ is estimated to 2.52 units. The expected difference is the
same no matter if the two random students come from the same school or not, but the variance of such
a difference would be larger for students from different schools, since it should also incorporate the
random variation between schools. The residual standard deviation is estimated to σ̂ = 6.30, and the
variance matrix for (Bi,0, Bi,1)ᵀ is estimated to

    Σ̂ = ( 3.14²   −1.09
          −1.09   0.45² )
such that the correlation between Bi,0 and Bi,1 is estimated to −0.76. The profile likelihood 95% CI
for the correlation is (−1 , −0.49), so there is clear evidence of negative correlation between intercept
and slope. Notice that the lower limit of the CI is at the boundary of the parameter region.
5 Prediction of random effects and model validation
For a given random factor it is sometimes relevant to quantify which groups have large random effects
and which have small random effects. This amounts to estimation of b, but since b is a realization of a
random variable, it is common to use the term prediction rather than estimation.
Recall that B ∼ Nq(0, Σ) and that Y | B = b ∼ Nn(Xβ + Zb, σ²In). We already noticed that this
implies that the joint distribution of Y and B is Gaussian. More specifically, as noted in (2.3),

    (Y, B)ᵀ ∼ Nn+q( (Xβ, 0)ᵀ , [ ZΣZᵀ + σ²In   ZΣ ; ΣZᵀ   Σ ] ).
It then follows from Sørensen (2023, Example 2.5) that the conditional distribution of B given Y = y
has mean

    E(B | Y = y) = ΣZᵀ(ZΣZᵀ + σ²In)⁻¹(y − Xβ),   (5.1)

and as predictor of B we simply use this expression with the estimates of β, Σ and σ² inserted:

    b̂ = Σ̂Zᵀ(ZΣ̂Zᵀ + σ̂²In)⁻¹(y − Xβ̂).
If we let Hθ = X(XᵀVθ⁻¹X)⁻¹XᵀVθ⁻¹ denote the matrix producing the fitted values for fixed θ,
then Xβ̂θ = Hθy. Plugging this into (5.1), and considering the associated random variable, gives

    B̃ = ΣZᵀ(ZΣZᵀ + σ²In)⁻¹(I − Hθ)Y.
In practice, computation of the eBLUP in the lme4 package goes via the independent U s introduced
in Section 3.4 about implementation. Recall that the q × q matrix Λθ was defined by the equation Σ =
σ 2 Λθ ΛTθ such that Λθ U ∼ N (0, Σ) for U ∼ Nq (0, σ 2 Iq ), and Y |U = u ∼ Nn (Xβ + ZΛθ u, σ 2 I).
After manipulations in the multidimensional normal distribution, we find that ũ = E(U | Y = y) is
given by
ũ = σ 2 ΛTθ Z T (ZΣZ T + σ 2 In )−1 (y − Xβ).
Matrix computations now yield that ũ satisfies

    (ΛθᵀZᵀZΛθ + Iq) ũ = ΛθᵀZᵀ(y − Xβ),

or

    LθLθᵀ ũ = ΛθᵀZᵀ(y − Xβ),   (5.2)
where Lθ is the lower triangular q × q matrix with Lθ LTθ = ΛTθ Z T ZΛθ + Iq (also introduced in the
Section 3.4). It is numerically more efficient to solve this equation and then compute the corresponding
b̃ = Λθ ũ than to compute b̃ directly. Actually, the computation of ũ is a fundamental part of the
estimation procedure, anyway, so no extra computations are required. As before, the estimated values
of θ and σ² are inserted to get estimated versions û and b̂ = Λθ̂ û.
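In practice the predicted random effects are extracted with functions like these (a sketch; modelfit is the fitted model):

ranef(modelfit)   # predicted random effects, one table per random factor
coef(modelfit)    # group-specific coefficients combining fixed effects and predicted random effects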
As always, the validity of standard errors, confidence intervals and p-values is only ensured if the
model assumptions are appropriate. As described in the motivating examples, the introduction of
random effects into the analysis is fundamental for a valid analysis since they model the dependence
structures in the data. Exactly which random effects to include, and how, is often a matter of careful
considerations about the experimental design or data collection process rather than the result of model
validation.
Apart from dependence structures, experience is that the most important assumptions are those of the
conditional distribution of Y given the random effects, i.e., that Xβ +Zb is an appropriate specification
of the conditional mean, and that εi s have the same variance (variance homogeneity). This is tested
with residual plots as usual, where fitted values and residuals are computed as
ŷ = X β̂ + Z b̂, r = y − ŷ.
Notice that these fitted values include the predicted random effects, and thus include subject-specific
effects. They are sometimes called fitted values at level one, as opposed to fitted values at level zero,
Xβ̂, which are population estimates. The residual plot has fitted values on the x-axis and residuals on
the y-axis, and is easily made in R with the command
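(A sketch; modelfit is the fitted model.)

plot(modelfit)   # residuals against fitted values at level one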
As usual, we hope that points are scattered randomly around zero in the vertical direction, with no
obvious patterns. Otherwise we must re-think the model assumptions and for example include more
explanatory variables or interactions, transform one or more variables entering in the model (outcome
and/or numerical explanatory variables), allow for different residual standard deviations, or perhaps
choose another distribution for the residuals. In situations with repeated measures over time or space,
it may not be reasonable to assume independent residuals, and this can be tested with suitable auto-
correlation plots of the residuals. Correlated residuals call for a more careful modeling via a non-
diagonal variance-covariance matrix for ε.
Usually we also check for normality of the residuals with a QQ-plot, but as with linear normal models
the residual plot is far more important, and we are often happy as long as the distribution of the residuals
is fairly symmetric. If the dimension q of the random effects is large, then it is also possible to make a
QQ-plot of the predicted random effects to assess normality. Often q is relatively small, and it is not really
possible to check the properties of B, but we may inspect the predicted random effects to detect
weird or extreme subjects/groups.
6 Hypothesis tests
In most cases, hypotheses about the fixed-effects part of the model are of most relevance, and this is
also the main focus of this section. However, we are also sometimes interested in hypotheses about the
random effects, and we return to that in the end. Standard test statistics like the likelihood ratio test
statistic and the Wald test statistic are of course available, but the usual chi-square approximations are
sometimes problematic. We illustrate this in an example and discuss other possibilities.
Hypotheses about the fixed effects are hypotheses about EY , typically expressed as constraints on the
parameter vector β. Recall that EY = Xβ where X is an n × p matrix of full rank, typically constructed
from a set of explanatory variables. In other words, the model assumes that EY ∈ LX = {Xβ | β ∈
Rp }. Now, consider the hypothesis ξ ∈ L0 where L0 ⊂ LX and let X0 be a full-rank matrix that spans
L0 . Let p0 be the dimension of L0 , i.e., the number of columns in X0 . The null model is
Y = X0 β + ZB + ε
with β ∈ Rp0 and with the usual assumptions on B and ε. We denote the null model by M0 and the
full/original model by M .
One standard choice of test statistic is the likelihood ratio test statistic. We denote it LRT, so

    LRT = −2 log(L0max/Lmax) = 2 (log Lmax − log L0max),   (6.1)
where Lmax and L0max are the maximum values of the likelihood function under the models M and
M0 , respectively. We already did the hard work and described how to carry out ML estimation, so it
is now easy to compute LRT . The usual asymptotic results hold, so for ”large enough” samples, and
under the null model, LRT has a chi-square distribution with p − p0 degrees of freedom. Hence, we
compute the p-value as

    p = P(W ≥ LRTobs), where W ∼ χ² with p − p0 degrees of freedom.   (6.2)
With the lme4 package the LR test is easily carried out with commands like
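(A sketch of the two standard possibilities.)

anova(modelfit, hypfit)           # LR test of the null model against the full model
drop1(modelfit, test = "Chisq")   # alternative: LR tests for dropping single terms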
where modelfit is the fit of the full model and hypfit is the fit of the null model.
Importantly, L0max and Lmax in (6.1) refer to the maximum values of the ”real” likelihood, not the REML
likelihood. The drop1 and anova functions ”know” that, so if modelfit and/or hypfit are fitted
with REML, then the models are re-fitted with ML before the LR test statistic is computed. Therefore,
you do not have to worry about the ML/REML issue if you use the built-in functions for tests. On
the other hand, if you compute LRT manually in the sense that you extract the log-likelihood values
yourself with logLik, then you must make sure to use ML fits.
Example 6.1. (Comparison of ergonomic stools) This example comes from Bates (2010, Section 4.2),
and the data are available as ergoStool in the nlme package. Nine men tested four different types
of stools, and the effort required to rise from the stool was registered on a special scale called the
Borg scale. The four stool types are coded as T1 (high, 50 cm), T2 (normal/low, 32 cm), T3 (low,
one-legged), T4 (low, special type). Each of the nine test persons tested each stool once, so there is a
total of 36 observations.
Figure 6.1 shows two so-called interaction plots for the data, namely profiles for each stool (left) and
profiles for each test person (right). The natural model for the data is the linear mixed model with fixed
effect of Stool type and random effect (intercept) of Test person. Exercise for the reader: What do you
see from the graphs, and why is the suggested model a natural one?
Figure 6.1: Interaction plots for the ergonomic stool data: effort (Borg scale) against test person with
one profile per stool type (left), and against stool type with one profile per test person (right).
The suggested model assumes EY ∈ LType where LType is the subspace generated by the categorical
Stool type variable. In other words, the expected effort is allowed to differ between stool types, with
no restrictions on the four expected values. A natural hypothesis is that the four expected values are
the same for the four stool types, i.e., β1 = β2 = β3 = β4 if we use the parameterization where βj is
the expected value for type j. If L1 is the subspace associated to the constant factor, spanned by the
36-column of ones, 136 , then the hypothesis can be expressed as EY ∈ L1 .
The LR test statistic turns out to be as large as 36, and the χ23 distribution gives a p-value of 7 · 10−8 ,
so there is clear evidence that there are differences in the efforts required. This is not surprising,
considering Figure 6.1. Differences could be reported in several ways, as estimated means (with SEs
or CIs) or as estimated differences to one type (with SEs or CIs), and possibly pairwise tests. You are
encouraged to do this.
6.2 Likelihood ratio tests for fixed effects, simulation of null distribution
In many cases, the chi-square approximation and thus the p-value in (6.2) are fine, but as always when
inference is carried out with asymptotic results, the question arises whether the sample size is large
enough such that asymptotics has ”set in”. For mixed models, this question is even a little more subtle,
because: What does ”sample size” more precisely refer to? Is it the number of subjects or the total
number of observations or both?
For a simple illustration, consider the following example, which is invented for the occasion.
Example 6.2. (Comparison of textbooks) Consider a university course with 180 students allocated at
random to six classes with 30 students each. Three classes, chosen at random, use textbook A while
the other three classes use textbook B, and we want to study if textbook A or B is the better one. At
the end of the course, all students take an exam and we use the exam result as outcome. The relevant
hypothesis is H0 : µA = µB where µA and µB are the expected exam result associated to textbooks
A and B, respectively. Since students in the same class share teacher, it is natural to include a random
class-specific intercept, and thus use the model corresponding to the following lmer command:
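(A sketch; the data frame and variable names are invented for this hypothetical example.)

lmer(exam ~ textbook + (1 | class), data = course)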
So, what is the relevant sample size here, for the comparison of textbooks?
Textbook is randomized at class (not student) level, such that all students in the same class use the
same textbook. Therefore, the difference between textbooks must be compared to the class-to-class
variation rather than the student-to-student variation. There are only six classes, or three replications
for each textbook, so essentially three independent ”observations” per textbook, and it is doubtful if
the asymptotic distribution of the likelihood ratio test statistic is appropriate in this case. As will be
explained below, we can compute the p-values by simulation or rely on other distributional results.
The problem is that the textbook and class factors are nested. Even though there are many students
such that student-to-student variation is well determined, it does not help us because the relevant variation
in this case is the class-to-class variation. If the allocation of textbooks to students was done within
each class, the story would be different. Then, the difference between textbooks would be unrelated to
differences between classes, and the number of students would be the relevant count of sample size for
the comparison (but it would be a difficult experiment to carry out in practice).
Notice that in a linear normal model with Textbook and Class as explanatory variables, it would not
be readily possible to test for an effect of textbook due to the nestedness of factors.1 So, apart from
it (in my opinion) being more natural to consider classes as random rather than fixed, it also makes it
possible to test the hypothesis of interest. It would be wrong to ignore the classes and make a simple
two-sample comparison since this would assume independence between students.
We can turn to simulations if we doubt the appropriateness of the chi-square approximation to the true
null distribution of LRT . The relevant distribution is the distribution under the null model, so datasets
must be simulated from the null model M0 (not the full model M ). The procedure is this one:
1. Simulate outcomes y ∗ from the null model M0 , i.e., from the model where the hypothesis is true;
2. Compute LRT using y ∗ as the outcome, i.e., fit models M and M0 to the simulated data and get
the associated LR test statistic, LRT ∗ ;
3. Repeat steps 1 and 2 many, say K, times and thus obtain LRT*1, . . . , LRT*K;
4. Compute the simulated p-value as the relative frequency of simulated datasets where the simu-
lated LRT exceeds the observed one
    p = (1/K) · #{k | LRT*k ≥ LRTobs}.   (6.3)
It is easy to simulate from a model fitted with lmer: The following command gives a vector of
simulated outcomes:
simulate(hypfit)$sim_1
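A manual version of the whole procedure could look roughly like this (a sketch; modelfit and hypfit are the fits of M and M0, and LRTobs is the observed test statistic computed beforehand):

K <- 5000
lrt_sim <- replicate(K, {
  ysim <- simulate(hypfit)$sim_1            # step 1: simulate from the null model
  fullsim <- refit(modelfit, ysim)          # step 2: refit both models to y*
  nullsim <- refit(hypfit, ysim)
  as.numeric(2 * (logLik(refitML(fullsim)) - logLik(refitML(nullsim))))
})
mean(lrt_sim >= LRTobs)                     # step 4: simulated p-value, cf. (6.3)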
Let us examine the null distribution of LRT for the data on comparison of stools.
Example 6.3. (Comparison of ergonomic stools, continued) Figure 6.2 shows the results from 5000
simulated datasets corresponding to the ergostool data. The data are simulated from the null model,
where there is no difference between stool types. The left plot shows a scaled histogram of LRT
together with the density of χ23 in red. The density seems to approximate the distribution reasonably
well. Consider also U = F (LRT ) where F is the CDF for χ23 . If LRT ∼ χ23 then U and 1 − U are
uniform on (0, 1), and the graph in the middle shows a scaled histogram of 1 − U together with the
horizontal line with y = 1. We plotted 1 − U rather than U because 1 − U are simulated p-values if
the χ2 approximation is used.
Finally, the graph to the right shows a QQ-plot against
the χ23 distribution, i.e., sorted values of LRT on the x-axis and quantiles from the χ23 distribution on
the y-axis, together with the identity line.
From the figures we conclude that the χ23 approximation is not perfect in the tail, but not completely
off. We reject too often, in this case with a probability of 7% if we use the χ2 approximation and a
5% significance level. The 95% quantiles are 8.75 and 7.81 for the simulated null distribution and the
χ23 distribution, respectively, so if LRT had fallen between those values, the binary conclusion of test
(reject or not-reject) would have differed depending on the choice of reference distribution: Recall that
the observed value of LRT was 36, and notice from the histogram that the simulated values do not
exceed 23, so there is no doubt about the conclusion: The hypothesis is clearly rejected.
Figure 6.2: Simulated null distribution of LRT for the ergonomic stool data: scaled histogram of the
simulated LRT values with the χ²₃ density (left), scaled histogram of 1 − pchisq(LRT, df=3) (middle),
and QQ-plot of the simulated LRT values against the χ²₃ distribution (right).
Let us consider a more complicated example where the chi-square distribution is clearly not appropri-
ate.
Example 6.4. (Sugar beets) The example comes from Halekoh and Højsgaard (2014) which is about
the methods implemented in the pbkrtest package in R. The data are available in the package as
the dataset beets. An experiment was carried out in order to study the effect of sowing time (five
different dates) and harvest times (two different dates) on the yield of sugar beets. There were three
blocks (replications), each split into two so-called wholeplots, which were again split into five subplots.
One wholeplot in each block was harvested early, the other late, in such a way that all subplots in a
wholeplot were harvested at the same time. On the other hand, Sow time was varied at subplot level,
such that all five sowing dates occurred once in each wholeplot. At harvest the sugar percentage and
yield were measured at each subplot, so there is a total of 30 observations.
A diagram of the layout is shown in Halekoh and Højsgaard (2014, page 4). Take a look at it if you do
not understand the structure of the experiment from the above explanation! Figure 6.3 shows the factor
diagram including Block, Sow time, Harvest time, and Wholeplot. Subplot corresponds to the identity
factor I since observations are taken at subplots. The important feature is that Harvest time and Block
are coarser factors than Wholeplot!
Figure 6.3: Factor diagram for the sugar beet data, with brackets around random factors.
We are going to analyse the sugar percentage, and Figure 6.4 shows a graph of the average sugar
percentage for each of the ten combinations of harvest time and sowing time.

Figure 6.4: Average sugar percentage for the sugar beet data.

We include Harvest time
and Sow time as fixed effects in our model. We could very well have included their interaction (and
then it should have been included in the factor diagram), too, but for simplicity we stick to the main
effects. Block would usually be considered a random factor, but there are only three blocks, which
makes estimation of a block-to-block standard deviation inaccurate, and we include it as fixed instead.
Finally, Wholeplot with six levels is included as a random effect. In that way, we allow for correlation
between observations from the same wholeplot.
In summary, the model is the one fitted with the following command:
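(A sketch; sugpct and sow are the assumed names of the sugar percentage and sowing time variables in the beets data from the pbkrtest package.)

library(lme4); library(pbkrtest)
data(beets)
beets$wholeplot <- with(beets, harvest:block)   # product factor, cf. the remark below
beetfit <- lmer(sugpct ~ harvest + sow + block + (1 | wholeplot), data = beets)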
Notice that the variable wholeplot is not in the original dataset, but can be constructed as the
product factor harvest:block. You could also simply write (1|harvest:block) instead of
(1|wholeplot), but I personally prefer to construct variables with appropriate names in order to
make the code easier to read later.
The model assumes EY ∈ LHarvest + LSow + LBlock. We are interested in the effects of sowing time
and harvest time, i.e., in the hypotheses

    H0,1 : EY ∈ LHarvest + LBlock (no effect of sowing time),
    H0,2 : EY ∈ LSow + LBlock (no effect of harvest time).
Likelihood ratio tests with chi-square approximations give LRT = 85 and p < 0.0001 for H0,1 (sow
effect) and LRT = 12.9 and p = 0.0003 for H0,2 (harvest effect), so there is large evidence for effects
of both sowing time and harvest time.
But: Can we trust the p-values computed with the chi-square approximations? Figure 6.5 shows the
histograms for 5000 simulated values of LRT for H0,1 to the left and for H0,2 to the right. Consider the
left graph first, corresponding to H0,1 testing for a Sow time effect. The red curve is the density for the
χ24 distribution (why is that the relevant one?), and we see that it does not approximate the histogram
very well. The 95% quantile in the simulated distribution is 12.7, while it is 9.49 in the χ24 distribution.
However, it does not make any difference for the conclusion in this case because the observed value
LRTobs is as large as 85, so the p-value is essentially zero, no matter how it is computed.
Then, consider the graph to the right, corresponding to H0,2 testing for a Harvest time effect. Here,
the red curve is the χ21 density and we see that it fits the simulated distribution of LRT very poorly:
The simulated distribution has much more probability mass at larger values than χ21. In particular, the
observed value of LRT, 12.9, shown by the vertical blue line, is very extreme in the χ21 distribution, but not so in
the simulated distribution. The simulated p-value, as computed by formula (6.3) is 0.034. Even though
the hypothesis is still formally rejected at the 5% significance level, the evidence is not overwhelming
as before, so here it makes a huge difference whether we use the chi-square approximation or the
simulated distribution. Notice that the chi-square approximation is anti-conservative in the sense that
it leads to too many rejections, i.e., too many ”false positives” (type I errors). The 95% quantile in the
simulated distribution is 11.5 while it is 3.84 in the χ21 distribution.
The reason is essentially the same as in Example 6.2 about textbooks: Harvest time is varied at whole-
plot level, and since there are only six different wholeplots, there are only three independent replicates
for each harvest time. Therefore, it is not so surprising that the asymptotics has not set in. It is actu-
ally more surprising (to me, at least) that the chi-square distribution is also fairly inaccurate for H0,1
concerning the Sow time effect.
Figure 6.5: Simulated values of LRT (histograms) for the sugar beet data, the asymptotic distributions
(red curves) and the observed LRT (blue line, only for the Harvest time effect).
Figure 6.6 shows what happens to the sampling distribution of LRT for H0,2 when the number of
blocks and thus wholeplots increases. Each graph is based on 2000 simulated datasets with three (left),
ten (middle), or twenty-five (right) blocks. The top panels show histograms of the simulated values
of LRT and the density for χ21 ; in the bottom panels we have transformed the values with one minus
the CDF for χ21 , corresponding to simulated p-values if χ21 is used as reference distribution. With more
blocks, it is natural to include block as a random rather than fixed effect, so we considered a model
with random effects of both block and wholeplot. For that reason the upper left histogram is not exactly
comparable to the right histogram in Figure 6.5. It is evident that the chi-square distribution becomes
a better approximation when the number of blocks increases, and with 25 blocks the approximation
appears to be quite accurate.
You are encouraged to make a similar exercise for the hypothetical situation with textbooks from
Example 6.2. You will have to choose parameter values for the expected values and variances, and
implement a simulation routine yourself since there are no real data associated to the example. You
can check how the distribution of the LRT changes if the number of classes increases.
Example 6.4 showed us that the chi-square approximation to the distribution of the LR test statistic may
be inadequate, and that the discrepancy may have severe impact on the conclusions. Therefore, it is
recommended to use simulated p-values when the sample size — ”defined” appropriately and possibly
different for different hypotheses — is not large.
We explained above how data can be simulated from a given lmer model object (with the function
simulate) and then used for computation of simulated p-values. Actually, the function PBmodcomp
from the pbkrtest package (Halekoh and Højsgaard, 2014) does the whole simulation with just one
command:
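(A sketch; nsim controls the number of simulated datasets.)

library(pbkrtest)
PBmodcomp(fullModel, nullModel, nsim = 1000)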
Here, fullModel and nullModel are the model fits corresponding to the full and the null model,
respectively. REML fits are automatically refitted with ML, so you can plug in ML or REML fits as
you please. Even if it is easy to just use the PBmodcomp function, it is recommended to carry out such
a simulation study ”manually” (at least once in your lifetime) to make sure that you understand the
mechanics. This also makes it possible to simulate data from the model fitted with REML rather than
ML (still using ML for the computation of LRT on both observed and simulated data).

Figure 6.6: Simulated values of LRT in three scenarios where the number of blocks varies from 3 (left)
over 10 (middle) to 25 (right). See the text for the specific set-up.
If the hypothesis can be described with a single restriction on β, then it is common to consider Wald tests.
Consider for example the hypothesis H0 : βk = βk,0 about the kth coordinate in β. Then, the Wald test
statistic is

    W = (β̂k − βk,0) / SE(β̂k).
Asymptotically, W has a N (0, 1) distribution under the null, but there are potentially the same issues
regarding the validity of the asymptotic distributions for the Wald test statistic as for the likelihood
ratio test statistic, so we must be a bit careful. As usual, Wald test statistics can also be defined for
hypotheses concerning several simultaneous parameter restrictions.
Example 6.5. (Sugar beets, continued) For the sugar beet data in Example 6.4, hypothesis H0,2 about
the effect of harvest data could also be expressed as δ = 0 where δ is the contrast between the two
harvest dates. If the level harv1, say, is used as a reference level in the parameterization, then δ would
be one of the components of β. REML estimation yields δ̂ = −0.1133 (SE 0.0291) and W = −3.900.
The standard normal approximation to W gives a two-sided p-value of 9.6 · 10−5 , but simulations show
that W has much larger tails than the standard normal. This is left as an exercise. Actually, it holds
that W has a t distribution with just two degrees of freedom. This gives a p-value of 0.06, which is well
in line with the simulated results about the LR test above. We return to the exact distributional result
in a moment.
You have probably already noticed that the command summary(modelfit) does not report p-
values associated with each of the fixed-effects parameters when modelfit is fitted with lmer. This
is because the authors of the lme4 package believe it is generally too dangerous to use the asymptotic
results. There are other packages for linear mixed models that report p-values based on t distributions.
This is the case for the nlme package (Pinheiro et al., 2022) and the lmerTest package (Kuznetsova
et al., 2017). Since the t distribution is not always the exact distribution of the Wald test statistic, the
authors behind lme4 disregard those as well. We go a bit more into detail about this in the next section.
6.4 Exact and approximate F -tests and t-test for fixed effects
When the data structure is particularly nice, in a sense that will be described below, then it is possible
to test hypotheses with exact F tests. This means that a certain test statistic can be shown to have an
exact F distribution under the null, not just approximately or asymptotically. We are not going to go
into all details about this, but interested readers are referred to Tjur (1984).
Let us start by looking at the sugar beet data again.
Example 6.6. (Sugar beets, continued) If we are only interested in the hypothesis H0,2 concerning the
harvest dates, then we could proceed as follows as an alternative to the likelihood ratio test from Ex-
ample 6.4: First, compute the average outcome over the five different sowing times for each wholeplot.
This gives a dataset with just six observations (one per wholeplot), where variables Harvest time and
Block are retained, but no Sowing time variable is present since we have averaged over it.
An appropriate model for the averages would be a linear normal model with effects of harvest and
block (no random effects!). The hypothesis of no Harvest time effect can be carried out as an exact
F test with (1, 2) degrees of freedom. Notice that there is a bijection between the F test statistic and
the LR test statistic in the linear normal model, but F has the advantage that its null distribution is an
exact F distribution while the likelihood ratio test statistic is only asymptotically χ21 distributed. We
get an observed F value of 15.21 and the p-value 0.06 — well in line with simulated p-value from
Example 6.4. Since the hypothesis only involves a single parameter restriction, the test could also be
carried out as a t test with two degrees of freedom on √F.
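A sketch of the computation, using the beets data and the variable names from the lmer sketch above:

avg <- aggregate(sugpct ~ harvest + block, data = beets, FUN = mean)   # six wholeplot averages
lmfit <- lm(sugpct ~ block + harvest, data = avg)                      # linear normal model
drop1(lmfit, test = "F")                                               # exact F test with (1, 2) df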
It is not a coincidence that the F and t tests give a p-value which is similar to the simulated p-value
from the linear mixed model. Actually, it can be shown that if we allow for negative values for the
variance for the wholeplot-to-wholeplot variation, then there is a bijection between the likelihood ratio
test statistic from the original linear mixed model and the F test statistic from the linear normal model
for the averages. In other words, the two tests are equivalent, and 0.06 is in that sense the correct
p-value also for the likelihood ratio test in the linear mixed model.
At first sight, it is perhaps strange that the tiny dataset with just six observations contains just as much
information about the potential effect of harvest dates as the full dataset. The reason is that each observation in the small
dataset is an average and thus more precise than the original measurements.
The ”trick” with the average data is only appropriate because the same sowing dates are used in all
wholeplots, and equally many times in each wholeplot (here just once). This implies that the Sowing
time factor and the Wholeplot factor are (geometrically) orthogonal as defined in Hansen and Tolver
(2023, Section 5.3) or Tjur (1984).2
Due to this orthogonality, and because each harvest date is tested in each wholeplot, there is a unique
decomposition of Rⁿ into orthogonal subspaces, one for each factor in the design, and the sum is a
direct sum. If we denote orthogonal projections onto the V spaces with Q, then the F statistic from
above can be written as

    F = ( ||QHarvest Y||²/1 ) / ( ||QWholeplot Y||²/2 ) ∼ F(1, 2).   (6.4)
The numbers 1 and 2 are the dimensions of subspaces VHarvest and VWholeplot , respectively.
The intuition is the following: Harvest date is varied at wholeplot level, so differences between harvest
dates should be compared to differences between wholeplots in order to assess the (statistical) impor-
tance of harvest date, and the norms in (6.4) measure these differences after other effects have been
”subtracted” (V spaces are orthogonal). We would sometimes say that the effect of Harvest date is
tested against Wholeplot, or that the Harvest time is tested in the Wholeplot error stratum.
The approach with exact F tests can be generalized to other ”nice” situations. We consider models
that involve factors only (no numerical covariates). Then, the model is characterized by the sets of
fixed and random factors in the model. Let T and G denote the set of fixed and random factors,
respectively. Include the identity factor I in G; this corresponds to the residual term ε in the model.
Furthermore, let D = T ∪ G. For the beets data, we have T = {Harvest, Sowing, Block} and
G = {Wholeplot, I}.
The requirements for the sets of factors are the following:
– Any two factors in D must be (geometrically) orthogonal;
– For any two factors in D, their minimum must be in D as well;
– For any two factors in G (random factors), their minimum must be in G as well;
– All random factors are balanced, i.e., have the same number of observations for all factor levels;
– The n × n matrices Xg Xgᵀ, g ∈ G, are linearly independent, where Xg is a matrix that spans the
subspace Lg associated to the factor g.
The minimum of two factors f1 and f2 is defined as the finest factor which is coarser than both f1 and
f2 . The second and third assumptions say that D and G are closed under formation of minima, and
it is sometimes necessary to include minima in T and/or G in order to have the assumptions satisfied.
You can check yourself that factors in the model for the beets data satisfy the conditions. Two factors
are orthogonal if the number of observations at each level of their product factor satisfies the so-called
balance equation, see Hansen and Tolver (2023, Section 5.3) or Tjur (1984, Section 3.2). This is easy
to check, and is in particular fulfilled if the product factor is balanced. The last assumption is almost
always fulfilled and has to do with identifiability of the variance parameters.
2 Hansen and Tolver (2023) uses the term geometrically orthogonal while Tjur (1984) just uses orthogonal.
The first two assumptions imply that D is a so-called orthogonal design. In that case there is a unique
decomposition of Rⁿ,

    Rⁿ = Σ_{f∈D} Vf,

where each Vf is a subspace of Rⁿ and the Vf s are orthogonal. The sum is a direct sum. Due to the
third assumption, G also comprises an orthogonal design, and there is a similar decomposition of Rⁿ
consisting of orthogonal subspaces for each random factor:

    Rⁿ = Σ_{g∈G} V⁰g.
For each fixed factor t ∈ T there is exactly one g such that Vt ⊆ Vg0 (unless dim(Vt ) = 0 in which
case the factor t is irrelevant). This factor g is the coarsest random factor which is finer than t, and we
say that factor t is in the g stratum.
One can show (Tjur, 1984) that the linear mixed model with factors in T as fixed effects and factors
in G as random factors is equivalent to the set of independent linear normal models for Q0g Y (g ∈ G)
where Q0g is the orthogonal projection onto Vg0 . This means that analysis of the linear mixed model
can be carried out by analyses of each of these linear normal models. In other words, the analysis can
be carried out as separate analyses in each stratum.
Now, consider a hypothesis corresponding to removal of a single fixed-effects factor, t, and assume
that it is in the stratum corresponding to the random factor g. The point is now that the hypothesis
only involves stratum g, and it is therefore carried out in the associated linear normal model. Linear
hypotheses in linear normal models are usually tested with F tests, and the F distribution is exact.
More specifically, the F test statistic is

    F = ( ||Qt Y||² / dim(Vt) ) / ( ||Qg Y||² / dim(Vg) ),   (6.5)

where Qt and Qg denote the orthogonal projections onto the subspaces Vt and Vg, respectively, and F has an
exact F distribution with (dim(Vt), dim(Vg)) degrees of freedom.
It is debatable how important these results are.3 In the old days, the decomposition of the linear mixed
model into linear normal models in error strata was extremely important because there are closed-form
solutions for the estimators. However, as we have seen, there are now efficient computations that do not
rely on the decomposition. Moreover, some of the assumptions are quite restrictive, and not fulfilled
for many datasets that you meet in practice. For observational datasets, the assumptions are barely ever
met, and even for well-designed studies the conditions on orthogonality and balancedness are destroyed
if observations are missing or otherwise corrupted.
The strongest argument for the importance of the results is, in my opinion, situations like the test for a
Harvest date effect in Example 6.4 on sugar beets. The chi-square approximation is really bad in this
case, and even anti-conservative such that we are more likely to reject a true hypothesis. The F test,
on the other hand, is exact. If the conditions are only ”almost fulfilled”, then the test statistic in (6.5)
is only approximately F distributed. There are several suggestions for computations of approximate
p-values in such cases.
3 You may for example read this harsh comment from Douglas Bates on error strata and computation of p-values: https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html
One method is known as Satterthwaite’s method. Consider a linear hypothesis of the form Lβ = 0,
where L is a full-rank contrast matrix of dimension r × p. In other words, the hypothesis imposes r
linear restrictions on β. The Wald test statistic for the hypothesis is
W = (Lβ̂)T (LĈLT )−1 (Lβ̂)
where β̂ is the ML/REML estimator for β and Ĉ is the associated estimated variance-covariance matrix
of β̂. Asymptotically, W has a χ2r distribution (as does the likelihood ratio test statistic). Satterthwaite’s
method approximates the distribution of F̃ = W/r with an F distribution with (r, ν) degrees of free-
dom. The definition of ν is the crucial part of the method, and it relies on some moment considerations.
In the orthogonal case, and if the hypothesis Lβ = 0 corresponds to removal of a factor from the model,
then F̃ coincides with F from (6.5), and ν = dim(Vg ), so there is no approximation. Satterthwaite’s
method is implemented in the lmerTest package (Kuznetsova et al., 2017).
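As a small illustration of the lmerTest interface (a hedged sketch using the sleepstudy data that ships with lme4, not one of the course datasets):

    library(lmerTest)                       # masks lme4::lmer; tests then get denominator df
    data("sleepstudy", package = "lme4")    # example data from lme4, only for illustration
    fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
    anova(fit)      # F test for Days with Satterthwaite denominator degrees of freedom
    summary(fit)    # t tests for the coefficients, also with Satterthwaite df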
Another method is called the Kenward-Roger approximation. It is based on a modification of F̃ , but
also uses an F approximation. This method is implemented in the pbkrtest package (Halekoh
and Højsgaard, 2014). As noted, this package also has a ready-to-use implementation of simulated
p-values. As a side remark, notice that both the lmerTest and pbkrtest packages are written by
researchers at Danish universities.
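A hedged sketch of the pbkrtest interface, again with the sleepstudy data from lme4 (not a course dataset); the two models must be nested fits of the same data:

    library(lme4)
    library(pbkrtest)
    data("sleepstudy", package = "lme4")
    large <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
    small <- lmer(Reaction ~ 1 + (Days | Subject), data = sleepstudy)
    KRmodcomp(large, small)               # Kenward-Roger approximate F test for the Days effect
    PBmodcomp(large, small, nsim = 200)   # parametric bootstrap (simulated) p-value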
There are no guarantees or guidelines about when the approximations work well and when they do not, and some people argue strongly against them for that reason. Personally, I prefer the exact F tests when they are available, and my experience is that the approximate F tests are often more reliable than the approximate chi-square test. However, whenever I have the slightest doubt about the conclusions, I compute p-values by simulation.
In most situations, the main interest lies in estimates of and hypotheses concerning the fixed-effects pa-
rameters β, while the random effects are mainly included in the model to get valid inference. In those
situations it is common to keep random effects in the model — even if estimated standard deviations
are close to zero — and not even consider hypotheses about them.
It happens every now and then, however, that hypotheses about the random part of the model are
relevant. This would be hypotheses concerning elements in Σ (or, equivalently, elements in θ). For
example, in a model with the intercept varying randomly between groups corresponding to a factor
f , we could be interested in the hypothesis that τf = 0 where τf is the standard deviation for the
associated random effects. Under the null, the random factor is excluded from the model. The test
for this hypothesis is unpleasant because zero, the value under the null, is at the boundary of the
parameter space, such that the usual asymptotic results for the LRT do not hold. In orthogonal designs, the hypothesis can be tested with an exact F test, but in other situations we would rely on simulations for computation of the p-value.
As another example, consider a model with both random intercept and random slope associated with
an effect pair (f, x). Although it is usually suggested to allow for correlation between the random
intercept and the random slope, one might consider the hypothesis of independence between the two.
Since the correlation is allowed to vary in (−1, 1), there is no boundary problem, and the chi-square
approximation would often be appropriate for this hypothesis.
For the likelihood ratio test for fixed effects, it was absolutely essential that the LRT was computed with ML estimation. For tests concerning random effects, on the other hand, we may consider the likelihood ratio test statistic based on the REML likelihood instead. This is in fact often recommended, as the chi-square approximation is sometimes better and the statistical power larger.
We now carry out a simulation study to make things more concrete.
Example 6.7. (Language and verbal IQ, continued) We use the language test data from Examples 1.3 and 2.2 for our examinations. We previously suggested a model with school-specific intercepts and
slopes,
Yij = (β0 + Bi,0 ) + (β1 + Bi,1 )xij + εij
where xij and Yij are the verbal IQ score and the language test result, respectively, for the jth student
from school i. The random effects (Bi,0 , Bi,1 )T are assumed to be iid. N2 (0, Σ), and we consider three
specifications for the 2 × 2 variance-covariance matrix Σ:
Σ1 = ( τ0²   τ0,1 )      Σ2 = ( τ0²   0  )      Σ3 = ( τ0²   0 )
     ( τ0,1  τ1²  )           (  0    τ1² )           (  0    0 )
The associated models are denoted M1 , M2 and M3 . In M1 the intercept and slope are allowed to be
correlated, whereas they are independent in model M2 . Model M3 does not include random slopes at
all.
Consider first the hypothesis test from M1 to M2 , i.e., the hypothesis H0 : τ0,1 = 0. The hypothesis
value is in the interior of the parameter space, and standard likelihood theory therefore suggests that
LRT is χ21 under M2 . This is illustrated in the leftmost plots in Figure 6.7 (top and bottom). For
the plots, we simulated 5000 datasets (outcome values only) from M2 , using the REML estimates as
parameter values. Then, we fitted models M1 and M2 , and computed the LR test statistic based on
the REML likelihood. The top figure shows the histogram for these LRT values together with the χ21
density, and the bottom figure shows a QQ-plot against χ21 . Indeed, the two distributions are essentially
identical.
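A hedged sketch of how such a null-distribution simulation can be set up with lme4; here the sleepstudy data from lme4 is used instead of the language test data, and only 200 simulations are run, so the code only illustrates the procedure:

    library(lme4)
    data("sleepstudy", package = "lme4")
    m1 <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)    # correlated intercept and slope
    m2 <- lmer(Reaction ~ Days + (Days || Subject), data = sleepstudy)   # independent intercept and slope
    lrt <- replicate(200, {
      ysim <- unlist(simulate(m2))                                       # simulate outcomes under the null model
      as.numeric(2 * (logLik(refit(m1, ysim)) - logLik(refit(m2, ysim))))  # REML-based LRT
    })
    qqplot(qchisq(ppoints(200), df = 1), lrt)                            # compare with the chi^2_1 distribution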
Next, consider the model reduction from M2 to M3 , corresponding to the hypothesis H0 : τ1 = 0. In
this case, the hypothesis value is at the boundary of the parameter space, and we cannot rely on the usual
results from likelihood theory. If τ1 was allowed to vary in all of R, but the true value was zero, then we
would expect to get a positive or negative estimate with the same probability, 0.5. When τ1 is restricted
to be positive, we could therefore expect to get τ̂1 = 0 with probability 0.5 and τ̂1 > 0 with probability 0.5 when M2 is fitted to data generated under M3. Conditional on τ̂1 > 0 we would expect an asymptotic χ²1 distribution of the LRT (no
boundary problem), whereas τ̂1 = 0 implies LRT = 0. Altogether, the asymptotic distribution of
LRT becomes a 50:50 mixture of the one-point distribution at zero and the χ21 distribution.
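A p-value under this mixture is computed by halving the χ²1 tail probability (a tiny hedged sketch; lrt_obs is a placeholder for an observed LRT value):

    lrt_obs <- 3.2                                               # placeholder value, for illustration only
    pval <- 0.5 * pchisq(lrt_obs, df = 1, lower.tail = FALSE)    # the point mass at zero contributes nothing when lrt_obs > 0
    pval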
This is illustrated in the middle figures: 5000 datasets were simulated from M3 , and LRT was com-
puted every time. Notice that roughly half the datasets gave LRT = 0 (leftmost bar in the histogram).
The red curve in the top graph is half the density for χ21 . The bottom plot shows a QQ-plot, comparing
the simulated distribution and the mixture distribution. The distributions are quite similar, although there are also deviations.
Finally, consider the model reduction from M1 to M3, corresponding to H0 : τ1 = τ0,1 = 0. As before, the hypothesis value is at the boundary of the parameter space, but there is an extra comment to make: correlation between Bi,0 and Bi,1 makes no sense if τ1² = Var(Bi,1) = 0, so the parameter τ0,1 “disappears” when τ1 = 0, and the dimension of the reduction from M1 to M3 is not well-defined: is it one or two? It turns out that the asymptotic distribution of the LRT is a 50:50 mixture of χ²1 and χ²2.
[Figure 6.7 here; panels "M2 against M1", "M3 against M2" and "M3 against M1", with the simulated LRT values on the horizontal axis.]
Figure 6.7: Simulations for tests of random effects. The reference distributions are χ²1 (left), the 50:50 mixture of the one-point distribution at zero and χ²1 (middle), and the 50:50 mixture of χ²1 and χ²2 (right).
This is illustrated in the rightmost plots, where data are simulated from M3 and the just-mentioned mixture is used as the reference distribution. It is somewhat surprising that LRT = 0 (τ̂1 = 0) does not occur in practice; on the other hand, we often get an estimated correlation between intercept and slope of ±1 (not shown in the graph), which is also at the boundary. The simulated distribution and the reference distribution are almost identical.
As mentioned, we used the REML likelihood for computation of LRT when we produced the graphs
above, but there are hardly any changes if we use the actual likelihood instead (not shown).
The reference distributions mentioned in the example are actually correct asymptotic distributions for the hypotheses we considered (as the number of schools or, more generally, subjects increases). For more
complex models or more complex hypotheses, the asymptotic distribution of the LR test statistic can
be more complicated, see Stram and Lee (1994) for some results. In the case of orthogonal designs
(cf. Section 3.3) and removal of a single random factor, the test can be carried out as an exact F test
(Tjur, 1984), and approximations to the finite-sample null distributions have been developed for more
general situations (Greven et al., 2008). However, as with tests for fixed effects, it is recommended to
compute the p-value by simulation if there is any doubt.
7 Variations of linear mixed models
So far, we have studied LMMs with iid. residuals and with Gaussian distributions for both random
effects and residuals. The model was stated in (2.2) and is repeated here for convenience:

Y = Xβ + ZB + ε,   B ∼ Nq(0, Στ),   ε ∼ Nn(0, σ²In),   B and ε independent.    (7.1)
We now briefly discuss possibilities for generalizations of that model. We still consider the situation
with a continuous outcome; the models for discrete outcome are referred to as generalized linear mixed
models and are treated in Chapter 8.
One assumption in (7.1) is that all εi s have the same standard deviation. However, it is often the case that the residual standard deviation increases with the expected value, in particular if E(Yi) varies over a large range. This is not special to LMMs, and the inhomogeneity is usually spotted in the residual plot. A common solution is to apply a transformation t to the outcome and use an LMM for t(Y) instead of Y. This is often a splendid solution — but not always, and for the same reasons as in (non-mixed) linear models: First, it may not be possible to find a transformation for which variance homogeneity is satisfied. Second, transformation changes the interpretation of the fixed effects. This may be desirable in itself; in particular, log-transformation has the effect that covariate effects are relative/multiplicative rather than absolute/additive, which also makes more sense in many situations. In other cases, however, there is a wish to keep the additive structure for Y itself.
Another solution is to allow for variance heterogeneity in the model. It could turn out, for example,
that the residual variance is larger in a treatment group compared to a control group (or vice versa), or
that the variance changes with time. In these situations, we model the residual variance Var(εi ) as a
function of some categorical or numerical explanatory variables. Another option is to assume a specific parametric functional relation between the residual variance and the mean, Var(εi) = σ² v(E(Yi), α), where v is a known function which depends on an unknown parameter α that must be estimated along with the other parameters in the model.
The lmer function can only handle situations where Var(εi ) is assumed to be proportional to an
observed numerical explanatory variable (covariate). Models with group-specific residual standard
deviations, i.e., models where Var(εi ) depend on categorical variables, can be fitted with glmmTMB
from the package with the same name (Brooks et al., 2017) or lme from the nlme package (Pinheiro
et al., 2022). Even more models can be fitted with lme. The syntax for lme is slightly different from that for lmer, especially if the model includes non-nested random factors.
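A hedged sketch of both approaches, using the Orthodont data that ships with nlme (not a course dataset), where the residual standard deviation is allowed to differ between boys and girls:

    library(nlme)
    fit_lme <- lme(distance ~ age * Sex, random = ~ 1 | Subject, data = Orthodont,
                   weights = varIdent(form = ~ 1 | Sex))       # group-specific residual SDs
    library(glmmTMB)
    fit_tmb <- glmmTMB(distance ~ age * Sex + (1 | Subject),
                       dispformula = ~ Sex, data = Orthodont)  # dispersion modelled by Sex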
7.2 Correlated residuals
Consider a situation where the outcome consists of measurements over time for several people who are allocated to two treatment groups. This is a typical example of longitudinal data, also called panel data (in econometrics) or simply repeated measurements. A simple model could include treatment group, time and their interaction as fixed effects and person ID as a random effect (random intercept). A possible model formula is sketched below.
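In lmer syntax, such a model could look roughly as follows (a hedged sketch; the data frame and variable names are made up for illustration):

    library(lme4)
    # Hedged sketch; 'longdat', 'y', 'group', 'time' and 'person' are made-up names.
    fit <- lmer(y ~ group * time + (1 | person), data = longdat)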
The assumption of independence of the residuals in (7.1) implies that there is the same correlation for any pair of measurements from the same person, no matter if the observations are close in time or far apart in time. However, it often appears more reasonable to assume that the correlation is a decreasing function of the time difference, for example

Corr(εi, εj) = c · exp(−∆tij / d),

where c and d are unknown parameters to be estimated and ∆tij is the time difference between observations i and j from the same person. We say that the model allows for serial correlation.
An alternative to choosing a specific model for the serial correlation is to leave the correlations unspeci-
fied, only assuming that the variance-covariance matrices for each subject are valid variance-covariance
matrices, i.e., positive semi-definite. This approach is possible if measurements are taken at the same
time points for all subjects, such that there are replications to use for estimation of the correlations.
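With lme, serial correlation is requested through the correlation argument; a hedged sketch using the Orthodont data from nlme (not a course dataset):

    library(nlme)
    fit_car1 <- lme(distance ~ age * Sex, random = ~ 1 | Subject, data = Orthodont,
                    correlation = corCAR1(form = ~ age | Subject))   # exponentially decaying correlation
    fit_sym  <- gls(distance ~ age * Sex, data = Orthodont,
                    correlation = corSymm(form = ~ 1 | Subject))     # unstructured correlation (marginal model, no random effects)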
A phenomenon similar to serial correlation occurs with measurements taken over (geographical) areas: Measurements taken at positions close to each other must be expected to be more correlated than measurements taken at positions far away from each other, and we would therefore like to include spatial correlation in the model.
With a few exceptions, lmer cannot be used for models with correlated residuals, and lme or glmmTMB
must be used instead.
The model given by (7.1) assumes both random effects B and residuals ε have normal distributions.
Then, the marginal distribution of Y is normal as well, and there is an explicit formula for the likelihood function. If one makes other (non-Gaussian) assumptions about the distributions of B and ε, then ML estimation is still possible in principle — but the marginal distribution of Y is most likely not explicitly available. This is because the density of Y can be written as an integral of the joint density for (B, Y), namely as f(y) = ∫ f(y | b) f(b) db, where we have been sloppy with the notation and used f generically for densities. When f(y | b) and f(b) are not normal densities, there is typically no closed-form solution to the integral.
Numerical integration is therefore necessary in order to compute the likelihood. This is implemented in many software systems, but typically only for Gaussian random effects and for exponential family response distributions. In R, the function glmer is the relevant one. It is typically used for generalized linear mixed models, see Chapter 8, but one can also choose Gamma or inverse Gaussian residuals and identity link.
Personally, I use these models very rarely, if ever. I am not aware of software (R packages) that can
handle non-normal random effects.
If we do not specify distributions of B and ε in the model, but only assume Y = Xβ + ZB + ε, E B = 0 and E ε = 0, then we can obviously not carry out likelihood inference. One possibility is to carry out estimation as if B and ε were Gaussian, i.e., proceed exactly as for the Gaussian model, see Jiang and Nguyen (2021, Sections 1.4.1 and 1.8.3). The ”ML” and ”REML” estimators still have nice asymptotic properties, but it is important to realize that they are not ML and REML estimators without the normality assumptions. In particular, the variance of the estimators cannot be assessed with the inverse Fisher information.
Consider a numerical explanatory variable (covariate), x. The specification of the mean, E Y = Xβ,
forces us to choose a parametric model for the effect of x on E Y . This effect can be non-linear in x
if we include terms like polynomials or other functions of x, but we must include such transformed
values in the model matrix X. If we prefer more flexibility, we can include smooth terms.
Smooth terms are usually included via splines (or other basis functions). Let h1, . . . , hK be a set of spline functions; then we allow for a term of the form

Σ_{k=1}^K ck hk(xi)

in the model for E Yi. If columns with hk(xi) are included in X, then we are still dealing with a linear mixed model, and we can proceed as usual.
There is a twist, though, and even one with association to mixed models: The degree of flexibility
is controlled by the spline functions, in particular by the number of spline functions. A large spline
basis allows for great flexibility (nice), but also a great risk of overfitting (not so nice). The preferred
solution is to choose a large basis, but combine with shrinkage or penalization methods in order to
avoid overfitting. A common solution is to treat spline coefficients ck as random variables, i.e., as
random effects (Wood, 2011). Then the variance of the ck s controls the degree of flexibility, and the
appropriate degree of smoothness can thus be estimated from the data with REML methods (or ML).
This is implemented, but not used as default, in the gam function in the mgcv package which is most
often used for generalized additive models (GAMs, Wood (2017)).
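A hedged sketch of this approach with mgcv, using simulated data so that the code is self-contained: a smooth effect of x plus an ordinary random intercept for the factor id, with the amount of smoothing chosen by REML:

    library(mgcv)
    set.seed(1)
    dat <- data.frame(x = runif(200), id = factor(rep(1:20, each = 10)))
    dat$y <- sin(2 * pi * dat$x) + rnorm(20, sd = 0.5)[dat$id] + rnorm(200, sd = 0.3)
    fit <- gam(y ~ s(x) + s(id, bs = "re"), data = dat, method = "REML")
    summary(fit)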
Finally, let us briefly mention non-linear mixed models (NLMMs). This is the extension of non-linear regression models where some of the parameters are equipped with both fixed and random effects. Growth data are often analysed with NLMMs. For example, Figure 7.1 shows data from an experiment with soybeans. The data are described in Pinheiro and Bates (2000, Section 6.3). Two varieties (F/P) were planted in eight field plots each. Every week, six plants were chosen at random from each plot, and the outcome y is the average leaf weight over the six plants. The experiment was repeated over three years, so there are data from a total of 48 plots. Plants appear to grow exponentially in the beginning, but the curves eventually level off.
It is common to model such curves with the logistic function1 , and assume that
yij = φ1i / (1 + exp(−(tij − φ2i)/φ3i)) + εij
where yij is the average leaf weight in plot i at time tij. The plot-specific parameters are φ1i (the asymptote as tij increases), φ2i (the time where the curve is steepest), and φ3i (which determines the steepness). It
is natural to assume that the parameters are different for the two varieties — the research question is
to examine this potential difference — and between years, but also that each plot has its own set of
parameters, varying at random. In other words, we would include Variety and Year as fixed effects and
Plot as random effect, and assume that
φ1i = β1,V(i) + β1,Y(i) + b1i
φ2i = β2,V(i) + β2,Y(i) + b2i
φ3i = β3,V(i) + β3,Y(i) + b3i
where the βs are fixed-effects parameters describing the effects of variety and year, and the random plot effects satisfy bi = (b1i, b2i, b3i)T ∼ N(0, Σ). With slight abuse of notation, Y(i) denotes the year for observation i, not to be confused with the random variable Yi associated with yi.
Notice from Figure 7.1 that there is clear heteroscedasticity in the data: the variation is much larger at later days, when the leaf weight is large, compared to early days, where the leaf weight is small. A good model must therefore have Var εij increasing with E Yij in a suitable way.
[Figure 7.1 here: average leaf weight plotted against time since planting (days) for the soybean plots, varieties F and P.]
NLMMs can be fitted with the function nlmer from the lme4 package or with nlme from the nlme
package. The models are difficult from a numerical point of view and my experience is that it is not
always possible to fit the models that you would like to use.
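A hedged and much simplified sketch with nlmer, using the built-in self-starting logistic function SSlogis and the Soybean data from the nlme package; only the asymptote is taken as random here, and the starting values are my own rough guesses, so the full model with variety and year effects and all three parameters random would require more work:

    library(lme4)
    data(Soybean, package = "nlme")
    # Hedged sketch: simplified NLMM with a random asymptote per plot.
    fit <- nlmer(weight ~ SSlogis(Time, Asym, xmid, scal) ~ Asym | Plot,
                 data = Soybean,
                 start = c(Asym = 20, xmid = 55, scal = 8))   # rough guesses for the starting values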
1 Even though the logistic function is used to model the curves, the model is usually not referred to as a logistic regression model. That phrase is reserved for the case with binary or binomial data.
8 Generalized linear mixed models
So far, we have considered models using the normal distribution as the sampling distribution and thus
implicitly assumed that the outcome is continuous. We now turn our attention to discrete outcomes.
For independent data, this means shifting from linear (normal) models to generalized linear models
(GLMs), and when random effects are incorporated to take dependence into account, we shift from
linear mixed models to generalized linear mixed models (GLMMs).
GLMs assume an association, described in terms of a link function, between the expected value EY
of the outcome and a linear predictor Xβ. In brief, GLMMs allow for addition of random effects on
the linear predictor scale, such that the linear predictor has the form Xβ + ZB where B is a random
variable, most often assumed to be Gaussian. In other words, the main difference between LMMs and
GLMMs is the assumptions concerning the distribution of Y given B.
GLMMs can be fitted in R with the function glmer from the lme4 package. The syntax combines
the syntax from lmer regarding the specification of fixed and random effects and the syntax from glm
regarding the family of (conditional) distributions.
Let us start out by revisiting the spring barley data example, this time recognizing that the data are actually based on binary trials (infection or not).
Example 8.1. (Infection in spring barley, continued) Recall the data from Example 1.1 concerning
nine varieties of spring barley, randomly selected among many more varieties. For each variety, 150
seeds were inoculated (treated) with a pathogen (microorganism that can cause disease) and placed in
three different petri dishes. After germination and growth for a certain period, it was registered how
many of the plants were infected. The varieties will be denoted A–I in the following.
So far, we have analyzed the proportions as a continuous variable, and this is fairly unproblematic since only a few of the proportions are close to zero or one. However, the data are discrete by nature, and we are now going to use a logistic regression type model with random effects rather than a Gaussian model. More specifically, we now consider the data to be of the form (ni, yi) where ni is the number of plants originating from seeds in dish i and yi is the number of infected plants (i = 1, . . . , 27). Notice that ni ≤ 50 because not all seeds germinated.1
Let pi be the probability of infection in dish i. Had we been interested in differences between the
nine specific varieties, a first model could have been the GLM, where Y1 , . . . , Y27 are assumed to
1 Let me be honest and admit that I cheated here: Only the proportions were originally available, so I made up ni and yi myself based on a germination probability of 80%. In other words, I sampled each ni from Bin(50, 0.8) and then computed yi from the observed proportion. There is a difference of up to ±0.2 between the implied new proportions yi/ni and the original ones. In any case: Do not trust the results from the GLMMs to follow too much; just take them as an illustration. And do not invent data like this for a real data analysis!
be independent with Yi ∼ Bin(ni , pi ) and pi is equal to p̃A , . . . , p̃H or p̃I corresponding to the nine
varieties. Let βj denote the log-odds for infection for variety j (j = A, . . . , I), that is,
βj = log(p̃j /(1 − p̃j))   or   p̃j = e^βj / (1 + e^βj).
If furthermore ηi = log(pi /(1 − pi )) denotes the log-odds for dish i (i = 1, . . . , 27); η = (η1 , . . . , η27 )
is the collection of ηi s, and X is the model matrix corresponding to the Variety factor, then we could
write
η = Xβ.
We would often call this a logistic regression model even if the explanatory variable is categorical
rather than numerical.
We are going to modify this model in two ways. First, the nine varieties are selected at random from
a larger population, and we would like to take that sampling process into account as well. Hence, we
consider Variety as a random factor and write
η = 1_27 α + XB,

where 1_27 is the vector of 27 ones and B = (BA, . . . , BI)T ∼ N9(0, τ1² I9). Notice the intercept parameter α, which is now necessary
because EB = 0. It is interpreted as the log-odds for a ”typical variety” (or “average variety”), namely
a fictitious variety with random effect equal to zero. The Bs describe deviations from this average such
that each variety has its own level of log-odds. Element-wise, the model reads
log(p̃j /(1 − p̃j)) = α + Bj ,    j = A, . . . , I.
(family=’quasibinomial’ in glm), but such a model cannot handle both random variety-to-
variety variation and overdispersion.
Being used to the LMMs, it may seem awkward to incorporate the random dish effect, with individual
Ci s for every i. In the LMM, it would not be possible to distinguish Ci from the residual term, but
the situation is different now, because the (conditional) binomial sampling process of Yi comes as an
additional layer in the model. In other words, binomial sampling in the GLMM replaces Gaussian
sampling in the LMM.
Another way to understand the meaningfulness of the Ci s is through a reorganization of the data.
Instead of having just one data line for dish i, we could have ni datalines with a new 0/1 variable
which is 1 in yi lines and 0 in the remaining ni − yi datalines. There should also be a new variable in
the dataset, with values from 1 to 27, telling which petri dish the binary observation comes from. If
the outcome was continuous rather than binary, then incorporation of a random Dish effect would be completely natural; and so it is in the model for the binary outcome.
Notice that equation (8.1) can be rewritten as
η = 1_27 α + ( X   I_27 ) ( B ; C ),

where (B ; C) denotes the stacked vector of the variety effects B and the dish effects C.
Notice that the variables used in the model fit do not appear in the original dataset because only the pro-
portions were given, but they are available in a modified data file, barleyDat2.Rdata, at Absalon.
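A hedged sketch of the corresponding glmer call (the object and column names in barleyDat2 are my guesses and may differ from those in the actual file):

    library(lme4)
    # Hedged sketch; 'barleyDat2', 'y', 'n', 'variety' and 'dish' are assumed names.
    load("barleyDat2.Rdata")
    fit <- glmer(cbind(y, n - y) ~ 1 + (1 | variety) + (1 | dish),
                 family = binomial, data = barleyDat2)
    summary(fit)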
From the estimates (not shown here) we see that variety-to-variety and dish-to-dish variation are roughly equally important. Transformation from the log-odds scale to the probability scale gives an estimate and CI for the probability of infection for a random plant from a random variety in a random dish.
This fits fairly well with the results from the LMM analysis of the proportions, see Examples 3.2 and
4.1. Estimates of the standard deviations for random effects cannot be compared between the two
analyses since they are on the scale of proportions in the LMM model and on the log-odds scale for the
GLMM.
The fixed-effects part of the model for the spring barley data is very boring since there is only an
intercept. The next example is more typical, since it includes non-trivial fixed as well as random
effects.
Example 8.2. (Use of contraception in Bangladesh) This data example is taken from Bates (2010,
Chapter 6), and the data originally comes from a survey on fertility from Bangladesh carried out in
1989. The aim is to study how use of contraception depends on various factors, in particular to quantify
differences between use of contraception in rural and urban areas.
There are data from 1934 women, coming from 60 different districts. The following information is available for each woman: use of contraception (yes/no), age (years, centered around 18.5), number of living children (0/1/2/3+), setting (urban/rural), and district (1–60). The data are available as the data frame Contraception in the mlmRev package.
The outcome is use of contraception. The association with age and some of the other explanatory
variables is illustrated in Figure 8.1. There is not much information in a plot of the raw 0/1 values
against covariates; instead the graph shows spline fits (obtained with gam) for each combination of
urban/rural setting and number of living children.
[Figure 8.1 here: smoothed proportion of contraception use plotted against age (years, not centered), in separate panels for rural and urban settings, with curves for livch = 0, 1, 2, 3+.]
Figure 8.1: The contraception data. Profiles are coloured according to the number of living children (variable livch).
For most groups, use of contraception increases with age until a certain age and then decreases, so
we are going to use a quadratic model in age. Profiles appear to be similar for women with one or
more living children, so for simplicity we collapse those groups in the models below. Finally, use of
contraception is larger in urban compared to rural areas. None of this is surprising.
Since the outcome is binary, we are going to use a logistic regression type model. We want to incorpo-
rate differences between the 60 districts, and it is natural to consider them as random representatives
from districts as such. Hence, we are going to use a GLMM. The above considerations suggest a model formula for the fixed-effects part; a full glmer call including the random effects is sketched below. Here, use01 is the binary outcome, setting is the urban/rural variable, children (0/1+) tells whether the woman has living children or not, and age and age2 are the age and the age squared, respectively.2 For numerical reasons it is desirable to use a centered and scaled version of the squared age. We could also have chosen to include the interaction effect between squared age and whether the woman has children or not, and interactions with setting.
2 Some of these variables do not exist in the data frame, but must be constructed prior to the model fit.
For the random-effects part of the model, we are certainly going to include District (60 levels). More-
over, there are both rural and urban areas in most (but not all) districts. The product factor (or inter-
action) between District and Setting describes these sub-districts, and we include a random intercept
corresponding to that factor, too. It has 102 levels. We could also have chosen to let the coefficients
associated to age be random, but for simplicity we only allow for a random intercept.
Altogether, we fit a logistic regression type GLMM with Setting, Age (coefficient possibly differing between women with and without children) and Squared age as fixed effects, and District and Sub-district as random effects; a possible glmer call is sketched below.
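A hedged sketch, including the construction of the variables; the constructed variable names are my own choices, and the coding of children and age2 follows my reading of the descriptions above:

    library(lme4)
    library(mlmRev)
    data(Contraception)
    Contraception <- within(Contraception, {
      use01    <- as.numeric(use == "Y")                       # binary outcome
      children <- factor(livch != "0", labels = c("0", "1+"))  # collapse 1/2/3+
      age2     <- as.numeric(scale(age^2))                     # centered and scaled squared age
    })
    # 'urban' plays the role of the urban/rural setting variable.
    fit <- glmer(use01 ~ age * children + age2 + urban
                 + (1 | district) + (1 | district:urban),
                 family = binomial, data = Contraception)
    summary(fit)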
The contrast parameter for urban vs rural setting is estimated to 0.783 (SE 0.173), so the odds for using
contraception is estimated to be a factor exp(0.783) = 2.19 larger for women living in urban compared
to rural areas. As always with observational data, the results cannot be interpreted in a strict causal
way. You are encouraged to carry out the analysis yourself, and also try variations of the model.
Recall how linear mixed models (LMMs) were defined from two ingredients: (i) the distribution of the random effects, B ∼ Nq(0, Στ); and (ii) the conditional distribution of Y given B, Y | B = b ∼ Nn(Xβ + Zb, σ²In) for given model matrices X and Z. In particular, E(Y | B = b) = Xβ + Zb.
For generalized linear mixed models (GLMMs), point (i) is left unchanged, but point (ii) is made more
general. It is assumed that, (ii’) Y1 , . . . , Yn are independent in the conditional distribution of Y given
B = b, and (ii”) the conditional distribution of Yi given B = b is an (overdispersed) exponential family
distribution with linear predictor ηi = Xi β + Zi b. Here, Xi and Zi denote row i of model matrices X
and Z. In particular, if g is the link function for the exponential family, then the connection between
the mean and the linear predictor is given as
g(E(Yi | B = b)) = ηi = Xi β + Zi b.
We have seen two data examples with the binomial/Bernoulli distribution as the GLM ingredient and
the logit-link. In that case,
log(pi /(1 − pi)) = ηi = Xi β + Zi b,
where pi is the success probability in the conditional distribution given the random effects.
The second-most used GLMM is the Poisson GLMM with log-link. Here, the mean association be-
comes
log E(Yi | B = b) = ηi = Xi β + Zi b.
You are going to study the Poisson GLMM in exercises, both from a theoretical and an applied point
of view.
8.3 Estimation (Laplace and Gauss-Hermite approximations)
The assumptions behind the GLMM specify the joint distribution of (B, Y ) and therefore also the
marginal distribution of Y . In the LMMs we could find an explicit expression for the marginal dis-
tribution of Y : It was simply another normal distribution due to the nice conditioning properties of
the Gaussian distribution. It is not as easy as that for the GLMMs. The marginal distribution of Yi in
a binomial GLMM is not a binomial distribution (unless the size is 1 such that the binomial is sim-
ply a Bernoulli distribution), and the marginal distribution of Yi in a Poisson GLMM is not a Poisson
distribution.
The marginal distribution of Y — and thus the likelihood — is only given as an integral wrt. the random
effects. If we use g generically for densities, let f (yi ; β, φ, b) be the relevant exponential family density
corresponding to ηi = Xi β + Zi b and dispersion parameter φ, and h the density for N (0, Στ ), then
the likelihood is:
L(β, φ, τ) = g(y)
           = ∫ g(y, b) db
           = ∫ g(y | b) g(b) db
           = ∫ Πi f(yi; β, φ, b) h(b; τ) db.
The last equation follows from the conditional independence in assumption (ii’).
The integral is q-dimensional, where q is the dimension of the random effects B, but in many models
the integral reduces to several integrals of lower dimension, often just one-dimensional integrals. Say,
for example, that we just have a random intercept corresponding to a single factor. Then, Yi s from
different groups are independent, and L reduces to a product over groups, and therefore separate one-
dimensional integrals.
Low-dimensional integrals can be approximated with numerical methods, such that (approximate) likelihood inference is possible in these cases. One approximation is the Laplace approximation. Consider first the situation with independent one-dimensional integrals, and recall that f is an exponential family density and h is a Gaussian density. Both densities have an exponential form, so up to proportionality each integral has the form

∫ e−k(x) dx

for a suitably defined function k : R → R. Assume that k is well-behaved in the sense that it has a unique minimum at x̃ with k′(x̃) = 0 and k′′(x̃) > 0. A Taylor expansion around x̃ yields

k(x) ≈ k(x̃) + ½ k′′(x̃)(x − x̃)².
We plug the right-hand side expression into the integral and get

∫ e−k(x) dx ≈ e−k(x̃) ∫ e−½ k′′(x̃)(x−x̃)² dx = e−k(x̃) √(2π / k′′(x̃)),    (8.2)
where the last expression follows from recognizing the unnormalized density of N(x̃, 1/k′′(x̃)). Equation (8.2) is the Laplace approximation of the integral. In order to use it, we only need computation of x̃, k(x̃) and k′′(x̃).
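To make the formula concrete, here is a small hedged numerical sketch in R: the integrand is a binomial likelihood (logit link) times a normal density, i.e., of the same form as a single GLMM likelihood contribution; the values of y, n, alpha and tau are made up for illustration, and the Laplace approximation is compared with numerical integration:

    y <- 17; n <- 50; alpha <- -0.5; tau <- 0.4                  # made-up illustration values
    k <- function(b) -dbinom(y, n, plogis(alpha + b), log = TRUE) -
                      dnorm(b, mean = 0, sd = tau, log = TRUE)   # minus log of the integrand
    opt  <- optimize(k, c(-5, 5))                                # x-tilde minimizes k
    btil <- opt$minimum
    h    <- 1e-4
    k2   <- (k(btil + h) - 2 * k(btil) + k(btil - h)) / h^2      # numerical k''(x-tilde)
    laplace <- exp(-opt$objective) * sqrt(2 * pi / k2)           # the Laplace approximation (8.2)
    exact   <- integrate(function(b) exp(-k(b)), -Inf, Inf)$value
    c(laplace = laplace, exact = exact)                          # the two numbers should be close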
There is also a multivariate version of the Laplace approximation, for k : Rd → R, with second derivative k′′(x̃) which is now a matrix of dimension d × d:

∫ e−k(x) dx ≈ (2π)^{d/2} e−k(x̃) det(k′′(x̃))^{−1/2}.
In the GLMM situation, with the integral being with respect to a (possibly multi-dimensional) component Bj of the random effects, let Ỹj denote the collection of Yi s whose distributions depend on Bj. Then, the function k is, up to an additive constant, minus the log-density of the joint distribution of (Ỹj, Bj), considered as a function of Bj only, i.e., for fixed Ỹj and fixed (β, φ). Up to another additive constant, this is also minus the log-density of the conditional distribution of Bj given Ỹj. The Laplace approximation
uses a quadratic approximation to this log-density, i.e., a Gaussian approximation to the conditional
distribution, and the Gaussian approximation has mean in the mode of the conditional distribution. In
linear mixed models, where the conditional distribution of Ỹj given Bj is Gaussian, the conditional
distribution of Bj given Ỹj is Gaussian, too, and the Laplace approximation to the integral is therefore
identical to the true integral.
The parameters in the Gaussian approximation also have a Bayesian interpretation:3 If β and φ are fixed and Bj is considered the “parameter” of interest, then the mean in the Gaussian approximation is the maximum a posteriori (MAP) estimator for Bj. Furthermore, if the prior for Bj were flat, then the variance in the Laplace approximation would be the inverse observed Fisher information (and the MAP would be identical to the MLE).
For later use, and in the case with independent one-dimensional random effects, let µLP and σ²LP be the mean and variance from the Laplace approximation for the likelihood contribution associated to Bj. Then, the Laplace approximation to the jth term, including the normalization constant, becomes

ILP = g(ỹj | bj = µLP) φN(µLP; 0, τ²) √(2π σ²LP).    (8.3)
Another approximation is the Gauss-Hermite approximation. It utilizes that integrals of the form

∫ r(z) e−z² dz

can be approximated by weighted sums Σ_{i=1}^m wi r(zi), where the knot points z1 < · · · < zm are the roots of the Hermite polynomial of order m, and w1, . . . , wm are associated weights. These knots and weights are deterministic (non-adaptive; they do not depend on data), so they can be computed once and for all (and have been tabulated in the literature). The sum is identical to the integral if r is a polynomial of degree at most 2m − 1.
For m odd, the first knot is z1 = 0 and the subsequent knots come in pairs symmetric around zero (e.g.,
z2 = −z3 ). Moreover, m must be large to get knots far from zero; with m = 11, for example, knots
3 This will be more clear after the Bayesian part of the course.
vary between −3.67 and 3.67. As a consequence, approximations with small m are only appropriate
when the major contributions from r come from z-values around zero. This can be obtained with a
data-dependent change of variables.
To be specific, consider the GLMM case with independent one-dimensional random effects Bj ∼ N(0, τ²) and where the conditional distribution of each Yi given B only depends on one Bj. The likelihood contribution from Ỹj (defined above) is Ij = ∫ g(ỹj | bj) g(bj) dbj. If we leave out index j from the notation and let φN(b; µ, σ²) denote the density of N(µ, σ²) evaluated at b, then it reads

I = ∫ g(ỹ | b) φN(b; 0, τ²) db,
where we have suppressed that the right-hand side also depends on ỹ. Define t(b; τ²) = g(ỹ | b) φN(b; 0, τ²) / φN(b; µLP, σ²LP). Then

I = ∫ t(b; τ²) φN(b; µLP, σ²LP) db = ∫ t(b; τ²) (2πσ²LP)^{−1/2} exp(−(b − µLP)² / (2σ²LP)) db,

and the change of variable z = (b − µLP)/(√2 σLP) gives

I = (1/√π) ∫ t(µLP + z√2 σLP; τ²) e−z² dz.
The point is that the new integration variable z, by construction, “lives around zero”, because µLP is the mode of the conditional distribution of Bj given Ỹj. Therefore, the deterministic Gauss-Hermite approximation is now appropriate even for smaller values of m. That is,

I ≈ (1/√π) Σ_{i=1}^m wi t(µLP + zi√2 σLP; τ²).
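A hedged numerical sketch of this adaptive Gauss-Hermite rule for the same kind of one-dimensional likelihood contribution (made-up values of y, n, alpha and tau; the knots and weights are obtained from the statmod package):

    y <- 17; n <- 50; alpha <- -0.5; tau <- 0.4                          # made-up illustration values
    joint <- function(b) dbinom(y, n, plogis(alpha + b)) * dnorm(b, 0, tau)
    k     <- function(b) -log(joint(b))
    muLP  <- optimize(k, c(-5, 5))$minimum                               # Laplace mean (mode)
    h     <- 1e-4
    sigLP <- 1 / sqrt((k(muLP + h) - 2 * k(muLP) + k(muLP - h)) / h^2)   # Laplace sd
    tfun  <- function(b) joint(b) / dnorm(b, muLP, sigLP)                # t(b; tau^2)
    gh    <- statmod::gauss.quad(5, kind = "hermite")                    # m = 5 knots and weights
    sum(gh$weights * tfun(muLP + sqrt(2) * sigLP * gh$nodes)) / sqrt(pi)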
The one-dimensional approximation does not apply directly to models with correlated random effects (such as dependent intercept and slope) or non-nested random factors, where the likelihood does not split into one-dimensional integrals.
It is recommended to check the sensitivity of results to the number of quadrature points, when possible.
The typical pattern is that estimates and SEs change a little, but not dramatically, for smaller values
of m but quickly stabilize, and it is fairly common to use Laplace approximation or up to five, say,
quadrature points.
Both Laplace and Gauss-Hermite approximations are implemented in the R function glmer, where the
argument nAGQ sets the number of quadrature points. Laplace approximation is the default. Gauss-
Hermite is only implemented for a single random effect, so the estimates reported in Examples 8.1
and 8.2 are based on the Laplace approximation. The glmmTMB function uses the Laplace approxi-
mation. It has a larger variety of possible response distributions than glmer, including zero-inflated
distributions.
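As a small illustration of the nAGQ argument (a hedged sketch using the cbpp data that ships with lme4, not one of the course datasets; the model has a single random factor, as required for nAGQ > 1):

    library(lme4)
    fit1  <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
                   family = binomial, data = cbpp)        # default nAGQ = 1, i.e., Laplace
    fit10 <- update(fit1, nAGQ = 10)                      # adaptive Gauss-Hermite with 10 points
    cbind(laplace = fixef(fit1), agq10 = fixef(fit10))    # compare the fixed-effects estimates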
The chosen likelihood approximation is used for statistical inference as if it was the true likelihood: The
estimated variance matrix for β̂ is computed as the inverse of the second order derivative of minus the (approximate) log-likelihood, and SEs for each coordinate are computed as the square roots of the diagonal
elements. Confidence intervals are constructed as Wald CIs, profile CIs or by bootstrap. Hypothesis
tests are carried out as likelihood ratio or Wald tests with χ² approximation or with simulations for computation of p-values. Predictions of the random effects are computed as the modes of the conditional distribution of B given Y = yobs. This distribution is typically not known explicitly, even for known
parameter values, so numerical methods are necessary. The paper by Thiele and Markussen (2012)
contains good advice for practitioners.
Model validation for GLMMs is inherently difficult. One possibility is to rely on simulations, where appropriate validation plots and/or numerical quantities are generated/computed from the observed data as well as from data simulated from the model. Repeated simulations give an impression of
the dataset-to-dataset variation of these graphical/numerical summaries if the model is correct, and sys-
tematic deviations between the summaries for simulated and observed data are interpreted as evidence
that the model is misspecified in one or more respects.
Finally, a general warning: Numerics in GLMMs can be difficult, in particular when more than one random effect is included, and you must therefore be (extra) careful and take a critical look at the model fit, i.e., the estimates and standard errors. Sometimes it is necessary to be pragmatic and use a
simpler model than originally suggested in order to get robust and plausible results. Bayesian analysis
is also a possibility, for example with the Stan software and the package brms as we shall see in the
last part of the course.
9 Bayesian analysis of mixed models
We have considered mixed models from a frequentistic point of view in these notes, but one could also
tackle them from a Bayesian angle — and we will indeed do that in the final part of the course. At this
point, we just give some superficial comments.
From a modeling point of view, the Bayesian approach differs from the frequentistic one by assuming
that the parameters (β, τ , σ 2 for the LMM and β and τ for the GLMM) are random variables rather
than fixed (but unknown) values. Prior distributions for the parameters must be provided in order to
define a joint model for parameters, random effects, and observations.
Mixed models are usually called hierarchical models in the Bayesian framework. This is because there is a ”hierarchy” in the model assumptions: the prior distributions describe the marginal distribution of the parameters; the assumption B ∼ Nq(0, Σ) describes the conditional distribution of the latent random effects given the parameters, some of which enter into Σ; and Y | B = b ∼ Nn(Xβ + Zb, σ²In), or the GLMM pendant, describes the conditional distribution of Y given both random effects and parameters.
The end product of a Bayesian analysis is the posterior distribution, i.e., the conditional distribution of
the parameters given the data. Conclusions are based on properties of the posterior distribution. There
are no closed-form expressions for the posterior, except for very simple models — and certainly not
for the mixed models from this note. Instead, the analysis produces a multivariate Markov chain, with
a component for each unknown quantity in the model, which has the joint posterior distribution as its
invariant distribution. Notice that the Markov chain has components for the parameters as well as the
random effects B.
The state-of-the-art software for Bayesian data analysis in general is Stan (Stan Development Team, 2022b). There is an R interface to Stan implemented in the package RStan (Stan Development Team, 2022a), and furthermore a package called brms (Bürkner, 2021). The latter package includes a function brm which can be used for analyses with mixed models (and also models without random effects).
Conveniently, the syntax is just the same as for lmer and glmer, and the function has default choices
for prior distributions, so it is easy to use (but be careful, anyway, of course!). The output also looks the
same in many respects, but recall that interpretation of results from a Bayesian analysis is in general
different from the interpretation of results from a frequentistic analysis. The brm function is very flexible when it comes to error distributions, so in that sense the Bayesian approach in practice applies to a wider spectrum of mixed models than the likelihood approach.
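As a hedged sketch (using the sleepstudy data from lme4 rather than one of the course datasets), a Bayesian version of a random intercept and slope model could be fitted like this:

    library(brms)
    data("sleepstudy", package = "lme4")
    fit <- brm(Reaction ~ Days + (Days | Subject), data = sleepstudy, family = gaussian())
    summary(fit)   # posterior summaries for the fixed effects and the SD/correlation parameters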
Bibliography
Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015). Fitting linear mixed-effects models using
lme4. Journal of Statistical Software, 67:1–48.
Bibby, B. M., Martinussen, T., and Skovgaard, I. M. (2006). Experimental Design in the Agricultural
Sciences. Lecture notes, The Royal Veterinary and Agricultural University.
Brooks, M. E., Kristensen, K., van Benthem, K. J., Magnusson, A., Berg, C. W., Nielsen, A., Skaug,
H. J., Maechler, M., and Bolker, B. M. (2017). glmmTMB balances speed and flexibility among
packages for zero-inflated generalized linear mixed modeling. The R Journal, 9(2):378–400.
Bürkner, P.-C. (2021). Bayesian item response modeling in R with brms and Stan. Journal of Statistical
Software, 100(5):1–54.
Fitzmaurice, G., Davidian, M., Verbeke, G., and Mohlenberghs, G. (2013). Longitudinal Data Analy-
sis. CRC Press.
Galecki, A. and Burzykowski, T. (2013). Linear Mixed-Effects Models Using R. Springer, New York.
Greven, S., Crainiceanu, C. M., Küchenhoff, H., and Peters, A. (2008). Restricted likelihood ratio test-
ing for zero variance components in linear mixed models. Journal of Computational and Graphical
Statistics, 17:870–891.
Hansen, E. (2012). Introduktion til Matematisk Statistik. Lecture notes, Department of Mathematical
Sciences, University of Copenhagen, third edition.
Hansen, N. R. and Tolver, A. (2023). The Mathematics Behind ModernDive. Lecture notes, Department
of Mathematical Sciences, University of Copenhagen (version from January 2023).
Jiang, J. and Nguyen, T. (2021). Linear and Generalized Linear Mixed Models and Their Applications.
Springer, New York, second edition.
Kuznetsova, A., Brockhoff, P. B., and Christensen, R. H. B. (2017). lmerTest package: Tests in linear
mixed effects models. Journal of Statistical Software, 82(13):1–26.
Lauritzen, S. (2023). Fundamentals of Mathematical Statistics. Chapman & Hall / CRC Press.
Markussen, B. (2022). LabApplStat: R package with miscellaneous scripts developed at the Data
Science Laboratory, University of Copenhagen. R package version 1.4.4.
Mohlenberghs, G. and Verbeke, G. (2005). Models for Discrete Longitudinal Data. Springer, New
York.
Pinheiro, J., Bates, D., and R Core Team (2022). nlme: Linear and Nonlinear Mixed Effects Models.
R package version 3.1-157.
Pinheiro, J. C. and Bates, D. M. (2000). Mixed-Effects Models in S and S-PLUS. Springer, New York.
Snijders, T. A. B. and Bosker, R. J. (2012). Multilevel Analysis. SAGE, Thousand Oaks, California, second edition.
Stan Development Team (2022a). RStan: the R interface to Stan. R package version 2.21.5.
Stan Development Team (2022b). Stan modeling language users guide and reference manual. version
2.30.
Stram, D. O. and Lee, J. W. (1994). Variance components testing in the longitudinal mixed effects model. Biometrics, 50:1171–1177.
Sørensen, H. (2023). Conditional distributions (with densities). Lecture notes, University of Copen-
hagen.
Thiele, J. and Markussen, B. (2012). Potential of glmm in modelling invasive spread. CAB Reviews,
7(016):1–10.
Tjur, T. (1984). Analysis of variance models in orthogonal designs. International Statistical Review, 52:33–65.
West, B. T., Welch, K. B., and Galecki, A. T. (2015). Linear Mixed Models: A Practical Guide Using
Statistical Software. CRC Press, second edition.
Wood, S. (2017). Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC, second edition.
Wood, S. N. (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation
of semiparametric generalized linear models. Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 73(1):3–36.