Privacy-Preserving and Lossless Distributed Estimation of High-Dimensional Generalized Additive Mixed Models
https://doi.org/10.1007/s11222-023-10323-2
ORIGINAL PAPER
Received: 7 March 2023 / Accepted: 2 October 2023 / Published online: 7 November 2023
© The Author(s) 2023
Abstract
Various privacy-preserving frameworks that respect the individual's privacy in the analysis of data have been developed in recent years. However, available model classes such as simple statistics or generalized linear models lack the flexibility required for a good approximation of the underlying data-generating process in practice. In this paper, we propose an algorithm for a distributed, privacy-preserving, and lossless estimation of generalized additive mixed models (GAMM) using component-wise gradient boosting (CWB). Making use of CWB allows us to reframe the GAMM estimation as a distributed fitting of base learners using the $L_2$-loss. In order to account for the heterogeneity of different data location sites, we propose a distributed version of a row-wise tensor product that allows the computation of site-specific (smooth) effects. Our adaptation of CWB preserves all the important properties of the original algorithm, such as an unbiased feature selection and the feasibility to fit models in high-dimensional feature spaces, and yields model estimates equivalent to those of CWB on pooled data. In addition to a derivation of the equivalence of both algorithms, we also showcase the efficacy of our algorithm on a distributed heart disease data set and compare it with state-of-the-art methods.

Keywords Distributed computing · Functional gradient descent · Generalized linear mixed model · Machine learning · Privacy-preserving modelling
distributed data. One example is the collection of R (R Core Team 2021) packages DataSHIELD (Gaye et al. 2014), which enables data management and descriptive data analysis as well as securely fitting simple statistical models in a distributed setup without leaking information from one site to the others.

Interpretability and data heterogeneity In many research areas that involve critical decision-making, especially in medicine, methods should not only excel in predictive performance but also be interpretable. Models should provide information about the decision-making process, the feature effects, and the feature importance, as well as intrinsically select important features. Generalized additive models (GAMs; see, e.g., Wood 2017) are one of the most flexible approaches in this respect, providing interpretable yet complex models that also allow for non-linearity in the data. As longitudinal studies are often the most practical way to gather information in many research fields, methods should also be able to account for subject-specific effects and for the correlation of repeated measurements. Furthermore, when analyzing data originating from different sites, the assumption of having identically distributed observations across all sites often does not hold. In this case, a reasonable assumption for the data-generating process is a site-specific deviation from the general population mean. Adjusting models to this situation is called interoperability (Litwin et al. 1990), while ignoring it may lead to biased or wrong predictions.

1.1 Related literature

Various approaches for distributed and privacy-preserving analysis have been proposed in recent years. In the context of statistical models, Karr et al. (2005) describe how to calculate a linear model (LM) in a distributed and privacy-preserving fashion by sharing data summaries. Jones et al. (2013) propose a similar approach for GLMs by communicating the Fisher information and score vector to conduct a distributed Fisher scoring algorithm. The site information is then globally aggregated to estimate the model parameters. Other privacy-preserving techniques include ridge regression (Chen et al. 2018), logistic regression, and neural networks (Mohassel and Zhang 2017).

In machine learning, methods such as the naive Bayes classifier, trees, support vector machines, and random forests (Li et al. 2020a) exist with specific encryption techniques (e.g., the Paillier cryptosystem; Paillier 1999) to conduct model updates. In these setups, a trusted third party is usually required. However, this is often unrealistic and difficult to implement, especially in a medical or clinical setup. Furthermore, as encryption is an expensive operation, its application is infeasible for complex algorithms that require many encryption calls (Naehrig et al. 2011). Existing privacy-preserving boosting techniques often focus on the AdaBoost algorithm by using aggregation techniques of the base classifier (Lazarevic and Obradovic 2001; Gambs et al. 2007). A different approach to boosting decision trees in a federated learning setup was introduced by Li et al. (2020b) using locality-sensitive hashing to obtain similarities between data sets without sharing private information. These algorithms focus on aggregating tree-based base components, making them difficult to interpret, and come with no inferential guarantees.

In order to account for repeated measurements, Luo et al. (2022) propose a privacy-preserving and lossless way to fit linear mixed models (LMMs) to correct for heterogeneous site-specific random effects. Their concept of only sharing aggregated values is similar to our approach, but it is limited in the complexity of the model and only allows normally distributed outcomes. Other methods to estimate LMMs in a secure and distributed fashion are Zhu et al. (2020), Anjum et al. (2022), or Yan et al. (2022).

Besides privacy-preserving and distributed approaches, integrative analysis is another technique, based on pooling the data sets into one and analyzing this pooled data set while considering challenges such as heterogeneity or the curse of dimensionality (Curran and Hussong 2009; Bazeley 2012; Mirza et al. 2019). While advanced from a technical perspective by, e.g., outsourcing computationally demanding tasks such as the analysis of multi-omics data to cloud services (Augustyn et al. 2021), the existing statistical cloud-based methods only deal with basic statistics. The challenges of integrative analysis are similar to the ones tackled in this work; our approach, however, does not allow merging the data sets, in order to preserve privacy.

1.2 Our contribution

This work presents a method to fit generalized additive mixed models (GAMMs) in a privacy-preserving and lossless manner¹ to horizontally distributed data. This not only allows the incorporation of site-specific random effects and accounts for repeated measurements in LMMs, but also facilitates the estimation of mixed models with responses following any distribution from the exponential family and provides the possibility to estimate complex non-linear relationships between covariates and the response. To the best of our knowledge, we are the first to provide an algorithm to fit the class of GAMMs in a privacy-preserving and lossless fashion on distributed data.

¹ In this article, we define a distributed fitting procedure as lossless if the model parameters of the algorithm are the same as the ones computed on the pooled data.

Our approach is based on component-wise gradient boosting (CWB; Bühlmann and Yu 2003). CWB can be used
to estimate additive models, account for repeated measurements, compute feature importance, and conduct feature selection. Furthermore, CWB is suited for high-dimensional data situations ($n \ll p$). CWB is therefore often used in practice for, e.g., predicting the development of oral cancer (Saintigny et al. 2011), classifying individuals with and without patellofemoral pain syndrome (Liew et al. 2020), or detecting synchronization in bioelectrical signals (Rügamer et al. 2018). However, there have so far not been any attempts to allow for a distributed, privacy-preserving, and lossless computation of the CWB algorithm. In this paper, we propose a distributed version of CWB that yields the identical model produced by the original algorithm on pooled data and that accounts for site heterogeneity by including interactions between features and a site variable. This is achieved by adjusting the fitting process using (1) a distributed estimation procedure, (2) a distributed version of row-wise tensor product base learners, and (3) an adaptation of the algorithm to conduct feature selection in the distributed setup. Figure 1 sketches the proposed distributed estimation procedure.

Fig. 1 Method overview of the proposed distributed CWB approach with one main CWB model maintained by a host (center) and distributed computations on different sites (three as an example in this case) that incorporate and provide site-specific information while preserving privacy

We implement our method in R using the DataSHIELD framework² and demonstrate its application in an exemplary medical data analysis.³ Our distributed version of the original CWB algorithm does not have any additional hyperparameters (HPs) and uses optimization strategies from previous research results to define meaningful values for all HPs, effectively yielding a tuning-free method.

The remainder of this paper is structured as follows: First, we introduce the basic notation, terminology, and setup of GAMMs in Sect. 2. We then describe the original CWB algorithm.

² github.com/schalkdaniel/dsCWB.
³ github.com/schalkdaniel/dsCWB/blob/main/usecase/analyse.R.

2 Background

Our proposed approach uses the CWB algorithm as fitting engine. Since this method was initially developed in machine learning, we introduce here both the statistical notation used for GAMMs as well as the respective machine learning terminology and explain how to relate the two concepts.

We assume a $p$-dimensional covariate or feature space $\mathcal{X} = (\mathcal{X}_1 \times \cdots \times \mathcal{X}_p) \subseteq \mathbb{R}^p$ and response or outcome values from a target space $\mathcal{Y}$. The goal of boosting is to find the unknown relationship $f$ between $\mathcal{X}$ and $\mathcal{Y}$. In turn, GAMMs (as presented in Sect. 2.2) model the conditional distribution of an outcome variable $Y$ with realizations $y \in \mathcal{Y}$, given features $x = (x_1, \ldots, x_p) \in \mathcal{X}$. Given a data set $\mathcal{D} = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$ with $n$ observations drawn (conditionally) independently from an unknown probability distribution $\mathbb{P}_{xy}$ on the joint space $\mathcal{X} \times \mathcal{Y}$, we aim to estimate this functional relationship in CWB with $\hat{f}$. The goodness-of-fit of a given model $\hat{f}$ is assessed by calculating the empirical risk $\mathcal{R}_{\mathrm{emp}}(\hat{f}) = n^{-1} \sum_{(x,y) \in \mathcal{D}} L(y, \hat{f}(x))$ based on a loss function $L : \mathcal{Y} \times \mathbb{R} \to \mathbb{R}$ and the data set $\mathcal{D}$. Minimizing $\mathcal{R}_{\mathrm{emp}}$ using this loss function is equivalent to estimating $f$ using maximum likelihood by defining $L(y, f(x)) = -\ell(y, h(f(x)))$ with log-likelihood $\ell$, response function $h$, and minimizing the sum of log-likelihood contributions.

In the following, we also require the vector $x_j = (x_j^{(1)}, \ldots, x_j^{(n)})^\top \in \mathcal{X}_j$, which refers to the $j$th feature. Furthermore, let $x = (x_1, \ldots, x_p)$ and $y$ denote arbitrary members of $\mathcal{X}$ and $\mathcal{Y}$, respectively. A special role is further given to a subset $u = (u_1, \ldots, u_q)$, $q \leq p$, of features $x$, which will be used to model the heterogeneity in the data.
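To connect the two views, the following is a minimal R sketch (ours, not from the paper; all names and toy data are illustrative) that spells out $L(y, f(x)) = -\ell(y, h(f(x)))$, the empirical risk, and the negative gradient for a binary outcome with the inverse-logit response function:

# Minimal R sketch (illustrative): Bernoulli loss as negative log-likelihood with
# response function h, the empirical risk R_emp, and the negative gradient of the
# loss in f, which later serves as the pseudo residual in CWB.
h <- function(f) 1 / (1 + exp(-f))                                  # response function (inverse logit)
loss <- function(y, f) -(y * log(h(f)) + (1 - y) * log(1 - h(f)))   # L(y, f) = -loglik(y, h(f))
risk_emp <- function(y, f) mean(loss(y, f))                         # empirical risk
neg_gradient <- function(y, f) y - h(f)                             # -dL/df for this loss

set.seed(1)
f <- rnorm(10); y <- rbinom(10, 1, h(f))
risk_emp(y, f)
neg_gradient(y, f)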
$(g_{l,1}(x), \ldots, g_{l,d_l+1}(x))^\top = (1, x_{j_1}, \ldots, x_{j_{d_l}})^\top$. Linear base learners can be regularized by incorporating a ridge penalization (Hoerl and Kennard 1970) with tunable penalty parameter $\lambda_l$ as an HP $\alpha_l$. Fitting a ridge penalized linear base learner to a response vector $y \in \mathbb{R}^n$ results in the penalized least squares estimator $\hat{\theta}_l = (Z_l^\top Z_l + P_l)^{-1} Z_l^\top y$ with penalty matrix $P_l = \lambda_l K_l$, $K_l = I_{d_l+1}$, where $I_d$ denotes the $d$-dimensional identity matrix. Often, an unregularized linear base learner is also included to model the contribution of one feature $x_j$ as a linear base learner without penalization. The basis transformation is then given by $g_l(x) = (1, x_j)^\top$ and $\lambda_l = 0$.

Spline base learners These base learners model smooth effects using univariate splines. A common choice is penalized B-splines (P-splines; Eilers and Marx 1996), where the feature $x_j$ is transformed using a B-spline basis transformation $g_l(x) = (B_{l,1}(x_j), \ldots, B_{l,d_l}(x_j))^\top$ with $d_l$ basis functions $g_{l,m} = B_{l,m}$, $m = 1, \ldots, d_l$. In this case, the choice of the spline order $B$, the number of basis functions $d_l$, the penalization term $\lambda_l$, and the order $v$ of the difference penalty (represented by a matrix $D_l \in \mathbb{R}^{(d_l - v) \times d_l}$) are considered HPs $\alpha_l$ of the base learner. The base learner's parameter estimator in general is given by the penalized least squares solution $\hat{\theta}_l = (Z_l^\top Z_l + P_l)^{-1} Z_l^\top y$, with penalization matrix $P_l = \lambda_l K_l$ and $K_l = D_l^\top D_l$ in the case of P-splines.

Categorical and random effect base learners Categorical features $x_j \in \{1, \ldots, G\}$ with $G \in \mathbb{N}$, $G \geq 2$ classes are handled by a binary encoding $g_l(x) = (\mathbb{1}_{\{1\}}(x_j), \ldots, \mathbb{1}_{\{G\}}(x_j))^\top$ with the indicator function $\mathbb{1}_A(x) = 1$ if $x \in A$ and $\mathbb{1}_A(x) = 0$ if $x \notin A$. A possible alternative encoding is the dummy encoding $\breve{g}_l(x) = (1, \mathbb{1}_{\{1\}}(x_j), \ldots, \mathbb{1}_{\{G-1\}}(x_j))^\top$ with reference group $G$. Similar to linear and spline base learners, it is possible to incorporate a ridge penalization with HP $\alpha_l = \lambda_l$. This results in the base learner's penalized least squares estimator $\hat{\theta}_l = (Z_l^\top Z_l + P_l)^{-1} Z_l^\top y$ with penalization matrix $P_l = \lambda_l K_l$, $K_l = I_G$. Due to the mathematical equivalence of ridge penalized linear effects and random effects with normal prior (see, e.g., Brumback et al. 1999), this base learner can further be used to estimate random effect predictions $\hat{\gamma}_j$ when using categorical features $u_j$ and thereby account for heterogeneity in the data. Hence, this base learner can also be used to model site-specific effects in a distributed system, as outlined later. While such random effects do not directly provide a variance estimate of the different measurement units and are primarily used to account for intra-class correlation, an approximation of the variance components can be retrieved post-model fitting by, e.g., computing the empirical variance of $(\hat{\gamma}_j)_{j \in J_3}$.

Row-wise tensor product base learners This type of base learner is used to model a pairwise interaction between two features $x_j$ and $x_k$. Given two base learners $b_j$ and $b_k$ with basis representations $g_j(x) = (g_{j,1}(x_j), \ldots, g_{j,d_j}(x_j))^\top$ and $g_k(x) = (g_{k,1}(x_k), \ldots, g_{k,d_k}(x_k))^\top$, the basis representation of the row-wise tensor product base learner $b_l = b_j \times b_k$ is defined as $g_l(x) = (g_j(x)^\top \otimes g_k(x)^\top)^\top = (g_{j,1}(x_j)\, g_k(x)^\top, \ldots, g_{j,d_j}(x_j)\, g_k(x)^\top)^\top \in \mathbb{R}^{d_l}$ with $d_l = d_j d_k$. The HPs $\alpha_l = \{\alpha_j, \alpha_k\}$ of a row-wise tensor product base learner are induced by the HPs $\alpha_j$ and $\alpha_k$ of the respective individual base learners. Analogously to other base learners, the penalized least squares estimator in this case is $\hat{\theta}_l = (Z_l^\top Z_l + P_l)^{-1} Z_l^\top y$ with penalization matrix $P_l = \tau_j K_j \otimes I_{d_k} + I_{d_j} \otimes \tau_k K_k \in \mathbb{R}^{d_l \times d_l}$. This Kronecker sum penalty, in particular, allows for anisotropic smoothing with penalties $\tau_j$ and $\tau_k$ when using two spline bases for $g_j$ and $g_k$, and varying coefficients or random splines when combining a (penalized) categorical base learner and a spline base learner.

2.3.2 Fitting algorithm

CWB first initializes an estimate $\hat{f}$ of the additive predictor with a loss-optimal constant value $\hat{f}^{[0]} = \operatorname{arg\,min}_{c \in \mathbb{R}} \mathcal{R}_{\mathrm{emp}}(c)$. It then proceeds and estimates Eq. (1) using an iterative steepest descent minimization in function space by fitting the previously defined base learners to the model's functional gradient $\nabla_f L(y, f)$ evaluated at the current model estimate $\hat{f}$. Let $\hat{f}^{[m]}$ denote the model estimate after $m \in \mathbb{N}$ iterations. In each step in CWB, the pseudo residuals $\tilde{r}^{[m](i)} = -\nabla_f L(y^{(i)}, f(x^{(i)}))|_{f = \hat{f}^{[m-1]}}$ for $i \in \{1, \ldots, n\}$ are first computed. CWB then selects the best-fitting base learner from a pre-defined pool of base learners denoted by $\mathcal{B} = \{b_l\}_{l \in \{1, \ldots, |\mathcal{B}|\}}$ and adds the base learner's contribution to the previous model $\hat{f}^{[m-1]}$. The selected base learner is chosen based on its sum of squared errors (SSE) when regressing the pseudo residuals $\tilde{r}^{[m]} = (\tilde{r}^{[m](1)}, \ldots, \tilde{r}^{[m](n)})^\top$ onto the base learner's features using the $L_2$-loss. Further details of CWB are given in Algorithm 1 (see, e.g., Schalk et al. 2023).

Controlling HPs of CWB Good estimation performance can be achieved by selecting a sufficiently small learning rate, e.g., 0.01, as suggested in Bühlmann et al. (2007), and adaptively selecting the number of boosting iterations via early stopping on a validation set. To enforce a fair selection of model terms and thus unbiased effect estimation, regularization parameters are set such that all base learners have the same degrees-of-freedom (Hofner et al. 2011). As noted by Bühlmann et al. (2007), choosing smaller degrees-of-freedom induces more penalization (and thus, e.g., smoother estimated functions for spline base learners), which yields a model with lower variance at the cost of a larger bias. This bias induces a shrinkage in the estimated coefficients towards zero but can be reduced by running the optimization process for additional iterations.
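To make the degrees-of-freedom-based penalization tangible, the following is a minimal R sketch (ours, not the paper's implementation; toy data and a B-spline design as a stand-in for the P-spline basis) that fits a spline base learner by penalized least squares and calibrates $\lambda$ so that the base learner has a prescribed number of degrees-of-freedom:

# Minimal R sketch (illustrative): penalized least squares fit of a spline base learner
# and calibration of lambda such that df(lambda) = trace(Z (Z'Z + lambda K)^{-1} Z')
# matches a target value, as used for a fair base-learner selection (Hofner et al. 2011).
library(splines)
set.seed(2)
n <- 150; x <- runif(n); y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

Z <- bs(x, df = 10, intercept = TRUE)       # B-spline design with d_l = 10 basis functions
D <- diff(diag(ncol(Z)), differences = 2)   # second-order difference matrix D_l
K <- crossprod(D)                           # K_l = D_l' D_l

df_of_lambda <- function(lambda)            # effective degrees-of-freedom of the base learner
  sum(diag(Z %*% solve(crossprod(Z) + lambda * K, t(Z))))

# find lambda that yields (approximately) 4 degrees-of-freedom
lambda <- uniroot(function(l) df_of_lambda(l) - 4, c(1e-6, 1e6))$root

theta_hat <- solve(crossprod(Z) + lambda * K, crossprod(Z, y))   # (Z'Z + P_l)^{-1} Z'y
c(lambda = lambda, df = df_of_lambda(lambda))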
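The following minimal R sketch (ours; toy data and two plain linear base learners, not the paper's implementation) mirrors the loop that Algorithm 1 below formalizes, using the $L_2$-loss so that the pseudo residuals are the ordinary residuals:

# Minimal R sketch (illustrative) of the component-wise boosting loop: fit every base
# learner to the pseudo residuals, select the one with the smallest SSE, and update the
# model with learning rate nu.
set.seed(3)
n <- 200
X <- data.frame(x1 = runif(n), x2 = runif(n))
y <- 2 * X$x1 + sin(2 * pi * X$x2) + rnorm(n, sd = 0.2)

Zs <- list(cbind(1, X$x1), cbind(1, X$x2))   # design matrices of two linear base learners
nu <- 0.1; M <- 100
f_hat <- rep(mean(y), n)                     # loss-optimal constant for the L2 loss
for (m in seq_len(M)) {
  r <- y - f_hat                             # pseudo residuals for the L2 loss
  fits <- lapply(Zs, function(Z) {
    theta <- solve(crossprod(Z), crossprod(Z, r))   # least squares fit to the residuals
    pred <- as.numeric(Z %*% theta)
    list(pred = pred, sse = sum((r - pred)^2))
  })
  best <- which.min(vapply(fits, function(fit) fit$sse, numeric(1)))
  f_hat <- f_hat + nu * fits[[best]]$pred    # update with the selected component
}
mean((y - f_hat)^2)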
Algorithm 1 Vanilla CWB algorithm
Input Train data $\mathcal{D}$, learning rate $\nu$, number of boosting iterations $M$, loss function $L$, set of base learners $\mathcal{B}$
Output Model $\hat{f}^{[M]}$ defined by fitted parameters $\hat{\theta}^{[1]}, \ldots, \hat{\theta}^{[M]}$
1: procedure CWB($\mathcal{D}$, $\nu$, $L$, $\mathcal{B}$)
2:   Initialize: $\hat{f}^{[0]}(x) = \operatorname{arg\,min}_{c \in \mathbb{R}} \mathcal{R}_{\mathrm{emp}}(c)$
3:   for $m \in \{1, \ldots, M\}$ do
4:     $\tilde{r}^{[m](i)} = -\nabla_f L(y^{(i)}, f(x^{(i)}))|_{f = \hat{f}^{[m-1]}}, \ \forall i \in \{1, \ldots, n\}$
5:     for $l \in \{1, \ldots, |\mathcal{B}|\}$ do
6:       $\hat{\theta}_l^{[m]} = (Z_l^\top Z_l + P_l)^{-1} Z_l^\top \tilde{r}^{[m]}$
7:       $\mathrm{SSE}_l = \sum_{i=1}^n (\tilde{r}^{[m](i)} - b_l(x^{(i)}, \hat{\theta}_l^{[m]}))^2$
8:     end for
9:     $l^{[m]} = \operatorname{arg\,min}_{l \in \{1, \ldots, |\mathcal{B}|\}} \mathrm{SSE}_l$
10:    $\hat{f}^{[m]}(x) = \hat{f}^{[m-1]}(x) + \nu\, b_{l^{[m]}}(x, \hat{\theta}_{l^{[m]}}^{[m]})$
11:   end for
12:   return $\hat{f} = \hat{f}^{[M]}$
13: end procedure

2.3.3 Properties and link to generalized additive mixed models

The estimated coefficients $\hat{\theta}$ resulting from running the CWB algorithm are known to converge to the maximum likelihood solution (see, e.g., Schmid and Hothorn 2008) for $M \to \infty$ under certain conditions. This is due to the fact that CWB performs a coordinate gradient descent update of a model defined by its additive base learners, which exactly represent the structure of an additive mixed model (when defining the base learners according to Sect. 2.3.1), and by the objective function, which corresponds to the negative (penalized) log-likelihood. Two important properties of this algorithm are (1) its coordinate-wise update routine, and (2) the nature of model updates using the $L_2$-loss. Due to the first property, CWB can be used in settings with $p \gg n$, as only a single additive term is fitted onto the pseudo-residuals in every iteration. This not only reduces the computational complexity of the algorithm for an increasing number of additive predictors (linear instead of quadratic) but also allows variable selection when stopping the routine early (e.g., based on a validation data set), as not all the additive components might have been selected into the model. In particular, this allows users to specify the full GAMM model without manual specification of the type of feature effect (fixed or random, linear or non-linear) and then automatically sparsify this model by an objective and data-driven feature selection. The second property allows fitting models of the class of generalized linear/additive (mixed) models using only the $L_2$-loss instead of having to work with some iteratively weighted least squares routine. In particular, this allows performing the proposed lossless distributed computations described in this paper, as we will discuss in Sect. 3.

2.4 Distributed computing setup and privacy protection

Before presenting our main results, we now introduce the distributed data setup we will work with throughout the remainder of this paper. The data set $\mathcal{D}$ is horizontally partitioned into $S$ data sets $\mathcal{D}_s = \{(x_s^{(1)}, y_s^{(1)}), \ldots, (x_s^{(n_s)}, y_s^{(n_s)})\}$, $s = 1, \ldots, S$, with $n_s$ observations. Each data set $\mathcal{D}_s$ is located at a different site $s$ and potentially follows a different data distribution $\mathbb{P}_{xy,s}$. The union of all data sets yields the whole data set $\mathcal{D} = \cup_{s=1}^S \mathcal{D}_s$ with mutually exclusive data sets $\mathcal{D}_s \cap \mathcal{D}_l = \emptyset$ $\forall l, s \in \{1, \ldots, S\}$, $l \neq s$. The vector of response realizations per site is denoted by $y_s \in \mathcal{Y}^{n_s}$.

In this distributed setup, multiple ways exist to communicate information without revealing individual information. More complex methods such as differential privacy (Dwork 2006), homomorphic encryption (e.g., the Paillier cryptosystem; Paillier 1999), or k-anonymity (Samarati and Sweeney 1998; Sweeney 2002) allow sharing information without violating an individual's privacy. An alternative option is to only communicate aggregated statistics. This is one of the most common approaches and is also used by DataSHIELD (Gaye et al. 2014) for GLMs or by Luo et al. (2022) for LMMs. DataSHIELD, for example, uses a privacy level that indicates how many individual values must be aggregated to allow the communication of aggregated values. For example, setting the privacy level to a value of 5 enables sharing of summary statistics such as sums, means, variances, etc., if these are computed on at least 5 elements (observations).

Host and site setup Throughout this article, we assume the $1, \ldots, S$ sites or servers to have access to their respective data set $\mathcal{D}_s$. Each server is allowed to communicate with a host server that is also the analyst's machine. In this setting, the analyst can potentially see intermediate data used when running the algorithms, and hence each message communicated from the servers to the host must not allow any reconstruction of the original data. The host server is responsible for aggregating intermediate results and communicating these results back to the servers.

3 Distributed component-wise boosting

We now present our distributed version of the CWB algorithm to fit privacy-preserving and lossless GAMMs. In the following, we first describe further specifications of our setup in Sect. 3.1, elaborate on the changes made to the set of base learners in Sect. 3.2, and then show how to adapt CWB's fitting routine in Sect. 3.3.
3.1 Setup

In the following, we distinguish between site-specific and shared effects. As effects estimated across sites typically correspond to fixed effects and effects modeled for each site separately are usually represented using random effects, we use the terms as synonyms in the following, i.e., shared effects and fixed effects are treated interchangeably, and the same holds for site-specific effects and random effects. We note that this is only for ease of presentation, and our approach also allows for site-specific fixed effects and random shared effects. As the data is not only located at different sites but also potentially follows different data distributions $\mathbb{P}_{xy,s}$ at each site $s$, we extend Eq. (1) to not only include random effects per site, but also site-specific smooth (random) effects $\phi_{j,s}(x_j)$, $s = 1, \ldots, S$, for all features $x_j$ with $j \in J_3$. For every one of these smooth effects $\phi_{j,s}$ we assume an existing shared effect $f_{j,\mathrm{shared}}$ that is equal for all sites. These assumptions—particularly the choice of site-specific effects—are made for demonstration purposes. In a real-world application, the model structure can be defined individually to match the given data situation. However, note again that CWB intrinsically performs variable selection, and there is thus no need to manually define the model structure in practice. In order to incorporate the site information into the model, we add a variable $x_0^{(i)} \in \{1, \ldots, S\}$ for the site to the data by setting $\tilde{x}^{(i)} = (x_0^{(i)}, x^{(i)})$. The site variable is a categorical feature with $S$ classes.

3.2 Base learners

For shared effects, we keep the original structure of CWB with base learners chosen from a set of possible learners $\mathcal{B}$. Section 3.3.1 explains how these shared effects are estimated in the distributed setup. We further define a random effect base learner $b_0$ with basis transformation $g_0(x_0) = (\mathbb{1}_{\{1\}}(x_0), \ldots, \mathbb{1}_{\{S\}}(x_0))^\top$ and design matrix $Z_0 \in \mathbb{R}^{n \times S}$. We use $b_0$ to extend $\mathcal{B}$ with a second set of base learners $\mathcal{B}_\times = \{b_0 \times b \mid b \in \mathcal{B}\}$ to model site-specific random effects. All base learners in $\mathcal{B}_\times$ are row-wise tensor product base learners $b_{l_\times} = b_0 \times b_l$ of the regularized categorical base learner $b_0$ dummy-encoding every site and all other existing base learners $b_l \in \mathcal{B}$. This allows for potential inclusion of random effects for every fixed effect in the model. More specifically, the $l$th site-specific effect given by the row-wise tensor product base learner $b_{l_\times}$ uses the basis transformation $g_{l_\times} = g_0 \otimes g_l$,

$g_{l_\times}(\tilde{x}) = g_0(x_0)^\top \otimes g_l(x)^\top = \bigl(\underbrace{\mathbb{1}_{\{1\}}(x_0)\, g_l(x)^\top}_{= g_{l_\times,1}}, \ldots, \underbrace{\mathbb{1}_{\{S\}}(x_0)\, g_l(x)^\top}_{= g_{l_\times,S}}\bigr)^\top, \qquad (2)$

where the basis transformation $g_l$ is equal for all $S$ sites. After distributed computation (see Eq. (4) in the next section), the estimated coefficients are $\hat{\theta}_{l_\times} = (\hat{\theta}_{l_\times,1}^\top, \ldots, \hat{\theta}_{l_\times,S}^\top)^\top$ with $\hat{\theta}_{l_\times,s} \in \mathbb{R}^{d_l}$. The regularization of the row-wise Kronecker base learners not only controls their flexibility but also assures identifiability when additionally including a shared (fixed) effect for the same covariate. The penalty matrix $P_{l_\times} = \lambda_0 K_0 \otimes I_{d_l} + I_S \otimes \lambda_{l_\times} K_l \in \mathbb{R}^{S d_l \times S d_l}$ is given as the Kronecker sum of the penalty matrices $K_0$ and $K_l$ with respective regularization strengths $\lambda_0, \lambda_{l_\times}$. As $K_0 = I_S$ is a diagonal matrix, $P_{l_\times}$ is a block matrix with entries $\lambda_0 I_{d_l} + \lambda_{l_\times} K_l$ on the diagonal blocks. Moreover, as $g_0$ is a binary vector, we can also express the design matrix $Z_{l_\times} \in \mathbb{R}^{n \times S d_l}$ as a block matrix, yielding

$Z_{l_\times} = \operatorname{diag}(Z_{l,1}, \ldots, Z_{l,S}), \qquad P_{l_\times} = \operatorname{diag}(\lambda_0 I_{d_l} + \lambda_{l_\times} K_l, \ldots, \lambda_0 I_{d_l} + \lambda_{l_\times} K_l), \qquad (3)$

where $Z_{l,s}$ are the distributed design matrices of $b_l$ on sites $s = 1, \ldots, S$. This Kronecker sum penalty induces a centering of the site-specific effects around zero and, hence, allows the interpretation as deviation from the main effect. Note that possible heredity constraints, such as the one described in Wu and Hamada (2011), are not necessarily met when decomposing effects in this way. However, introducing a restriction that forces the inclusion of the shared effect whenever the respective site-specific effect is selected is a straightforward extension without impairing our proposed framework and without increasing computational costs.
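To illustrate the structure exploited in Eqs. (2) and (3), the following minimal R sketch (ours; toy dimensions, all variable names illustrative) builds the design matrix of a site-specific base learner both as a row-wise Kronecker product with the site dummies and as a block-diagonal matrix, together with the block-diagonal penalty:

# Minimal R sketch (illustrative): when observations are sorted by site, the design of
# b_0 x b_l equals diag(Z_{l,1}, ..., Z_{l,S}), and the penalty P_{lx} is block-diagonal
# with blocks lambda_0 I + lambda_l K_l (cf. Eqs. (2) and (3)).
set.seed(4)
S <- 3; d <- 4; n_s <- c(5, 6, 7)
Z_site <- lapply(n_s, function(ns) matrix(rnorm(ns * d), ns, d))   # Z_{l,s}
K <- crossprod(diff(diag(d)))                                      # some penalty K_l
lambda0 <- 1; lambda_l <- 0.5

blockdiag <- function(mats) {                                      # small helper
  out <- matrix(0, sum(sapply(mats, nrow)), sum(sapply(mats, ncol)))
  ro <- co <- 0
  for (m in mats) {
    out[ro + seq_len(nrow(m)), co + seq_len(ncol(m))] <- m
    ro <- ro + nrow(m); co <- co + ncol(m)
  }
  out
}
Zx_block <- blockdiag(Z_site)                                      # diag(Z_{l,1}, ..., Z_{l,S})
Px <- kronecker(diag(S), lambda0 * diag(d) + lambda_l * K)         # block-diagonal penalty

# the same design arises as row-wise Kronecker product of the site dummies g_0 with g_l
site <- factor(rep(seq_len(S), times = n_s))
Z0 <- model.matrix(~ 0 + site)                                     # binary site encoding
Zl <- do.call(rbind, Z_site)
Zx_rowwise <- t(sapply(seq_len(nrow(Zl)), function(i) kronecker(Z0[i, ], Zl[i, ])))
all.equal(Zx_block, Zx_rowwise, check.attributes = FALSE)          # TRUE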
3.3 Fitting algorithm

We now describe the adaptations required to allow for distributed computations of the CWB fitting routine. In Sects. 3.3.1 and 3.3.2, we show the equality between our distributed fitting approach and CWB fitted on pooled data. Section 3.3.3 describes the remaining details such as distributed SSE calculations, distributed model updates, and pseudo residual updates in the distributed setup. Section 3.4 summarizes the distributed CWB algorithm and Sect. 3.5 elaborates on the communication costs of our algorithm.

3.3.1 Distributed shared effects computation

Fitting CWB in a distributed fashion requires adapting the fitting process of the base learner $b_l$ in Algorithm 1 to distributed data. To allow for shared effects computations across different sites without jeopardizing privacy, we take advantage of CWB's update scheme, which boils down to a (penalized) least squares estimation per iteration for every base learner. This allows us to build upon existing work such as Karr et al. (2005) to fit linear models in a distributed fashion by just communicating aggregated statistics between sites and the host.

In a first step, the aggregated matrices $F_{l,s} = Z_{l,s}^\top Z_{l,s}$ and vectors $u_{l,s} = Z_{l,s}^\top y_s$ are computed on each site. In our privacy setup (Sect. 2.4), communicating $F_{l,s}$ and $u_{l,s}$ is allowed as long as the privacy-aggregation level per site is met. In a second step, the site information is aggregated to the global information $F_l = \sum_{s=1}^S F_{l,s} + P_l$ and $u_l = \sum_{s=1}^S u_{l,s}$ and then used to estimate the model parameters $\hat{\theta}_l = F_l^{-1} u_l$. This approach, referred to as distFit, is explained again in detail in Algorithm 2 and used for the shared effect computations of the model by substituting $\hat{\theta}_l^{[m]} = (Z_l^\top Z_l + P_l)^{-1} Z_l^\top \tilde{r}^{[m]}$ (Algorithm 1 line 6) with $\hat{\theta}_l^{[m]} = \mathrm{distFit}(Z_{l,1}, \ldots, Z_{l,S}, \tilde{r}_1^{[m]}, \ldots, \tilde{r}_S^{[m]}, P_l)$. Note that the pseudo residuals $\tilde{r}_s^{[m]}$ are also securely located at each site and are updated after each iteration. Details about the distributed pseudo residual updates are explained in Sect. 3.3.3. We also note that the computational complexity of fitting CWB can be drastically reduced by pre-calculating and storing $(Z_l^\top Z_l + P_l)^{-1}$ in a first initialization step, as the matrix is independent of iteration $m$, and reusing these pre-calculated matrices in all subsequent iterations (cf. Schalk et al. 2023). Using pre-calculated matrices also reduces the amount of required communication between sites and host.

Algorithm 2 Distributed Effect Estimation.
The line prefixes [S] and [H] indicate whether the operation is conducted at the sites ([S]) or at the host ([H]).
Input Site design matrices $Z_{l,1}, \ldots, Z_{l,S}$, response vectors $y_1, \ldots, y_S$, and an optional penalty matrix $P_l$.
Output Estimated parameter vector $\hat{\theta}_l$.
1: procedure distFit($Z_{l,1}, \ldots, Z_{l,S}, y_1, \ldots, y_S, P_l$)
2:   for $s \in \{1, \ldots, S\}$ do
3:     [S] $F_{l,s} = Z_{l,s}^\top Z_{l,s}$
4:     [S] $u_{l,s} = Z_{l,s}^\top y_s$
5:     [S] Communicate $F_{l,s}$ and $u_{l,s}$ to the host
6:   end for
7:   [H] $F_l = \sum_{s=1}^S F_{l,s} + P_l$
8:   [H] $u_l = \sum_{s=1}^S u_{l,s}$
9:   [H] return $\hat{\theta}_l = F_l^{-1} u_l$
10: end procedure

3.3.2 Distributed site-specific effects computation

If we pretend that the fitting of the base learner $b_{l_\times}$ is performed on the pooled data, we obtain

$\hat{\theta}_{l_\times} = \bigl(Z_{l_\times}^\top Z_{l_\times} + P_{l_\times}\bigr)^{-1} Z_{l_\times}^\top y = \begin{pmatrix} (Z_{l,1}^\top Z_{l,1} + \lambda_0 I_{d_l} + P_l)^{-1} Z_{l,1}^\top y_1 \\ \vdots \\ (Z_{l,S}^\top Z_{l,S} + \lambda_0 I_{d_l} + P_l)^{-1} Z_{l,S}^\top y_S \end{pmatrix}, \qquad (4)$

where (4) is due to the block structure described in (3) of Sect. 3.2. This shows that the fitting of the site-specific effects $\hat{\theta}_{l_\times}$ can be split up into the fitting of the individual parameters

$\hat{\theta}_{l_\times,s} = (Z_{l,s}^\top Z_{l,s} + \lambda_0 I_{d_l} + P_l)^{-1} Z_{l,s}^\top y_s. \qquad (5)$

It is thus possible to compute site-specific effects at the respective site without the need to share any information with the host. The host, in turn, only requires the SSE of the respective base learner (see Sect. 3.3.3) to perform the next iteration of CWB. Hence, during the fitting process, the parameter estimates remain at their sites and are just updated if the site-specific base learner is selected. This again minimizes the amount of data communication between sites and host and speeds up the fitting process. After the fitting phase, the aggregated site-specific parameters are communicated once in a last communication step to obtain the final model. A possible alternative implementation that circumvents the need to handle site-specific heterogeneity separately is to apply the estimation scheme of main effects (Algorithm 2). While this simplifies computation, it would increase communication costs and, hence, runtime.

3.3.3 Pseudo residual updates, SSE calculation, and base learner selection

The remaining challenges to run the distributed CWB algorithm are (1) the pseudo residual calculation (Algorithm 1 line 4), (2) the SSE calculation (Algorithm 1 line 7), and (3) the base learner selection (Algorithm 1 line 9).

Distributed pseudo residual updates The site-specific response vector $y_s$ containing the values $y^{(i)}$, $i \in \{1, \ldots, n_s\}$, is the basis of the pseudo residual calculation. We assume that every site $s$ has access to all shared effects as well as the site-specific information of all site-specific base learners $b_{l_\times}$, only containing the respective parameters $\hat{\theta}_{l_\times,s}$. Based on these base learners, it is thus possible to compute a site model $\hat{f}_s^{[m]}$ as a representative of $\hat{f}^{[m]}$ on every site $s$. The pseudo residual updates $\tilde{r}_s^{[m]}$ per site are then based on $\hat{f}_s^{[m]}$ via $\tilde{r}_s^{[m](i)} = -\nabla_f L(y^{(i)}, f(x^{(i)}))|_{f = \hat{f}_s^{[m-1]}}$, $i \in \{1, \ldots, n_s\}$, using $\mathcal{D}_s$. Most importantly, all remaining steps of the distributed CWB fitting procedure do not share the pseudo residuals $\tilde{r}_s^{[m]}$ in order to avoid information leakage about $y_s$.
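As a concrete check of the lossless property of the two routines above (Algorithm 2 for shared effects and Eq. (5) for site-specific effects), here is a minimal R sketch (ours, with toy data; not the dsCWB implementation) in which only the aggregated quantities $F_{l,s}$ and $u_{l,s}$ leave the sites:

# Minimal R sketch (illustrative) of the distributed fitting routines of Sect. 3.3:
# shared effects via distFit, where each site only transmits F_{l,s} and u_{l,s},
# and site-specific effects via the per-site solves of Eq. (5).
set.seed(5)
S <- 3; d <- 4; n_s <- c(50, 60, 70)
Z_site <- lapply(n_s, function(ns) matrix(rnorm(ns * d), ns, d))   # Z_{l,s}
r_site <- lapply(n_s, function(ns) rnorm(ns))                      # pseudo residuals r_s
K <- crossprod(diff(diag(d))); lambda_l <- 0.5; lambda0 <- 1
P_l <- lambda_l * K                                                # penalty of b_l

# --- shared effect (Algorithm 2): only aggregated statistics leave the sites ---
F_s <- lapply(Z_site, crossprod)                                   # [S] F_{l,s}
u_s <- mapply(crossprod, Z_site, r_site, SIMPLIFY = FALSE)         # [S] u_{l,s}
theta_shared <- solve(Reduce(`+`, F_s) + P_l, Reduce(`+`, u_s))    # [H] aggregate and solve

Z_pool <- do.call(rbind, Z_site); r_pool <- unlist(r_site)
theta_pooled <- solve(crossprod(Z_pool) + P_l, crossprod(Z_pool, r_pool))
max(abs(theta_shared - theta_pooled))                              # numerically zero: lossless

# --- site-specific effect (Eq. (5)): each block is solved locally and never shared ---
theta_site <- mapply(function(Z, r)
  solve(crossprod(Z) + lambda0 * diag(d) + P_l, crossprod(Z, r)),
  Z_site, r_site, SIMPLIFY = FALSE)
length(theta_site)                                                 # one coefficient block per site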
Distributed SSE calculation and base learner selection After fitting all base learners $b_l \in \mathcal{B}$ and $b_{l_\times} \in \mathcal{B}_\times$ to $\tilde{r}_s^{[m]}$, we obtain $\hat{\theta}_l^{[m]}$, $l = 1, \ldots, |\mathcal{B}|$, and $\hat{\theta}_{l_\times}^{[m]}$, $l_\times = 1_\times, \ldots, |\mathcal{B}_\times|$. Calculating the SSE distributively for the $l$th and $l_\times$th base learner $b_l$ and $b_{l_\times}$, respectively, requires calculating $2S$ site-specific SSE values:

$\mathrm{SSE}_{l,s} = \sum_{i=1}^{n_s} \bigl(\tilde{r}_s^{[m](i)} - b_l(x_s^{(i)}, \hat{\theta}_l^{[m]})\bigr)^2 = \sum_{i=1}^{n_s} \bigl(\tilde{r}_s^{[m](i)} - g_l(x_s^{(i)})^\top \hat{\theta}_l^{[m]}\bigr)^2,$

$\mathrm{SSE}_{l_\times,s} = \sum_{i=1}^{n_s} \bigl(\tilde{r}_s^{[m](i)} - b_{l_\times}(x_s^{(i)}, \hat{\theta}_{l_\times}^{[m]})\bigr)^2 = \sum_{i=1}^{n_s} \bigl(\tilde{r}_s^{[m](i)} - g_l(x_s^{(i)})^\top \hat{\theta}_{l_\times,s}^{[m]}\bigr)^2.$

The site-specific SSE values are then sent to the host and aggregated to $\mathrm{SSE}_l = \sum_{s=1}^S \mathrm{SSE}_{l,s}$.

Algorithm 3 Distributed CWB Algorithm.
The line prefixes [S] and [H] indicate whether the operation is conducted at the sites ([S]) or at the host ([H]).
Input Sites with site data $\mathcal{D}_s$, learning rate $\nu$, number of boosting iterations $M$, loss function $L$, set of shared effects $\mathcal{B}$ and respective site-specific effects $\mathcal{B}_\times$
Output Prediction model $\hat{f}$
1: procedure distrCWB($\nu$, $L$, $\mathcal{B}$, $\mathcal{B}_\times$)
2:   Initialization:
3:     [H] Initialize shared model $\hat{f}_{\mathrm{shared}}^{[0]}(x) = \operatorname{arg\,min}_{c \in \mathbb{R}} \mathcal{R}_{\mathrm{emp}}(c)$
4:     [S] Calculate $Z_{l,s}$ and $F_{l,s} = Z_{l,s}^\top Z_{l,s}$, $\forall l \in \{1, \ldots, |\mathcal{B}|\}$, $s \in \{1, \ldots, S\}$
5:     [S] Set $\hat{f}_s^{[0]} = \hat{f}_{\mathrm{shared}}^{[0]}$
6:   for $m \in \{1, \ldots, M\}$ or while an early stopping criterion is not met do
7:     [S] Update pseudo residuals:
8:     [S] $\tilde{r}_s^{[m](i)} = -\nabla_f L(y^{(i)}, f(x^{(i)}))|_{f = \hat{f}_s^{[m-1]}}, \ \forall i \in \{1, \ldots, n_s\}$
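A minimal R sketch (ours, with illustrative values) of the SSE aggregation just described: each site only communicates one scalar per candidate base learner, and the host selects the base learner with the smallest aggregated SSE:

# Minimal R sketch (illustrative): site-wise SSE computation, aggregation at the host,
# and selection of the best-fitting base learner.
set.seed(6)
S <- 3; n_s <- c(40, 50, 60)
r_site <- lapply(n_s, function(ns) rnorm(ns))            # pseudo residuals per site
pred_site <- lapply(n_s, function(ns)                    # toy predictions of two candidate
  list(bl1 = rnorm(ns), bl2 = rnorm(ns)))                # base learners per site

sse_site <- lapply(seq_len(S), function(s)               # [S] SSE_{l,s} per base learner
  vapply(pred_site[[s]], function(p) sum((r_site[[s]] - p)^2), numeric(1)))
sse_total <- Reduce(`+`, sse_site)                       # [H] SSE_l = sum over sites
best <- names(which.min(sse_total))                      # [H] selected base learner
best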
costs in terms of the number of base learners, we distinguish between the initialization phase and the fitting phase.

Initialization As only the sites share $F_{l,s} \in \mathbb{R}^{d \times d}$, $\forall l \in \{1, \ldots, |\mathcal{B}|\}$, the transmitted amount of values is $d^2 |\mathcal{B}|$ for each site and therefore scales linearly with $|\mathcal{B}|$, i.e., $\mathcal{O}(|\mathcal{B}|)$. The host does not communicate any values during the initialization.

Fitting In each iteration, every site shares its vector $Z_{l,s}^\top \tilde{r}_s^{[m]} \in \mathbb{R}^d$, $\forall l \in \{1, \ldots, |\mathcal{B}|\}$. Over the course of $M$ boosting iterations, each site therefore shares $d M |\mathcal{B}|$ values. Every site also communicates the SSE values, i.e., 2 values (index and SSE value) for every base learner and thus $2 M |\mathcal{B}|$ values for all iterations and base learners. In total, each site communicates $M |\mathcal{B}| (d + 2)$ values. The communication costs for all sites are therefore $\mathcal{O}(|\mathcal{B}|)$. The host, in turn, communicates the estimated parameters $\hat{\theta}^{[m]} \in \mathbb{R}^d$ of the $|\mathcal{B}|$ shared effects. Hence, $d M |\mathcal{B}|$ values as well as the index of the best base learner in each iteration are transmitted. In total, the host therefore communicates $d M |\mathcal{B}| + M$ values to the sites, and the costs are therefore also $\mathcal{O}(|\mathcal{B}|)$.

4 Application

We now showcase our algorithm on a heart disease data set that consists of patient data gathered all over the world. The data were collected at four different sites by the (1) Hungarian Institute of Cardiology, Budapest (Andras Janosi, M.D.), (2) University Hospital, Zurich, Switzerland (William Steinbrunn, M.D.), (3) University Hospital, Basel, Switzerland (Matthias Pfisterer, M.D.), and (4) V.A. Medical Center, Long Beach, and Cleveland Clinic Foundation (Robert Detrano, M.D., Ph.D.), and are thus suited for a multi-site distributed analysis. The individual data sets are freely available at https://archive.ics.uci.edu/ml/datasets/heart+disease (Dua and Graff 2017). For our analysis, we set the privacy level (cf. Sect. 2.4) to 5, which is a common default.

4.1 Data description

The raw data set contains 14 covariates, such as the chest pain type (cp), resting blood pressure (trestbps), maximum heart rate (thalach), sex, exercise-induced angina (exang), or ST depression (i.e., abnormal difference of the ST segment from the baseline on an electrocardiogram) induced by exercise relative to rest (oldpeak). A full list of covariates and their abbreviations is given on the data set's website. After removing non-informative (constant) covariates and columns with too many missing values at each site, we obtain $n_{\mathrm{cleveland}} = 303$, $n_{\mathrm{hungarian}} = 292$, $n_{\mathrm{switzerland}} = 116$, and $n_{\mathrm{va}} = 140$ observations and 8 covariates. A table containing the description of the abbreviations of these covariates is given in Table 1 in the Supplementary Material B.1. For our application, we assume that missing values are completely at random and that all data sets are exclusively located at their respective sites. The task is to determine important risk factors for heart disease. The target variable is therefore a binary outcome indicating the presence of heart disease or not.

4.2 Analysis and results

We follow the practices to set up CWB as mentioned in Sect. 2.3.2 and run the distributed CWB algorithm with a learning rate of 0.1 and a maximum number of 100,000 iterations. To determine an optimal stopping iteration for CWB, we use 20% of the data as validation data and set the patience to 5 iterations. In other words, the algorithm stops if no risk improvement on the validation data is observed in 5 consecutive iterations. For the numerical covariates, we use a P-spline with 10 cubic basis functions and second-order difference penalties. All base learners are penalized according to a global degree of freedom that we set to 2.2 (to obtain an unbiased feature selection), while the random intercept is penalized according to 3 degrees of freedom (see the Supplementary Material B.2 for more details). Since we are modelling a binary response variable, the response function $h$ is the inverse logit function $\operatorname{logit}^{-1}(f) = (1 + \exp(-f))^{-1}$. The model for an observation of site $s$, conditional on its random effects $\gamma$, is given in the Supplementary Material B.3.

Results The algorithm stops after $m_{\mathrm{stop}} = 5578$ iterations as the risk on the validation data set starts to increase (cf. Figure 1 in the Supplementary Material B.4) and selects the covariates oldpeak, cp, trestbps, age, sex, restecg, exang, and thalach. Out of these 5578 iterations, the distributed CWB algorithm selects a shared effect in 782 iterations and site-specific effects in 4796 iterations. This indicates that the data is rather heterogeneous and requires site-specific (random) effects. We want to emphasize that the given data is from an observational study and that the sole purpose of our analysis is to better understand the heterogeneity in the data. Hence, the estimated effects have a non-causal interpretation. To alleviate problems that come from such data, and to allow for the estimation of causal effects, one could use, e.g., propensity score matching (Rosenbaum and Rubin 1983) before applying our algorithm. From our application, we can, e.g., see that the data from Hungary could potentially be problematic in this respect. However, note that applying such measures would also have to be done in a privacy-preserving manner. Figure 2 (Left) shows traces of how and when the different additive terms (base learners) entered the model during the fitting process and illustrates the selection process of CWB.
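As a small illustration of the validation-based stopping rule described in Sect. 4.2 above (patience of 5 iterations), the following R sketch (ours; the toy risk values are made up) returns the iteration at which boosting would stop improving:

# Minimal R sketch (illustrative) of patience-based early stopping: stop once the
# validation risk has not improved for `patience` consecutive iterations and return
# the last improving iteration.
early_stop_iteration <- function(val_risk, patience = 5) {
  best <- val_risk[1]; best_iter <- 1; wait <- 0
  for (m in seq_along(val_risk)[-1]) {
    if (val_risk[m] < best) {
      best <- val_risk[m]; best_iter <- m; wait <- 0
    } else {
      wait <- wait + 1
      if (wait >= patience) break
    }
  }
  best_iter
}

# toy example: the risk decreases, then flattens out
risk <- c(1, 0.8, 0.7, 0.65, 0.66, 0.67, 0.66, 0.68, 0.69, 0.7)
early_stop_iteration(risk, patience = 5)   # returns 4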
Fig. 2 Left: Model trace showing how and when the four most selected additive terms entered the model. Right: Variable importance (cf. Au et al. 2019) of selected features in decreasing order

Fig. 4 Comparison of the site-specific effects for oldpeak between the distributed (dsCWB) and pooled CWB approach (compboost) as well as estimates from mgcv
however, require a further advanced technical setup and the need to ensure consistency across sites.

As an alternative to CWB, a penalized likelihood approach like mgcv could be considered for distributed computing. Unlike CWB, which benefits from parallelized base learners, decomposing the entire design matrix for distributed computing with this approach is more intricate. The parallelization strategy of Wood et al. (2017) could be adapted by viewing cores as sites and the main process as the host. However, ensuring privacy for this approach would require additional attention. A notable obstacle for smoothing parameter estimation is the requirement of the Hessian matrix (Wood et al. 2016). Since the Hessian matrix cannot be directly computed from distributed data, methods like subsampling (Umlauf et al. 2023) or more advanced techniques would be necessary to achieve unbiased estimates and convergence of the whole process. In general, unlike CWB, which fits pseudo-residuals using the $L_2$-loss and estimates smoothness implicitly through iterative gradient updates, penalized likelihood approaches such as the one implemented in mgcv are less straightforward to distribute, and a privacy-preserving lossless computation would involve specialized procedures.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s11222-023-10323-2.

Author Contributions DS developed the idea, its theoretical details, and its implementation in R. He further conducted all experiments and practical applications. The manuscript was mainly written by DS. Connections to GAMMs have been worked out by DR, who also wrote the corresponding text passages. BB and DR helped revising and finalizing the manuscript.

Funding Open Access funding enabled and organized by Projekt DEAL.

Declarations

Conflict of interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

Anjum, M.M., Mohammed, N., Li, W., et al.: Privacy preserving collaborative learning of generalized linear mixed model. J. Biomed. Inform. 127, 104008 (2022)
Au, Q., Schalk, D., Casalicchio, G., et al.: Component-wise boosting of targets for multi-output prediction. arXiv preprint arXiv:1904.03943 (2019)
Augustyn, D.R., Wyciślik, Ł., Mrozek, D.: Perspectives of using cloud computing in integrative analysis of multi-omics data. Brief. Funct. Genom. 20(4), 198–206 (2021). https://doi.org/10.1093/bfgp/elab007
Bazeley, P.: Integrative analysis strategies for mixed data sources. Am. Behav. Sci. 56(6), 814–828 (2012)
Bender, A., Rügamer, D., Scheipl, F., et al.: A general machine learning framework for survival analysis. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, pp. 158–173 (2020)
Boyd, K., Lantz, E., Page, D.: Differential privacy for classifier evaluation. In: Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security, pp. 15–23 (2015)
Brockhaus, S., Rügamer, D., Greven, S.: Boosting functional regression models with FDboost. J. Stat. Softw. 94(10), 1–50 (2020)
Brumback, B.A., Ruppert, D., Wand, M.P.: Variable selection and function estimation in additive nonparametric regression using a data-based prior: comment. J. Am. Stat. Assoc. 94(447), 794–797 (1999)
Bühlmann, P., Yu, B.: Boosting with the L2 loss: regression and classification. J. Am. Stat. Assoc. 98(462), 324–339 (2003)
Bühlmann, P., Hothorn, T., et al.: Boosting algorithms: regularization, prediction and model fitting. Stat. Sci. 22(4), 477–505 (2007)
Chen, Y.R., Rezapour, A., Tzeng, W.G.: Privacy-preserving ridge regression on distributed data. Inf. Sci. 451, 34–49 (2018)
Curran, P.J., Hussong, A.M.: Integrative data analysis: the simultaneous analysis of multiple data sets. Psychol. Methods 14(2), 81 (2009)
Dua, D., Graff, C.: UCI machine learning repository. http://archive.ics.uci.edu/ml (2017)
Dwork, C.: Differential privacy. In: International Colloquium on Automata, Languages, and Programming. Springer, pp. 1–12 (2006)
Eilers, P.H., Marx, B.D.: Flexible smoothing with B-splines and penalties. Stat. Sci. 11, 89–102 (1996)
Gambs, S., Kégl, B., Aïmeur, E.: Privacy-preserving boosting. Data Min. Knowl. Disc. 14(1), 131–170 (2007)
Gaye, A., Marcon, Y., Isaeva, J., et al.: DataSHIELD: taking the analysis to the data, not the data to the analysis. Int. J. Epidemiol. 43(6), 1929–1944 (2014)
Hoerl, A.E., Kennard, R.W.: Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12(1), 55–67 (1970)
Hofner, B., Hothorn, T., Kneib, T., et al.: A framework for unbiased model selection based on boosting. J. Comput. Graph. Stat. 20(4), 956–971 (2011)
Jones, E.M., Sheehan, N.A., Gaye, A., et al.: Combined analysis of correlated data when data cannot be pooled. Stat 2(1), 72–85 (2013)
Karr, A.F., Lin, X., Sanil, A.P., et al.: Secure regression on distributed databases. J. Comput. Graph. Stat. 14(2), 263–279 (2005)
Lazarevic, A., Obradovic, Z.: The distributed boosting algorithm. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 311–316 (2001)
Li, J., Kuang, X., Lin, S., et al.: Privacy preservation for machine learning training and classification based on homomorphic encryption schemes. Inf. Sci. 526, 166–179 (2020a)
Li, Q., Wen, Z., He, B.: Practical federated gradient boosting decision trees. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 4642–4649 (2020b)
Liew, B.X., Rügamer, D., Abichandani, D., et al.: Classifying individuals with and without patellofemoral pain syndrome using ground force profiles—development of a method using functional data boosting. Gait Posture 80, 90–95 (2020)
Litwin, W., Mark, L., Roussopoulos, N.: Interoperability of multiple autonomous databases. ACM Comput. Surv. (CSUR) 22(3), 267–293 (1990)
Lu, C.L., Wang, S., Ji, Z., et al.: WebDISCO: a web service for distributed Cox model learning without patient-level data sharing. J. Am. Med. Inform. Assoc. 22(6), 1212–1219 (2015)
Luo, C., Islam, M., Sheils, N.E., et al.: DLMM as a lossless one-shot algorithm for collaborative multi-site distributed linear mixed models. Nat. Commun. 13(1), 1–10 (2022)
McCullagh, P., Nelder, J.A.: Generalized Linear Models, 2nd edn. Routledge, Milton Park (1989)
McMahan, B., Moore, E., Ramage, D., et al.: Communication-efficient learning of deep networks from decentralized data. In: Artificial Intelligence and Statistics, PMLR, pp. 1273–1282 (2017)
Mirza, B., Wang, W., Wang, J., et al.: Machine learning and integrative analysis of biomedical big data. Genes 10(2), 87 (2019)
Mohassel, P., Zhang, Y.: SecureML: a system for scalable privacy-preserving machine learning. In: 2017 IEEE Symposium on Security and Privacy (SP), IEEE, pp. 19–38 (2017)
Naehrig, M., Lauter, K., Vaikuntanathan, V.: Can homomorphic encryption be practical? In: Proceedings of the 3rd ACM Workshop on Cloud Computing Security Workshop, pp. 113–124 (2011)
Paillier, P.: Public-key cryptosystems based on composite degree residuosity classes. In: International Conference on the Theory and Applications of Cryptographic Techniques. Springer, pp. 223–238 (1999)
R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/ (2021)
Rosenbaum, P.R., Rubin, D.B.: The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), 41–55 (1983)
Rügamer, D., Brockhaus, S., Gentsch, K., et al.: Boosting factor-specific functional historical models for the detection of synchronization in bioelectrical signals. J. R. Stat. Soc.: Ser. C (Appl. Stat.) 67(3), 621–642 (2018). https://doi.org/10.1111/rssc.12241
Saintigny, P., Zhang, L., Fan, Y.H., et al.: Gene expression profiling predicts the development of oral cancer. Cancer Prev. Res. 4(2), 218–229 (2011)
Samarati, P., Sweeney, L.: Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical Report. http://www.csl.sri.com/papers/sritr-98-04/ (1998)
Schalk, D., Hoffmann, V.S., Bischl, B., et al.: Distributed non-disclosive validation of predictive models by a modified ROC-GLM. arXiv preprint arXiv:2203.10828 (2022)
Schalk, D., Bischl, B., Rügamer, D.: Accelerated component-wise gradient boosting using efficient data representation and momentum-based optimization. J. Comput. Graph. Stat. 32(2), 631–641 (2023). https://doi.org/10.1080/10618600.2022.2116446
Schmid, M., Hothorn, T.: Boosting additive models using component-wise P-splines. Comput. Stat. Data Anal. 53(2), 298–311 (2008)
Sweeney, L.: k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(05), 557–570 (2002)
Umlauf, N., Seiler, J., Wetscher, M., et al.: Scalable estimation for structured additive distributional regression. arXiv preprint arXiv:2301.05593 (2023)
Ünal, A.B., Pfeifer, N., Akgün, M.: ppAURORA: privacy preserving area under receiver operating characteristic and precision-recall curves with secure 3-party computation. arXiv 2102 (2021)
Wood, S.N.: Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC, Boca Raton (2017)
Wood, S.N., Pya, N., Säfken, B.: Smoothing parameter and model selection for general smooth models. J. Am. Stat. Assoc. 111(516), 1548–1563 (2016). https://doi.org/10.1080/01621459.2016.1180986
Wood, S.N., Li, Z., Shaddick, G., et al.: Generalized additive models for gigadata: modeling the U.K. black smoke network daily data. J. Am. Stat. Assoc. 112(519), 1199–1210 (2017). https://doi.org/10.1080/01621459.2016.1195744
Wu, C.J., Hamada, M.S.: Experiments: Planning, Analysis, and Optimization. Wiley, Hoboken (2011)
Wu, Y., Jiang, X., Kim, J., et al.: Grid binary logistic regression (GLORE): building shared models without sharing data. J. Am. Med. Inform. Assoc. 19(5), 758–764 (2012)
Yan, Z., Zachrison, K.S., Schwamm, L.H., et al.: Fed-GLMM: a privacy-preserving and computation-efficient federated algorithm for generalized linear mixed models to analyze correlated electronic health records data. medRxiv (2022)
Zhu, R., Jiang, C., Wang, X., et al.: Privacy-preserving construction of generalized linear mixed model for biomedical computation. Bioinformatics 36(Supplement 1), i128–i135 (2020). https://doi.org/10.1093/bioinformatics/btaa478

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.