
Statistics and Computing (2024) 34:31

https://doi.org/10.1007/s11222-023-10323-2

ORIGINAL PAPER

Privacy-preserving and lossless distributed estimation of high-dimensional generalized additive mixed models

Daniel Schalk1,2 · Bernd Bischl1,2 · David Rügamer1,2

Received: 7 March 2023 / Accepted: 2 October 2023 / Published online: 7 November 2023
© The Author(s) 2023

Abstract
Various privacy-preserving frameworks that respect the individual’s privacy in the analysis of data have been developed in
recent years. However, available model classes such as simple statistics or generalized linear models lack the flexibility required
for a good approximation of the underlying data-generating process in practice. In this paper, we propose an algorithm for
a distributed, privacy-preserving, and lossless estimation of generalized additive mixed models (GAMM) using component-
wise gradient boosting (CWB). Making use of CWB allows us to reframe the GAMM estimation as a distributed fitting
of base learners using the L2-loss. In order to account for the heterogeneity of different data location sites, we propose a
distributed version of a row-wise tensor product that allows the computation of site-specific (smooth) effects. Our adaption of
CWB preserves all the important properties of the original algorithm, such as an unbiased feature selection and the feasibility
to fit models in high-dimensional feature spaces, and yields equivalent model estimates as CWB on pooled data. Next to a
derivation of the equivalence of both algorithms, we also showcase the efficacy of our algorithm on a distributed heart disease
data set and compare it with state-of-the-art methods.

Keywords Distributed computing · Functional gradient descent · Generalized linear mixed model · Machine learning ·
Privacy-preserving modelling

Correspondence: Daniel Schalk (daniel.schalk@stat.uni-muenchen.de) · Bernd Bischl (bernd.bischl@stat.uni-muenchen.de) · David Rügamer (david.ruegamer@stat.uni-muenchen.de)
1 Department of Statistics, LMU Munich, Munich, Germany
2 Munich Center for Machine Learning (MCML), Munich, Germany

1 Introduction

More than ever, data is collected to record the ubiquitous information in our everyday life. However, on many occasions, the physical location of data points is not confined to one place (one global site) but distributed over different locations (sites). This is the case for, e.g., patient records that are gathered at different hospitals but usually not shared between hospitals or other facilities due to the sensitive information they contain. This makes data analysis challenging, particularly if methods require or notably benefit from incorporating all available (but distributed) information. For example, personal patient information is typically distributed over several hospitals, while sharing or merging different data sets in a central location is prohibited. To overcome this limitation, different approaches have been developed to directly operate at different sites and unite information without having to share sensitive parts of the data to allow privacy-preserving data analysis.

Distributed data Distributed data can be partitioned vertically or horizontally across different sites. Horizontally partitioned data means that observations are spread across different sites with access to all existing features of the available data point, while for vertically partitioned data, different sites have access to all observations but different features (covariates) for each of these observations. In this work, we focus on horizontally partitioned data. Existing approaches for horizontally partitioned data vary from fitting regression models such as generalized linear models (GLMs; Wu et al. 2012; Lu et al. 2015; Jones et al. 2013; Chen et al. 2018), to conducting distributed evaluations (Boyd et al. 2015; Ünal et al. 2021; Schalk et al. 2022), to fitting artificial neural networks (McMahan et al. 2017).


Furthermore, various software frameworks are available to run a comprehensive analysis of distributed data. One example is the collection of R (R Core Team 2021) packages DataSHIELD (Gaye et al. 2014), which enables data management and descriptive data analysis as well as the secure fitting of simple statistical models in a distributed setup without leaking information from one site to the others.

Interpretability and data heterogeneity In many research areas that involve critical decision-making, especially in medicine, methods should not only excel in predictive performance but also be interpretable. Models should provide information about the decision-making process, the feature effects, and the feature importance, as well as intrinsically select important features. Generalized additive models (GAMs; see, e.g., Wood 2017) are one of the most flexible approaches in this respect, providing interpretable yet complex models that also allow for non-linearity in the data. As longitudinal studies are often the most practical way to gather information in many research fields, methods should also be able to account for subject-specific effects and for the correlation of repeated measurements. Furthermore, when analyzing data originating from different sites, the assumption of having identically distributed observations across all sites often does not hold. In this case, a reasonable assumption for the data-generating process is a site-specific deviation from the general population mean. Adjusting models to this situation is called interoperability (Litwin et al. 1990), while ignoring it may lead to biased or wrong predictions.

1.1 Related literature

Various approaches for distributed and privacy-preserving analysis have been proposed in recent years. In the context of statistical models, Karr et al. (2005) describe how to calculate a linear model (LM) in a distributed and privacy-preserving fashion by sharing data summaries. Jones et al. (2013) propose a similar approach for GLMs by communicating the Fisher information and score vector to conduct a distributed Fisher scoring algorithm. The site information is then globally aggregated to estimate the model parameters. Other privacy-preserving techniques include ridge regression (Chen et al. 2018), logistic regression, and neural networks (Mohassel and Zhang 2017).

In machine learning, methods such as the naive Bayes classifier, trees, support vector machines, and random forests (Li et al. 2020a) exist with specific encryption techniques (e.g., the Paillier cryptosystem; Paillier 1999) to conduct model updates. In these setups, a trusted third party is usually required. However, this is often unrealistic and difficult to implement, especially in a medical or clinical setup. Furthermore, as encryption is an expensive operation, its application is infeasible for complex algorithms that require many encryption calls (Naehrig et al. 2011). Existing privacy-preserving boosting techniques often focus on the AdaBoost algorithm by using aggregation techniques of the base classifier (Lazarevic and Obradovic 2001; Gambs et al. 2007). A different approach to boosting decision trees in a federated learning setup was introduced by Li et al. (2020b) using locality-sensitive hashing to obtain similarities between data sets without sharing private information. These algorithms focus on aggregating tree-based base components, making them difficult to interpret, and come with no inferential guarantees.

In order to account for repeated measurements, Luo et al. (2022) propose a privacy-preserving and lossless way to fit linear mixed models (LMMs) to correct for heterogeneous site-specific random effects. Their concept of only sharing aggregated values is similar to our approach, but it is limited in the complexity of the model and only allows normally distributed outcomes. Other methods to estimate LMMs in a secure and distributed fashion are Zhu et al. (2020), Anjum et al. (2022), or Yan et al. (2022).

Besides privacy-preserving and distributed approaches, integrative analysis is another technique based on pooling the data sets into one and analyzing this pooled data set while considering challenges such as heterogeneity or the curse of dimensionality (Curran and Hussong 2009; Bazeley 2012; Mirza et al. 2019). While advanced from a technical perspective by, e.g., outsourcing computationally demanding tasks such as the analysis of multi-omics data to cloud services (Augustyn et al. 2021), the existing statistical cloud-based methods only deal with basic statistics. The challenges of integrative analysis are similar to the ones tackled in this work; our approach, however, does not allow merging the data sets in order to preserve privacy.

1.2 Our contribution

This work presents a method to fit generalized additive mixed models (GAMMs) in a privacy-preserving and lossless manner1 to horizontally distributed data. This not only allows the incorporation of site-specific random effects and accounts for repeated measurements in LMMs, but also facilitates the estimation of mixed models with responses following any distribution from the exponential family and provides the possibility to estimate complex non-linear relationships between covariates and the response. To the best of our knowledge, we are the first to provide an algorithm to fit the class of GAMMs in a privacy-preserving and lossless fashion on distributed data.

1 In this article, we define a distributed fitting procedure as lossless if the model parameters of the algorithm are the same as the ones computed on the pooled data.

Our approach is based on component-wise gradient boosting (CWB; Bühlmann and Yu 2003).


CWB can be used to estimate additive models, account for repeated measurements, compute feature importance, and conduct feature selection. Furthermore, CWB is suited for high-dimensional data situations (n ≪ p). CWB is therefore often used in practice for, e.g., predicting the development of oral cancer (Saintigny et al. 2011), classifying individuals with and without patellofemoral pain syndrome (Liew et al. 2020), or detecting synchronization in bioelectrical signals (Rügamer et al. 2018). However, there have so far not been any attempts to allow for a distributed, privacy-preserving, and lossless computation of the CWB algorithm. In this paper, we propose a distributed version of CWB that yields the identical model produced by the original algorithm on pooled data and that accounts for site heterogeneity by including interactions between features and a site variable. This is achieved by adjusting the fitting process using (1) a distributed estimation procedure, (2) a distributed version of row-wise tensor product base learners, and (3) an adaption of the algorithm to conduct feature selection in the distributed setup. Figure 1 sketches the proposed distributed estimation procedure.

Fig. 1 Method overview of the proposed distributed CWB approach with one main CWB model maintained by a host (center) and distributed computations on different sites (three as an example in this case) that incorporate and provide site-specific information while preserving privacy

We implement our method in R using the DataSHIELD framework and demonstrate its application in an exemplary medical data analysis. Our distributed version of the original CWB algorithm does not have any additional hyperparameters (HPs) and uses optimization strategies from previous research results to define meaningful values for all HPs, effectively yielding a tuning-free method.

The remainder of this paper is structured as follows: First, we introduce the basic notation, terminology, and setup of GAMMs in Sect. 2. We then describe the original CWB algorithm in Sect. 2.3 and its link to GAMMs. In Sect. 3, we present the distributed setup and our novel extension of the CWB algorithm. Finally, Sect. 4 demonstrates both how our distributed CWB algorithm can be used in practice and how to interpret the obtained results.

Implementation We implement our approach as an R package using the DataSHIELD framework and make it available on GitHub.2 The code for the analysis can also be found in the repository.3

2 github.com/schalkdaniel/dsCWB.
3 github.com/schalkdaniel/dsCWB/blob/main/usecase/analyse.R.

2 Background

2.1 Notation and terminology

Our proposed approach uses the CWB algorithm as fitting engine. Since this method was initially developed in machine learning, we introduce here both the statistical notation used for GAMMs as well as the respective machine learning terminology and explain how to relate the two concepts.

We assume a p-dimensional covariate or feature space X = (X_1 × ... × X_p) ⊆ R^p and response or outcome values from a target space Y. The goal of boosting is to find the unknown relationship f between X and Y. In turn, GAMMs (as presented in Sect. 2.2) model the conditional distribution of an outcome variable Y with realizations y ∈ Y, given features x = (x_1, ..., x_p) ∈ X. Given a data set D = {(x^(1), y^(1)), ..., (x^(n), y^(n))} with n observations drawn (conditionally) independently from an unknown probability distribution P_xy on the joint space X × Y, we aim to estimate this functional relationship in CWB with f̂. The goodness-of-fit of a given model f̂ is assessed by calculating the empirical risk R_emp(f̂) = n^(-1) Σ_{(x,y)∈D} L(y, f̂(x)) based on a loss function L : Y × R → R and the data set D. Minimizing R_emp using this loss function is equivalent to estimating f using maximum likelihood by defining L(y, f(x)) = −ℓ(y, h(f(x))) with log-likelihood ℓ, response function h, and minimizing the sum of log-likelihood contributions.

In the following, we also require the vector x_j = (x_j^(1), ..., x_j^(n))^T ∈ X_j, which refers to the jth feature. Furthermore, let x = (x_1, ..., x_p) and y denote arbitrary members of X and Y, respectively. A special role is further given to a subset u = (u_1, ..., u_q), q ≤ p, of features x, which will be used to model the heterogeneity in the data.
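To make the loss-likelihood correspondence above concrete, the following minimal R sketch (not taken from the paper's implementation; all names and numbers are illustrative) evaluates the empirical risk for a binary outcome, where L(y, f) is the Bernoulli negative log-likelihood of y given h(f) with h the inverse logit.

```r
# Minimal sketch: empirical risk for a binary outcome under the logit link.
# L(y, f) = -log-likelihood of y given h(f), with h = inverse logit.
loss_binomial <- function(y, f) {
  p <- 1 / (1 + exp(-f))  # response function h applied to the additive predictor f
  -(y * log(p) + (1 - y) * log(1 - p))
}

# Empirical risk R_emp(f_hat) = n^{-1} * sum of losses over the data set D.
risk_emp <- function(y, f) mean(loss_binomial(y, f))

# Toy data: n observations with a single feature and a known linear signal.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = 1 / (1 + exp(-0.5 * x)))

# Risk of a constant model versus a model using the true signal:
risk_emp(y, f = rep(0, n))
risk_emp(y, f = 0.5 * x)
```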


2.2 Generalized additive mixed models

A very flexible class of regression models to model the relationship between covariates and the response are GAMMs (see, e.g., Wood 2017). In GAMMs, the response Y^(i) for observation i = 1, ..., n_s of measurement unit (or site) s is assumed to follow some exponential family distribution such as the Poisson, binomial, or normal distributions (see, e.g., McCullagh and Nelder 1989), conditional on features x^(i) and the realization of some random effects. The expectation μ^(i) := E(Y^(i) | x^(i), u^(i)) of the response Y^(i) for observations i = 1, ..., n_s of measurement unit (or site) s in GAMMs is given by

h^(-1)(μ^(i)) = f^(i) = Σ_{j∈J_1} x_j^(i) β_j + Σ_{j∈J_2} u_j^(i) γ_{j,s} + Σ_{j∈J_3} φ_j(x_j^(i)).   (1)

In (1), h is a smooth monotonic response function, f corresponds to the additive predictor, γ_{j,s} ~ N(0, ψ) are random effects accounting for heterogeneity in the data, and φ_j are non-linear effects of pre-specified covariates. The different index sets J_1, J_2, J_3 ⊆ {1, ..., p} ∪ ∅ indicate which features are modeled as fixed effects, random effects, or non-linear (smooth) effects, respectively. The modeler usually defines these sets. However, as we will also explain later, the use of CWB as a fitting engine allows for automatic feature selection and therefore does not require explicitly defining these sets. In GAMMs, smooth effects are usually represented by (spline) basis functions, i.e., φ_j(x_j) ≈ (B_{j,1}(x_j), ..., B_{j,d_j}(x_j)) θ_j, where θ_j ∈ R^{d_j} are the basis coefficients corresponding to each basis function B_{j,d_j}. The coefficients are typically constrained in their flexibility by adding a quadratic (difference) penalty for (neighboring) coefficients to the objective function to enforce smoothness. GAMMs, as in (1), are not limited to univariate smooth effects φ_j, but allow for higher-dimensional non-linear effects φ(x_{j1}, x_{j2}, ..., x_{jk}). The most common higher-dimensional smooth interaction effects are bivariate effects (k = 2) and can be represented using a bivariate or a tensor product spline basis (see Sect. 2.3.1 for more details). Although higher-order splines with k > 2 are possible, models are often restricted to bivariate interactions for the sake of interpretability and computational feasibility. In Sect. 3, we will further introduce varying coefficient terms φ_{j,s}(x_j) in the model (1), i.e., smooth effects f varying with a second variable s. Analogous to random slopes, s can also be the index set defining observation units of random effects J_2. Using an appropriate distribution assumption for the basis coefficients θ_j, these varying coefficients can then be considered as random smooth effects.
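As a small illustration of the model structure in (1), and not of the estimation procedure, the following R sketch simulates a binary response whose additive predictor combines a linear (fixed) effect, a site-specific random intercept, and a smooth effect; all names and numbers are invented for illustration.

```r
# Sketch of the GAMM structure in Eq. (1): f = x1*beta + gamma_site + phi(x2), mu = h(f).
set.seed(42)
n_per_site <- 50
S <- 3                                  # number of sites / measurement units
site <- rep(seq_len(S), each = n_per_site)
x1 <- rnorm(S * n_per_site)             # feature with a linear (fixed) effect
x2 <- runif(S * n_per_site, -2, 2)      # feature with a smooth effect

beta1 <- 0.8                            # fixed effect coefficient
gamma <- rnorm(S, sd = 0.5)             # site-specific random intercepts, gamma_s ~ N(0, psi)
phi   <- function(x) sin(2 * x)         # a non-linear (smooth) effect

f  <- beta1 * x1 + gamma[site] + phi(x2)   # additive predictor
mu <- 1 / (1 + exp(-f))                    # mu = h(f) with h = inverse logit
y  <- rbinom(length(mu), size = 1, prob = mu)

# Empirical check: average response per site reflects the random intercepts.
tapply(y, site, mean)
```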


2.3 Component-wise boosting

Component-wise (gradient) boosting (CWB; Bühlmann and Yu 2003; Bühlmann et al. 2007) is an iterative algorithm that performs block-coordinate descent steps with blocks (or base learners) corresponding to the additive terms in (1). With a suitable choice of base learners and objective function, CWB allows efficient optimization of GAMMs, even in high-dimensional settings with p ≫ n. We will first introduce the concept of base learners that embed additive terms of the GAMM into boosting and subsequently describe the actual fitting routine of CWB. Lastly, we will describe the properties of the algorithm and explain its connection to model (1).

2.3.1 Base learners

In CWB, the lth base learner b_l : X → R is used to model the contribution of one or multiple features in the model. In this work, we investigate parametrized base learners b_l(x, θ_l) with parameters θ_l ∈ R^{d_l}. For simplicity, we will use θ as a wildcard for the coefficients of either fixed effects, random effects, or spline bases in the following. We assume that each base learner can be represented by a generic basis representation g_l : X → R^{d_l}, x ↦ g_l(x) = (g_{l,1}(x), ..., g_{l,d_l}(x))^T and is linear in the parameters, i.e., b_l(x, θ_l) = g_l(x)^T θ_l. Note that the basis transformation g_l of the lth base learner does not necessarily select the jth feature x_j. This is required to, e.g., let two base learners l and k depend on the same feature x_j. For n observations, we define the design matrix of a base learner b_l as Z_l := (g_l(x^(1)), ..., g_l(x^(n)))^T ∈ R^{n×d_l}. Note that base learners are typically not defined on the whole feature space but on a subset X_l ⊆ X. For example, a common choice for CWB is to define one base learner for every feature x_l ∈ X_l to model the univariate contributions of that feature.

A base learner b_l(x, θ_l) can depend on HPs α_l that are set prior to the fitting process. For example, choosing a base learner using a P-spline (Eilers and Marx 1996) representation requires setting the degree of the basis functions, the order of the difference penalty term, and a parameter λ_l determining the smoothness of the spline. Regularized base learners, in addition, will have pre-defined penalty matrices K_l. For convenience, we further denote the penalty matrix already augmented with the corresponding smoothing parameter with P_l, e.g., P_l = λ_l K_l. In order to represent GAMMs in CWB, the following four base learner types are used.

(Regularized) linear base learners A linear base learner is used to include linear effects of features x_{j1}, ..., x_{jd_l} into the model. The basis transformation is given by g_l(x) = (g_{l,1}(x), ..., g_{l,d_l+1}(x))^T = (1, x_{j1}, ..., x_{jd_l})^T. Linear base learners can be regularized by incorporating a ridge penalization (Hoerl and Kennard 1970) with tunable penalty parameter λ_l as an HP α_l. Fitting a ridge penalized linear base learner to a response vector y ∈ R^n results in the penalized least squares estimator θ̂_l = (Z_l^T Z_l + P_l)^(-1) Z_l^T y with penalty matrix P_l = λ_l K_l, K_l = I_{d_l+1}, where I_d denotes the d-dimensional identity matrix. Often, an unregularized linear base learner is also included to model the contribution of one feature x_j as a linear base learner without penalization. The basis transformation is then given by g_l(x) = (1, x_j)^T and λ_l = 0.

Spline base learners These base learners model smooth effects using univariate splines. A common choice is penalized B-splines (P-splines; Eilers and Marx 1996), where the feature x_j is transformed using a B-spline basis transformation g_l(x) = (B_{l,1}(x_j), ..., B_{l,d_l}(x_j))^T with d_l basis functions g_{l,m} = B_{l,m}, m = 1, ..., d_l. In this case, the choice of the spline order B, the number of basis functions d_l, the penalization term λ_l, and the order v of the difference penalty (represented by a matrix D_l ∈ R^{(d_l−v)×d_l}) are considered HPs α_l of the base learner. The base learner's parameter estimator in general is given by the penalized least squares solution θ̂_l = (Z_l^T Z_l + P_l)^(-1) Z_l^T y, with penalization matrix P_l = λ_l K_l and K_l = D_l^T D_l in the case of P-splines.

Categorical and random effect base learners Categorical features x_j ∈ {1, ..., G} with G ∈ N, G ≥ 2 classes are handled by a binary encoding g_l(x) = (1_{{1}}(x_j), ..., 1_{{G}}(x_j))^T with the indicator function 1_A(x) = 1 if x ∈ A and 1_A(x) = 0 if x ∉ A. A possible alternative encoding is the dummy encoding ğ_l(x) = (1, 1_{{1}}(x_j), ..., 1_{{G−1}}(x_j))^T with reference group G. Similar to linear and spline base learners, it is possible to incorporate a ridge penalization with HP α_l = λ_l. This results in the base learner's penalized least squares estimator θ̂_l = (Z_l^T Z_l + P_l)^(-1) Z_l^T y with penalization matrix P_l = λ_l K_l, K_l = I_G. Due to the mathematical equivalence of ridge penalized linear effects and random effects with normal prior (see, e.g., Brumback et al. 1999), this base learner can further be used to estimate random effect predictions γ̂_j when using categorical features u_j and thereby account for heterogeneity in the data. Hence, this base learner can also be used to model site-specific effects in a distributed system, as outlined later. While such random effects do not directly provide a variance estimate of the different measurement units and are primarily used to account for intra-class correlation, an approximation of the variance components can be retrieved post-model fitting by, e.g., computing the empirical variance of (γ̂_j)_{j∈J_3}.

Row-wise tensor product base learners This type of base learner is used to model a pairwise interaction between two features x_j and x_k. Given two base learners b_j and b_k with basis representations g_j(x) = (g_{j,1}(x_j), ..., g_{j,d_j}(x_j))^T and g_k(x) = (g_{k,1}(x_k), ..., g_{k,d_k}(x_k))^T, the basis representation of the row-wise tensor product base learner b_l = b_j × b_k is defined as g_l(x) = (g_j(x)^T ⊗ g_k(x)^T)^T = (g_{j,1}(x_j) g_k(x)^T, ..., g_{j,d_j}(x_j) g_k(x)^T)^T ∈ R^{d_l} with d_l = d_j d_k. The HPs α_l = {α_j, α_k} of a row-wise tensor product base learner are induced by the HPs α_j and α_k of the respective individual base learners. Analogously to other base learners, the penalized least squares estimator in this case is θ̂_l = (Z_l^T Z_l + P_l)^(-1) Z_l^T y with penalization matrix P_l = τ_j K_j ⊗ I_{d_k} + I_{d_j} ⊗ τ_k K_k ∈ R^{d_l×d_l}. This Kronecker sum penalty, in particular, allows for anisotropic smoothing with penalties τ_j and τ_k when using two spline bases for g_j and g_k, and for varying coefficients or random splines when combining a (penalized) categorical base learner and a spline base learner.

2.3.2 Fitting algorithm

CWB first initializes an estimate f̂ of the additive predictor with a loss-optimal constant value f̂^[0] = arg min_{c∈R} R_emp(c). It then proceeds and estimates Eq. (1) using an iterative steepest descent minimization in function space by fitting the previously defined base learners to the model's functional gradient ∇_f L(y, f) evaluated at the current model estimate f̂. Let f̂^[m] denote the model estimate after m ∈ N iterations. In each step of CWB, the pseudo residuals r̃^[m](i) = −∇_f L(y^(i), f(x^(i)))|_{f = f̂^[m−1]} for i ∈ {1, ..., n} are first computed. CWB then selects the best-fitting base learner from a pre-defined pool of base learners denoted by B = {b_l}_{l∈{1,...,|B|}} and adds the base learner's contribution to the previous model f̂^[m−1]. The selected base learner is chosen based on its sum of squared errors (SSE) when regressing the pseudo residuals r̃^[m] = (r̃^[m](1), ..., r̃^[m](n))^T onto the base learner's features using the L2-loss. Further details of CWB are given in Algorithm 1 (see, e.g., Schalk et al. 2023).

Controlling the HPs of CWB Good estimation performance can be achieved by selecting a sufficiently small learning rate, e.g., 0.01, as suggested in Bühlmann et al. (2007), and adaptively selecting the number of boosting iterations via early stopping on a validation set. To enforce a fair selection of model terms and thus unbiased effect estimation, regularization parameters are set such that all base learners have the same degrees-of-freedom (Hofner et al. 2011). As noted by Bühlmann et al. (2007), choosing smaller degrees-of-freedom induces more penalization (and thus, e.g., smoother estimated functions for spline base learners), which yields a model with lower variance at the cost of a larger bias. This bias induces a shrinkage of the estimated coefficients towards zero but can be reduced by running the optimization process for additional iterations.
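To make a single base learner fit concrete, the following R sketch builds a P-spline basis for one feature, applies a second-order difference penalty, computes the penalized least squares estimate θ̂ = (Z^T Z + λK)^(-1) Z^T r̃ for given pseudo residuals, and evaluates the SSE used for selection. It is a simplified, illustrative stand-in (the paper's implementation, e.g., calibrates λ via degrees of freedom); the full fitting loop is formalized in Algorithm 1 below.

```r
library(splines)

set.seed(1)
n <- 200
x <- runif(n, 0, 1)
r <- sin(2 * pi * x) + rnorm(n, sd = 0.3)   # stand-in for pseudo residuals r~[m]

# Basis transformation g_l(x): cubic B-spline basis with d_l columns.
Z <- bs(x, df = 10, degree = 3, intercept = TRUE)   # n x d_l design matrix Z_l
d_l <- ncol(Z)

# Second-order difference penalty K_l = D^T D and P_l = lambda * K_l.
D <- diff(diag(d_l), differences = 2)
lambda <- 10
P <- lambda * crossprod(D)

# Penalized least squares estimator theta_hat = (Z^T Z + P)^{-1} Z^T r.
theta_hat <- solve(crossprod(Z) + P, crossprod(Z, r))

# Sum of squared errors used for base learner selection.
sse <- sum((r - Z %*% theta_hat)^2)
sse
```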


Algorithm 1 Vanilla CWB algorithm
Input: Train data D, learning rate ν, number of boosting iterations M, loss function L, set of base learners B
Output: Model f̂^[M] defined by fitted parameters θ̂^[1], ..., θ̂^[M]
1: procedure CWB(D, ν, L, B)
2:   Initialize: f̂^[0](x) = arg min_{c∈R} R_emp(c)
3:   for m ∈ {1, ..., M} do
4:     r̃^[m](i) = −∇_f L(y^(i), f(x^(i)))|_{f = f̂^[m−1]}, ∀i ∈ {1, ..., n}
5:     for l ∈ {1, ..., |B|} do
6:       θ̂_l^[m] = (Z_l^T Z_l + P_l)^(-1) Z_l^T r̃^[m]
7:       SSE_l = Σ_{i=1}^n (r̃^[m](i) − b_l(x^(i), θ̂_l^[m]))^2
8:     end for
9:     l^[m] = arg min_{l∈{1,...,|B|}} SSE_l
10:    f̂^[m](x) = f̂^[m−1](x) + ν b_{l^[m]}(x, θ̂_{l^[m]}^[m])
11:   end for
12:   return f̂ = f̂^[M]
13: end procedure

2.3.3 Properties and link to generalized additive mixed models

The estimated coefficients θ̂ resulting from running the CWB algorithm are known to converge to the maximum likelihood solution (see, e.g., Schmid and Hothorn 2008) for M → ∞ under certain conditions. This is due to the fact that CWB performs a coordinate gradient descent update of a model defined by its additive base learners that exactly represent the structure of an additive mixed model (when defining the base learners according to Sect. 2.3.1) and by the objective function that corresponds to the negative (penalized) log-likelihood. Two important properties of this algorithm are (1) its coordinate-wise update routine, and (2) the nature of model updates using the L2-loss. Due to the first property, CWB can be used in settings with p ≫ n, as only a single additive term is fitted onto the pseudo-residuals in every iteration. This not only reduces the computational complexity of the algorithm for an increasing number of additive predictors (linear instead of quadratic) but also allows variable selection when stopping the routine early (e.g., based on a validation data set), as not all the additive components might have been selected into the model. In particular, this allows users to specify the full GAMM model without manual specification of the type of feature effect (fixed or random, linear or non-linear) and then automatically sparsify this model by an objective and data-driven feature selection. The second property allows fitting models of the class of generalized linear/additive (mixed) models using only the L2-loss instead of having to work with some iterative weighted least squares routine. In particular, this allows performing the proposed lossless distributed computations described in this paper, as we will discuss in Sect. 3.

2.4 Distributed computing setup and privacy protection

Before presenting our main results, we now introduce the distributed data setup we will work with throughout the remainder of this paper. The data set D is horizontally partitioned into S data sets D_s = {(x_s^(1), y_s^(1)), ..., (x_s^(n_s), y_s^(n_s))}, s = 1, ..., S, with n_s observations. Each data set D_s is located at a different site s and potentially follows a different data distribution P_{xy,s}. The union of all data sets yields the whole data set D = ∪_{s=1}^S D_s with mutually exclusive data sets D_s ∩ D_l = ∅ ∀ l, s ∈ {1, ..., S}, l ≠ s. The vector of realizations per site is denoted by y_s ∈ Y^{n_s}.

In this distributed setup, multiple ways exist to communicate information without revealing individual information. More complex methods such as differential privacy (Dwork 2006), homomorphic encryption (e.g., the Paillier cryptosystem; Paillier 1999), or k-anonymity (Samarati and Sweeney 1998; Sweeney 2002) allow sharing information without violating an individual's privacy. An alternative option is to only communicate aggregated statistics. This is one of the most common approaches and is also used by DataSHIELD (Gaye et al. 2014) for GLMs or by Luo et al. (2022) for LMMs. DataSHIELD, for example, uses a privacy level that indicates how many individual values must be aggregated to allow the communication of aggregated values. For example, setting the privacy level to a value of 5 enables sharing of summary statistics such as sums, means, variances, etc. if these are computed on at least 5 elements (observations).

Host and site setup Throughout this article, we assume the 1, ..., S sites or servers to have access to their respective data set D_s. Each server is allowed to communicate with a host server that is also the analyst's machine. In this setting, the analyst can potentially see intermediate data used when running the algorithms, and hence each message communicated from the servers to the host must not allow any reconstruction of the original data. The host server is responsible for aggregating intermediate results and communicating these results back to the servers.

3 Distributed component-wise boosting

We now present our distributed version of the CWB algorithm to fit privacy-preserving and lossless GAMMs. In the following, we first describe further specifications of our setup in Sect. 3.1, elaborate on the changes made to the set of base learners in Sect. 3.2, and then show how to adapt CWB's fitting routine in Sect. 3.3.
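As a reference point before the distributed adaptations, the following self-contained R sketch mirrors the structure of Algorithm 1 on pooled data for the L2-loss, using one unpenalized linear base learner per feature. It is an illustrative toy (all names are ours), not the compboost or dsCWB implementation.

```r
# Toy version of Algorithm 1: CWB with one linear base learner per feature and L2-loss.
set.seed(1)
n <- 300; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- 1.5 * X[, 2] - 0.5 * X[, 4] + rnorm(n, sd = 0.5)

cwb <- function(X, y, nu = 0.1, M = 200) {
  p <- ncol(X)
  f_hat <- rep(mean(y), nrow(X))            # loss-optimal constant for the L2-loss
  theta <- numeric(p)                       # accumulated coefficient per base learner
  for (m in seq_len(M)) {
    r <- y - f_hat                          # pseudo residuals for the L2-loss
    # Fit every base learner to the pseudo residuals and record its SSE.
    fit_one <- function(j) {
      zj <- X[, j]
      th <- sum(zj * r) / sum(zj^2)         # least squares estimate for one feature
      c(theta = th, sse = sum((r - th * zj)^2))
    }
    fits <- vapply(seq_len(p), fit_one, c(theta = 0, sse = 0))
    j_best <- which.min(fits["sse", ])      # base learner selection
    theta[j_best] <- theta[j_best] + nu * fits["theta", j_best]
    f_hat <- f_hat + nu * fits["theta", j_best] * X[, j_best]
  }
  list(intercept = mean(y), theta = theta)
}

cwb(X, y)$theta   # informative features receive non-zero coefficients
```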


3.1 Setup

In the following, we distinguish between site-specific and shared effects. As effects estimated across sites typically correspond to fixed effects, and effects modeled for each site separately are usually represented using random effects, we use the terms as synonyms in the following, i.e., shared effects and fixed effects are treated interchangeably, and the same holds for site-specific effects and random effects. We note that this is only for ease of presentation and our approach also allows for site-specific fixed effects and random shared effects. As the data is not only located at different sites but also potentially follows a different data distribution P_{xy,s} at each site s, we extend Eq. (1) to not only include random effects per site, but also site-specific smooth (random) effects φ_{j,s}(x_j), s = 1, ..., S, for all features x_j with j ∈ J_3. For each of these smooth effects φ_{j,s} we assume an existing shared effect f_{j,shared} that is equal for all sites. These assumptions, particularly the choice of site-specific effects, are made for demonstration purposes. In a real-world application, the model structure can be defined individually to match the given data situation. However, note again that CWB intrinsically performs variable selection, and there is thus no need to manually define the model structure in practice. In order to incorporate the site information into the model, we add a variable x_0^(i) ∈ {1, ..., S} for the site to the data by setting x̃^(i) = (x_0^(i), x^(i)). The site variable is a categorical feature with S classes.

3.2 Base learners

For shared effects, we keep the original structure of CWB with base learners chosen from a set of possible learners B. Section 3.3.1 explains how these shared effects are estimated in the distributed setup. We further define a random effect base learner b_0 with basis transformation g_0(x_0) = (1_{{1}}(x_0), ..., 1_{{S}}(x_0))^T and design matrix Z_0 ∈ R^{n×S}. We use b_0 to extend B with a second set of base learners B_× = {b_0 × b | b ∈ B} to model site-specific random effects. All base learners in B_× are row-wise tensor product base learners b_{l×} = b_0 × b_l of the regularized categorical base learner b_0 dummy-encoding every site and all other existing base learners b_l ∈ B. This allows for potential inclusion of random effects for every fixed effect in the model. More specifically, the lth site-specific effect given by the row-wise tensor product base learner b_{l×} uses the basis transformation g_{l×} = g_0 ⊗ g_l,

g_{l×}(x̃) = g_0(x_0)^T ⊗ g_l(x)^T = (1_{{1}}(x_0) g_l(x)^T, ..., 1_{{S}}(x_0) g_l(x)^T)^T,   (2)

with blocks g_{l×,1}, ..., g_{l×,S}, where the basis transformation g_l is equal for all S sites. After distributed computation (see Eq. (4) in the next section), the estimated coefficients are θ̂_{l×} = (θ̂_{l×,1}^T, ..., θ̂_{l×,S}^T)^T with θ̂_{l×,s} ∈ R^{d_l}. The regularization of the row-wise Kronecker base learners not only controls their flexibility but also ensures identifiability when additionally including a shared (fixed) effect for the same covariate. The penalty matrix P_{l×} = λ_0 K_0 ⊗ I_{d_l} + I_S ⊗ λ_{l×} K_l ∈ R^{Sd_l×Sd_l} is given as the Kronecker sum of the penalty matrices K_0 and K_l with respective regularization strengths λ_0, λ_{l×}. As K_0 = I_S is a diagonal matrix, P_{l×} is a block matrix with entries λ_0 I_{d_l} + λ_{l×} K_l on the diagonal blocks. Moreover, as g_0 is a binary vector, we can also express the design matrix Z_{l×} ∈ R^{n×Sd_l} as a block matrix, yielding

Z_{l×} = diag(Z_{l,1}, ..., Z_{l,S}),
P_{l×} = diag(λ_0 I_{d_l} + λ_{l×} K_l, ..., λ_0 I_{d_l} + λ_{l×} K_l),   (3)

where Z_{l,s} are the distributed design matrices of b_l on sites s = 1, ..., S. This Kronecker sum penalty induces a centering of the site-specific effects around zero and, hence, allows the interpretation as deviation from the main effect. Note that possible heredity constraints, such as the one described in Wu and Hamada (2011), are not necessarily met when decomposing effects in this way. However, introducing a restriction that forces the inclusion of the shared effect whenever the respective site-specific effect is selected is a straightforward extension without impairing our proposed framework and without increasing computational costs.
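A minimal R sketch of the structure in (2) and (3): the site-specific design matrix Z_{l×} is block diagonal in the per-site design matrices, and the penalty is the Kronecker sum λ_0 K_0 ⊗ I + I_S ⊗ λ_{l×} K_l. The dense construction, dimensions, and numbers are purely illustrative assumptions, not the package implementation.

```r
# Illustrative construction of Z_lx = diag(Z_l1, ..., Z_lS) and the Kronecker sum penalty.
set.seed(1)
S   <- 3                      # number of sites
d_l <- 4                      # basis dimension of the underlying base learner
n_s <- c(5, 6, 4)             # observations per site

Z_sites <- lapply(n_s, function(n) matrix(rnorm(n * d_l), n, d_l))

# Block diagonal design matrix Z_lx of dimension (sum n_s) x (S * d_l).
Z_lx <- matrix(0, sum(n_s), S * d_l)
row_off <- c(0, cumsum(n_s))
for (s in seq_len(S)) {
  rows <- (row_off[s] + 1):row_off[s + 1]
  cols <- ((s - 1) * d_l + 1):(s * d_l)
  Z_lx[rows, cols] <- Z_sites[[s]]
}

# Kronecker sum penalty: P_lx = lambda0 * K0 %x% I_dl + I_S %x% (lambda_l * K_l).
K0       <- diag(S)                           # ridge penalty of the site base learner
Dmat     <- diff(diag(d_l), differences = 2)
K_l      <- crossprod(Dmat)                   # e.g., a P-spline difference penalty
lambda0  <- 2; lambda_l <- 5
P_lx <- lambda0 * (K0 %x% diag(d_l)) + diag(S) %x% (lambda_l * K_l)

dim(Z_lx); dim(P_lx)   # (sum n_s) x (S*d_l) and (S*d_l) x (S*d_l)
```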


3.3 Fitting algorithm

We now describe the adaptions required to allow for distributed computations of the CWB fitting routine. In Sects. 3.3.1 and 3.3.2, we show the equality between our distributed fitting approach and CWB fitted on pooled data. Section 3.3.3 describes the remaining details such as distributed SSE calculations, distributed model updates, and pseudo residual updates in the distributed setup. Section 3.4 summarizes the distributed CWB algorithm and Sect. 3.5 elaborates on the communication costs of our algorithm.

3.3.1 Distributed shared effects computation

Fitting CWB in a distributed fashion requires adapting the fitting process of the base learner b_l in Algorithm 1 to distributed data. To allow for shared effects computations across different sites without jeopardizing privacy, we take advantage of CWB's update scheme, which boils down to a (penalized) least squares estimation per iteration for every base learner. This allows us to build upon existing work such as Karr et al. (2005) to fit linear models in a distributed fashion by just communicating aggregated statistics between sites and the host.

In a first step, the aggregated matrices F_{l,s} = Z_{l,s}^T Z_{l,s} and vectors u_{l,s} = Z_{l,s}^T y_s are computed on each site. In our privacy setup (Sect. 2.4), communicating F_{l,s} and u_{l,s} is allowed as long as the privacy-aggregation level per site is met. In a second step, the site information is aggregated to the global information F_l = Σ_{s=1}^S F_{l,s} + P_l and u_l = Σ_{s=1}^S u_{l,s} and then used to estimate the model parameters θ̂_l = F_l^(-1) u_l. This approach, referred to as distFit, is explained again in detail in Algorithm 2 and used for the shared effect computations of the model by substituting θ̂_l^[m] = (Z_l^T Z_l + P_l)^(-1) Z_l^T r̃^[m] (Algorithm 1 line 6) with θ̂_l^[m] = distFit(Z_{l,1}, ..., Z_{l,S}, r̃_1^[m], ..., r̃_S^[m], P_l). Note that the pseudo residuals r̃_s^[m] are also securely located at each site and are updated after each iteration. Details about the distributed pseudo residual updates are explained in Sect. 3.3.3. We also note that the computational complexity of fitting CWB can be drastically reduced by pre-calculating and storing (Z_l^T Z_l + P_l)^(-1) in a first initialization step, as the matrix is independent of iteration m, and reusing these pre-calculated matrices in all subsequent iterations (cf. Schalk et al. 2023). Using pre-calculated matrices also reduces the amount of required communication between sites and host.

Algorithm 2 Distributed Effect Estimation
The line prefixes [S] and [H] indicate whether the operation is conducted at the sites ([S]) or at the host ([H]).
Input: Site design matrices Z_{l,1}, ..., Z_{l,S}, response vectors y_1, ..., y_S, and an optional penalty matrix P_l
Output: Estimated parameter vector θ̂_l
1: procedure distFit(Z_{l,1}, ..., Z_{l,S}, y_1, ..., y_S, P_l)
2:   for s ∈ {1, ..., S} do
3:     [S] F_{l,s} = Z_{l,s}^T Z_{l,s}
4:     [S] u_{l,s} = Z_{l,s}^T y_s
5:     [S] Communicate F_{l,s} and u_{l,s} to the host
6:   end for
7:   [H] F_l = Σ_{s=1}^S F_{l,s} + P_l
8:   [H] u_l = Σ_{s=1}^S u_{l,s}
9:   [H] return θ̂_l = F_l^(-1) u_l
10: end procedure

3.3.2 Distributed site-specific effects computation

If we pretend that the fitting of the base learner b_{l×} is performed on the pooled data, we obtain

θ̂_{l×} = (Z_{l×}^T Z_{l×} + P_{l×})^(-1) Z_{l×}^T y
       = ((Z_{l,1}^T Z_{l,1} + λ_0 I_{d_l} + P_l)^(-1) Z_{l,1}^T y_1, ..., (Z_{l,S}^T Z_{l,S} + λ_0 I_{d_l} + P_l)^(-1) Z_{l,S}^T y_S)^T,   (4)

where (4) is due to the block structure described in (3) of Sect. 3.2. This shows that the fitting of the site-specific effects θ̂_{l×} can be split up into the fitting of individual parameters

θ̂_{l×,s} = (Z_{l,s}^T Z_{l,s} + λ_0 I_{d_l} + P_l)^(-1) Z_{l,s}^T y_s.   (5)

It is thus possible to compute site-specific effects at the respective site without the need to share any information with the host. The host, in turn, only requires the SSE of the respective base learner (see Sect. 3.3.3) to perform the next iteration of CWB. Hence, during the fitting process, the parameter estimates remain at their sites and are just updated if the site-specific base learner is selected. This again minimizes the amount of data communication between sites and host and speeds up the fitting process. After the fitting phase, the aggregated site-specific parameters are communicated once in a last communication step to obtain the final model. A possible alternative implementation that circumvents the need to handle site-specific heterogeneity separately is to apply the estimation scheme of the main effects (Algorithm 2). While this simplifies computation, it would increase communication costs and, hence, runtime.
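The following R sketch mimics Algorithm 2 (distFit) and the site-specific update in (5) within a single R session: the "sites" expose only the aggregates F_{l,s} = Z_{l,s}^T Z_{l,s} and u_{l,s} = Z_{l,s}^T y_s, the "host" combines them, and the site-specific coefficients are computed locally and never leave the site. Data, dimensions, and names are invented for illustration.

```r
# Sketch of Algorithm 2 (distFit) and the local site-specific solve of Eq. (5).
set.seed(1)
S <- 3; d_l <- 4
sites <- lapply(1:S, function(s) {
  n_s <- 50 + 10 * s
  Z   <- cbind(1, matrix(rnorm(n_s * (d_l - 1)), n_s, d_l - 1))
  y   <- Z %*% rnorm(d_l) + rnorm(n_s)
  list(Z = Z, y = y)
})
P_l <- diag(0.1, d_l)                 # penalty matrix P_l of the shared base learner

# [S] Each site computes and shares only aggregated matrices/vectors.
aggregates <- lapply(sites, function(d) {
  list(F_ls = crossprod(d$Z), u_ls = crossprod(d$Z, d$y))
})

# [H] The host sums the aggregates and solves for the shared effect.
F_l <- Reduce(`+`, lapply(aggregates, `[[`, "F_ls")) + P_l
u_l <- Reduce(`+`, lapply(aggregates, `[[`, "u_ls"))
theta_shared <- solve(F_l, u_l)

# [S] Site-specific effects (Eq. (5)) are solved locally and stay at the site.
lambda0 <- 1
theta_site <- lapply(sites, function(d) {
  solve(crossprod(d$Z) + lambda0 * diag(d_l) + P_l, crossprod(d$Z, d$y))
})

theta_shared
```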


3.3.3 Pseudo residual updates, SSE calculation, and base learner selection

The remaining challenges to run the distributed CWB algorithm are (1) the pseudo residual calculation (Algorithm 1 line 4), (2) the SSE calculation (Algorithm 1 line 7), and (3) the base learner selection (Algorithm 1 line 9).

Distributed pseudo residual updates The site-specific response vector y_s containing the values y^(i), i ∈ {1, ..., n_s}, is the basis of the pseudo residual calculation. We assume that every site s has access to all shared effects as well as the site-specific information of all site-specific base learners b_{l×} only containing the respective parameters θ̂_{l×,s}. Based on these base learners, it is thus possible to compute a site model f̂_s^[m] as a representative of f̂^[m] on every site s. The pseudo residual updates r̃_s^[m] per site are then based on f̂_s^[m] via r̃_s^[m](i) = −∇_f L(y^(i), f(x^(i)))|_{f = f̂_s^[m−1]}, i ∈ {1, ..., n_s}, using D_s. Most importantly, all remaining steps of the distributed CWB fitting procedure do not share the pseudo residuals r̃_s^[m] in order to avoid information leakage about y_s.

Distributed SSE calculation and base learner selection After fitting all base learners b_l ∈ B and b_{l×} ∈ B_× to r̃_s^[m], we obtain θ̂_l^[m], l = 1, ..., |B|, and θ̂_{l×}^[m], l× = 1_×, ..., |B|_×. Calculating the SSE distributively for the lth and l×th base learner b_l and b_{l×}, respectively, requires calculating 2S site-specific SSE values:

SSE_{l,s} = Σ_{i=1}^{n_s} (r̃_s^[m](i) − b_l(x_s^(i), θ̂_l^[m]))^2 = Σ_{i=1}^{n_s} (r̃_s^[m](i) − g_l(x_s^(i))^T θ̂_l^[m])^2,
SSE_{l×,s} = Σ_{i=1}^{n_s} (r̃_s^[m](i) − b_{l×}(x_s^(i), θ̂_{l×}^[m]))^2 = Σ_{i=1}^{n_s} (r̃_s^[m](i) − g_l(x_s^(i))^T θ̂_{l×,s}^[m])^2.

The site-specific SSE values are then sent to the host and aggregated to SSE_l = Σ_{s=1}^S SSE_{l,s}. If privacy constraints have been met in all previous calculations, sharing the individual SSE values is not critical and does not violate any privacy constraints, as the value is an aggregation of all n_s observations for all sites s.

Having gathered all SSE values at the host location, selecting the best base learner in the current iteration is done in the exact same manner as for the non-distributed CWB algorithm by selecting l^[m] = arg min_{l∈{1,...,|B|,1_×,...,|B|_×}} SSE_l. After the selection, the index l^[m] is shared with all sites to enable the update of the site-specific models f̂_s^[m]. If a shared effect is selected, the parameter vector θ̂_{l^[m]}^[m] is shared with all sites. Caution must be taken when the number of parameters of one base learner is equal to the number of observations, as this allows reverse-engineering private data. In the case of a site-specific effect selection, no parameter needs to be communicated, as the respective estimates are already located at each site.

3.4 Distributed CWB algorithm with site-specific effects

Assembling all pieces, our distributed CWB algorithm is summarized in Algorithm 3.

Algorithm 3 Distributed CWB Algorithm
The line prefixes [S] and [H] indicate whether the operation is conducted at the sites ([S]) or at the host ([H]).
Input: Sites with site data D_s, learning rate ν, number of boosting iterations M, loss function L, set of shared effects B and respective site-specific effects B_×
Output: Prediction model f̂
1: procedure distrCWB(ν, L, B, B_×)
2:   Initialization:
3:     [H] Initialize the shared model f̂_shared^[0](x) = arg min_{c∈R} R_emp(c)
4:     [S] Calculate Z_{l,s} and F_{l,s} = Z_{l,s}^T Z_{l,s}, ∀l ∈ {1, ..., |B|}, s ∈ {1, ..., S}
5:     [S] Set f̂_s^[0] = f̂_shared^[0]
6:   for m ∈ {1, ..., M} or while an early stopping criterion is not met do
7:     [S] Update pseudo residuals:
8:     [S] r̃_s^[m](i) = −∇_f L(y^(i), f(x^(i)))|_{f = f̂_s^[m−1]}, ∀i ∈ {1, ..., n_s}
9:     for l ∈ {1, ..., |B|} do
10:      [H] Calculate the shared effect: θ̂_l^[m] = distFit(Z_{l,1}, ..., Z_{l,S}, r̃_1^[m], ..., r̃_S^[m], P_l)
11:      [H] Communicate θ̂_l^[m] to the sites
12:      for s ∈ {1, ..., S} do
13:        [S] Fit the lth site-specific effect: θ̂_{l×,s}^[m] = (F_{l,s} + λ_0 I_{d_l} + P_l)^(-1) Z_{l,s}^T r̃_s^[m]
14:        [S] Calculate the SSE for the lth shared and site-specific effect:
15:        [S] SSE_{l,s} = Σ_{i=1}^{n_s} (r̃_s^[m](i) − g_l(x^(i))^T θ̂_l^[m])^2
16:        [S] SSE_{l×,s} = Σ_{i=1}^{n_s} (r̃_s^[m](i) − g_l(x^(i))^T θ̂_{l×,s}^[m])^2
17:        [S] Send SSE_{l,s} and SSE_{l×,s} to the host
18:      end for
19:      [H] Aggregate SSE values: SSE_l = Σ_{s=1}^S SSE_{l,s} and SSE_{l×} = Σ_{s=1}^S SSE_{l×,s}
20:    end for
21:    [H] Select the best base learner: l^[m] = arg min_{l∈{1,...,|B|,1_×,...,|B|_×}} SSE_l
22:    if b_{l^[m]} is a shared effect then
23:      [H] Update the model: f̂_shared^[m](x) = f̂_shared^[m−1](x) + ν b_{l^[m]}(x, θ̂_{l^[m]}^[m])
24:      [H] Upload the model update θ̂_{l^[m]}^[m] to the sites
25:    end if
26:    [S] Update the site model f̂_s^[m] via the parameter update θ̂_{l^[m]} = θ̂_{l^[m]} + ν θ̂_{l^[m]}^[m]
27:  end for
28:  [S] Communicate the site-specific effects θ̂_{1_×}, ..., θ̂_{|B|_×} to the host
29:  [H] Add the site-specific effects to the model of shared effects f̂_shared^[M] to obtain the full model f̂^[M]
30:  [H] return f̂ = f̂^[M]
31: end procedure
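A minimal R sketch of the per-site SSE computation and host-side selection used in lines 13-21 of Algorithm 3: shared and site-specific candidates are fitted to per-site pseudo residuals, each site reports only its SSE values, and the host aggregates and picks the arg-min. The toy data and one-dimensional base learners are invented for illustration.

```r
# Sketch of distributed SSE aggregation and base learner selection.
set.seed(1)
S <- 3
sites <- lapply(1:S, function(s) {
  n_s <- 80
  x   <- rnorm(n_s)
  r   <- (0.3 + 0.4 * s) * x + rnorm(n_s, sd = 0.2)   # heterogeneous signal in the pseudo residuals
  list(x = x, r = r)
})

# Shared effect: one coefficient from aggregated cross products (cf. distFit).
Fx <- sum(sapply(sites, function(d) sum(d$x^2)))
ux <- sum(sapply(sites, function(d) sum(d$x * d$r)))
theta_shared <- ux / Fx

# Site-specific effects: one coefficient per site, estimated locally.
theta_site <- sapply(sites, function(d) sum(d$x * d$r) / sum(d$x^2))

# [S] Per-site SSE values for the shared and the site-specific base learner.
sse_shared_s <- sapply(sites, function(d) sum((d$r - theta_shared * d$x)^2))
sse_site_s   <- mapply(function(d, th) sum((d$r - th * d$x)^2), sites, theta_site)

# [H] Aggregate over sites and select the base learner with the smallest total SSE.
sse_total <- c(shared = sum(sse_shared_s), site_specific = sum(sse_site_s))
names(which.min(sse_total))   # here the site-specific effect wins due to heterogeneity
```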


3.5 Communication costs

While the CWB iterations themselves can be performed in parallel on every site and do not slow down the process compared to a pooled calculation, it is worth discussing the communication costs of distrCWB. During the initialization, data is shared just once, while the fitting phase requires the communication of data in each iteration. Let d = max_l d_l be the maximum number of basis functions (or, alternatively, assume d basis functions for all base learners). The two main drivers of the communication costs are the number of boosting iterations M and the number of base learners |B|. Because of the iterative nature of CWB with a single loop over the boosting iterations, the communication costs (both for the host and each site) scale linearly with the number of boosting iterations M, i.e., O(M). For the analysis of communication costs in terms of the number of base learners, we distinguish between the initialization phase and the fitting phase.

Initialization As only the sites share F_{l,s} ∈ R^{d×d}, ∀l ∈ {1, ..., |B|}, the transmitted amount of values is d²|B| for each site and therefore scales linearly with |B|, i.e., O(|B|). The host does not communicate any values during the initialization.

Fitting In each iteration, every site shares its vector Z_{l,s}^T r̃_s^[m] ∈ R^d, ∀l ∈ {1, ..., |B|}. Over the course of M boosting iterations, each site therefore shares dM|B| values. Every site also communicates the SSE values, i.e., 2 values (index and SSE value) for every base learner and thus 2M|B| values for all iterations and base learners. In total, each site communicates M|B|(d + 2) values. The communication costs for all sites are therefore O(|B|). The host, in turn, communicates the estimated parameters θ̂^[m] ∈ R^d of the |B| shared effects. Hence, dM|B| values as well as the index of the best base learner in each iteration are transmitted. In total, the host therefore communicates dM|B| + M values to the sites, and the costs are therefore also O(|B|).

4 Application

We now showcase our algorithm on a heart disease data set that consists of patient data gathered all over the world. The data were collected at four different sites by the (1) Hungarian Institute of Cardiology, Budapest (Andras Janosi, M.D.), (2) University Hospital, Zurich, Switzerland (William Steinbrunn, M.D.), (3) University Hospital, Basel, Switzerland (Matthias Pfisterer, M.D.), and (4) V.A. Medical Center, Long Beach, and Cleveland Clinic Foundation (Robert Detrano, M.D., Ph.D.), and is thus suited for a multi-site distributed analysis. The individual data sets are freely available at https://archive.ics.uci.edu/ml/datasets/heart+disease (Dua and Graff 2017). For our analysis, we set the privacy level (cf. Sect. 2.4) to 5, which is a common default.

4.1 Data description

The raw data set contains 14 covariates, such as the chest pain type (cp), resting blood pressure (trestbps), maximum heart rate (thalach), sex, exercise-induced angina (exang), or ST depression (i.e., abnormal difference of the ST segment from the baseline on an electrocardiogram) induced by exercise relative to rest (oldpeak). A full list of covariates and their abbreviations is given on the data set's website. After removing non-informative (constant) covariates and columns with too many missing values at each site, we obtain n_cleveland = 303, n_hungarian = 292, n_switzerland = 116, and n_va = 140 observations and 8 covariates. A table containing the description of the abbreviations of these covariates is given in Table 1 in the Supplementary Material B.1. For our application, we assume that missing values are missing completely at random and that all data sets are exclusively located at their respective sites. The task is to determine important risk factors for heart disease. The target variable is therefore a binary outcome indicating the presence of heart disease or not.

4.2 Analysis and results

We follow the practices to set up CWB as mentioned in Sect. 2.3.2 and run the distributed CWB algorithm with a learning rate of 0.1 and a maximum number of 100,000 iterations. To determine an optimal stopping iteration for CWB, we use 20% of the data as validation data and set the patience to 5 iterations. In other words, the algorithm stops if no risk improvement on the validation data is observed in 5 consecutive iterations. For the numerical covariates, we use a P-spline with 10 cubic basis functions and second-order difference penalties. All base learners are penalized according to a global degree of freedom that we set to 2.2 (to obtain unbiased feature selection), while the random intercept is penalized according to 3 degrees of freedom (see the Supplementary Material B.2 for more details). Since we are modelling a binary response variable, the response function h is the inverse logit function logit^(-1)(f) = (1 + exp(−f))^(-1). The model for an observation of site s, conditional on its random effects γ, is given in the Supplementary Material B.3.

Results The algorithm stops after m_stop = 5578 iterations as the risk on the validation data set starts to increase (cf. Figure 1 in the Supplementary Material B.4) and selects the covariates oldpeak, cp, trestbps, age, sex, restecg, exang, and thalach. Out of these 5578 iterations, the distributed CWB algorithm selects a shared effect in 782 iterations and site-specific effects in 4796 iterations. This indicates that the data is rather heterogeneous and requires site-specific (random) effects. We want to emphasize that the given data is from an observational study and that the sole purpose of our analysis is to better understand the heterogeneity in the data. Hence, the estimated effects have a non-causal interpretation. To alleviate problems that come from such data, and to allow for the estimation of causal effects, one could use, e.g., propensity score matching (Rosenbaum and Rubin 1983) before applying our algorithm. From our application, we can, e.g., see that the data from Hungary could potentially be problematic in this respect. However, note that applying such measures would also have to be done in a privacy-preserving manner. Figure 2 (Left) shows traces of how and when the different additive terms (base learners) entered the model during the fitting process and illustrates the selection process of CWB.

Fig. 2 Left: Model trace showing how and when the four most selected additive terms entered the model. Right: Variable importance (cf. Au et al. 2019) of selected features in decreasing order

Fig. 3 Decomposition of the effect of oldpeak into the shared (left) and the site-specific effects (middle). The plot on the right-hand side shows the sum of shared and site-specific effects

Fig. 4 Comparison of the site-specific effects for oldpeak between the distributed (dsCWB) and pooled CWB approach (compboost) as well as estimates from mgcv

The estimated effect of the most important feature, oldpeak (cf. Fig. 2, Right), is further visualized in Fig. 3. Looking at the shared effect, we find a negative influence on the risk of heart disease when increasing ST depression (oldpeak). When accounting for site-specific deviations, the effect becomes more diverse, particularly for Hungary. In the Supplementary Material B.5 and B.6, we provide the partial effects for all features and showcase the conditional predictions of the fitted GAMM model for a given site.

Comparison of estimation approaches The previous example shows partial feature effects that exhibit shrinkage due to the early stopping of CWB's fitting routine. While this prevents overfitting and induces a sparse model, we can also run CWB for a very large number of iterations without early stopping to approximate the unregularized and hence unbiased maximum likelihood solution. We illustrate this in the following by training CWB and our distributed version for 100,000 iterations and comparing the partial effects to the ones of a classical mixed model-based estimation routine implemented in the R package mgcv (Wood 2017). Our R prototype took ≈ 3.5 h for 100,000 iterations and ≈ 700 s with early stopping after 5578 iterations. The corresponding computation on a local machine with compboost took ≈ 25 min and ≈ 85 s, respectively. We want to emphasize that the runtime of our algorithm strongly depends on the distributed system that controls the communication as well as the bandwidth. The estimated partial effects of our distributed CWB algorithm and the original CWB on pooled data show a perfect overlap (cf. Fig. 4). This again underpins the lossless property of the proposed algorithm. The site-specific effects on the pooled data are fitted by defining a row-wise Kronecker base learner for all features and the site as a categorical variable. The same approach is used to estimate a GAMM using mgcv fitted on the pooled data with tensor products between the main feature and the categorical site variable. A comparison of all partial feature effects is given in the Supplementary Material B.7, showing good alignment between the different methods. For the oldpeak effect shown in Fig. 4, we also see that the partial effects of the two CWB methods are very close to the mixed model-based estimation, with only smaller differences caused by a slightly different penalization strength of both approaches. The empirical risk is 0.4245 for our distributed CWB algorithm, 0.4245 for CWB on the pooled data, and 0.4441 for the GAMM on the pooled data.

5 Discussion

We proposed a novel algorithm for distributed, lossless, and privacy-preserving GAMM estimation to analyze horizontally partitioned data. To account for data heterogeneity between different sites, we introduced site-specific (smooth) random effects. Using CWB as the fitting engine allows estimation in high-dimensional settings and fosters variable as well as effect selection. This also includes a data-driven selection of shared and site-specific features, providing additional data insights. Owing to the flexibility of boosting and its base learners, our algorithm is easy to extend and can also account for interactions, functional regression settings (Brockhaus et al. 2020), or modeling survival tasks (Bender et al. 2020).

An open challenge for the practical use of our approach is its high communication costs. For larger numbers of iterations (in the 10 or 100 thousands), computing a distributed model can take several hours. One option to reduce the total runtime is to incorporate the accelerated optimization recently proposed in Schalk et al. (2023). Another driver that influences the runtime is the latency of the technical setup. Future improvements could reduce the number of communications, e.g., via multiple fitting rounds at the different sites before communicating the intermediate results.

A possible future extension of our approach is to account for both horizontally and vertically distributed data. Since the algorithm is performing component-wise (coordinate-wise) updates, the extension to vertically distributed data naturally falls into the scope of its fitting procedure.

123
31 Page 12 of 13 Statistics and Computing (2024) 34:31

ever, require a further advanced technical setup and the need References
to ensure consistency across sites.
As an alternative to CWB, a penalized likelihood approach Anjum, M.M., Mohammed, N., Li, W., et al.: Privacy preserving col-
laborative learning of generalized linear mixed model. J. Biomed.
like mgcv could be considered for distributed computing.
Inform. 127(104), 008 (2022)
Unlike CWB, which benefits from parallelized base learn- Au, Q., Schalk, D., Casalicchio, G., et al.: Component-wise boost-
ers, decomposing the entire design matrix for distributed ing of targets for multi-output prediction. arXiv preprint
computing with this approach is more intricate. The paral- arXiv:1904.03943 (2019)
Augustyn, D.R., Wyciślik, Ł, Mrozek, D.: Perspectives of using cloud
lelization strategy of Wood et al. (2017) could be adapted
computing in integrative analysis of multi-omics data. Brief.
by viewing cores as sites and the main process as the host. Funct. Genom. 20(4), 198–206 (2021). https://doi.org/10.1093/
However, ensuring privacy for this approach would require bfgp/elab007
additional attention. A notable obstacle for smoothing param- Bazeley, P.: Integrative analysis strategies for mixed data sources. Am.
Behav. Sci. 56(6), 814–828 (2012)
eter estimation is the requirement of the Hessian matrix Wood
Bender, A., Rügamer, D., Scheipl, F., et al.: A general machine learning
et al. (2016). Since the Hessian matrix cannot be directly framework for survival analysis. In: Joint European Conference
computed from distributed data, methods like subsampling on Machine Learning and Knowledge Discovery in Databases.
(Umlauf et al. 2023) or more advanced techniques would Springer, pp. 158–173 (2020)
Boyd, K., Lantz, E., Page, D.: Differential privacy for classifier eval-
be necessary to achieve unbiased estimates and convergence uation. In: Proceedings of the 8th ACM Workshop on Artificial
of the whole process. In general, unlike CWB which fits Intelligence and Security, pp. 15–23 (2015)
pseudo-residuals using the L 2 -loss and estimates smoothness Brockhaus, S., Rügamer, D., Greven, S.: Boosting functional regression
implicitly through iterative gradient updates, penalized like- models with FDboost. J. Stat. Softw. 94(10), 1–50 (2020)
Brumback, B.A., Ruppert, D., Wand, M.P.: Variable selection and
lihood approaches such as the one implemented in mgcv are function estimation in additive nonparametric regression using a
less straightforward to distribute, and a privacy-preserving data-based prior: Comment. J. Am. Stat. Assoc. 94(447), 794–797
lossless computation would involve specialized procedures. (1999)
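To illustrate this mechanistic difference, the following schematic sketch shows a single CWB iteration for a binary outcome. It is an illustration only, not the dsCWB implementation; all object names (Z, K, lambda, coefs, f_hat, nu) are hypothetical placeholders for the base-learner design matrices, penalty matrices, penalty strengths, current coefficients, current additive predictor, and learning rate.

```r
# Schematic sketch of a single CWB iteration for a binary outcome
# (illustration only; this is not the dsCWB implementation).
# Assumed placeholder objects:
#   Z      - list of base-learner design matrices (one per base learner)
#   K      - list of matching penalty matrices
#   lambda - numeric vector of penalty strengths
#   coefs  - list of current coefficient vectors (initialized at zero)
#   f_hat  - current additive predictor (numeric vector)
#   y      - binary response in {0, 1}
#   nu     - learning rate, e.g. 0.1
cwb_iteration <- function(y, f_hat, Z, K, lambda, coefs, nu = 0.1) {
  p <- 1 / (1 + exp(-f_hat))                 # probabilities under the logit link
  u <- y - p                                 # pseudo-residuals (negative gradient)
  sse  <- numeric(length(Z))
  beta <- vector("list", length(Z))
  for (j in seq_along(Z)) {
    # Penalized least-squares (L2) fit of base learner j to the pseudo-residuals
    A         <- crossprod(Z[[j]]) + lambda[j] * K[[j]]
    beta[[j]] <- drop(solve(A, crossprod(Z[[j]], u)))
    sse[j]    <- sum((u - Z[[j]] %*% beta[[j]])^2)
  }
  j_star <- which.min(sse)                   # component-wise selection of best learner
  coefs[[j_star]] <- coefs[[j_star]] + nu * beta[[j_star]]
  f_hat <- f_hat + nu * drop(Z[[j_star]] %*% beta[[j_star]])
  list(coefs = coefs, f_hat = f_hat, selected = j_star)
}
```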
Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s11222-023-10323-2.

Author Contributions DS developed the idea, its theoretical details and its implementation in R. He further conducted all experiments and practical applications. The manuscript was mainly written by DS. The connection to GAMMs was worked out by DR, who also wrote the corresponding text passages. BB and DR helped revise and finalize the manuscript.

Funding Open Access funding enabled and organized by Projekt DEAL.

Declarations

Conflict of interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

Anjum, M.M., Mohammed, N., Li, W., et al.: Privacy preserving collaborative learning of generalized linear mixed model. J. Biomed. Inform. 127(104), 008 (2022)
Au, Q., Schalk, D., Casalicchio, G., et al.: Component-wise boosting of targets for multi-output prediction. arXiv preprint arXiv:1904.03943 (2019)
Augustyn, D.R., Wyciślik, Ł., Mrozek, D.: Perspectives of using cloud computing in integrative analysis of multi-omics data. Brief. Funct. Genom. 20(4), 198–206 (2021). https://doi.org/10.1093/bfgp/elab007
Bazeley, P.: Integrative analysis strategies for mixed data sources. Am. Behav. Sci. 56(6), 814–828 (2012)
Bender, A., Rügamer, D., Scheipl, F., et al.: A general machine learning framework for survival analysis. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, pp. 158–173 (2020)
Boyd, K., Lantz, E., Page, D.: Differential privacy for classifier evaluation. In: Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security, pp. 15–23 (2015)
Brockhaus, S., Rügamer, D., Greven, S.: Boosting functional regression models with FDboost. J. Stat. Softw. 94(10), 1–50 (2020)
Brumback, B.A., Ruppert, D., Wand, M.P.: Variable selection and function estimation in additive nonparametric regression using a data-based prior: Comment. J. Am. Stat. Assoc. 94(447), 794–797 (1999)
Bühlmann, P., Yu, B.: Boosting with the L2 loss: regression and classification. J. Am. Stat. Assoc. 98(462), 324–339 (2003)
Bühlmann, P., Hothorn, T., et al.: Boosting algorithms: regularization, prediction and model fitting. Stat. Sci. 22(4), 477–505 (2007)
Chen, Y.R., Rezapour, A., Tzeng, W.G.: Privacy-preserving ridge regression on distributed data. Inf. Sci. 451, 34–49 (2018)
Curran, P.J., Hussong, A.M.: Integrative data analysis: the simultaneous analysis of multiple data sets. Psychol. Methods 14(2), 81 (2009)
Dua, D., Graff, C.: UCI machine learning repository. http://archive.ics.uci.edu/ml (2017)
Dwork, C.: Differential privacy. In: International Colloquium on Automata, Languages, and Programming. Springer, pp. 1–12 (2006)
Eilers, P.H., Marx, B.D.: Flexible smoothing with B-splines and penalties. Stat. Sci. 11, 89–102 (1996)
Gambs, S., Kégl, B., Aïmeur, E.: Privacy-preserving boosting. Data Min. Knowl. Disc. 14(1), 131–170 (2007)
Gaye, A., Marcon, Y., Isaeva, J., et al.: Datashield: taking the analysis to the data, not the data to the analysis. Int. J. Epidemiol. 43(6), 1929–1944 (2014)
Hoerl, A.E., Kennard, R.W.: Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12(1), 55–67 (1970)
Hofner, B., Hothorn, T., Kneib, T., et al.: A framework for unbiased model selection based on boosting. J. Comput. Graph. Stat. 20(4), 956–971 (2011)
Jones, E.M., Sheehan, N.A., Gaye, A., et al.: Combined analysis of correlated data when data cannot be pooled. Stat 2(1), 72–85 (2013)
Karr, A.F., Lin, X., Sanil, A.P., et al.: Secure regression on distributed databases. J. Comput. Graph. Stat. 14(2), 263–279 (2005)
Lazarevic, A., Obradovic, Z.: The distributed boosting algorithm. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 311–316 (2001)
Li, J., Kuang, X., Lin, S., et al.: Privacy preservation for machine learning training and classification based on homomorphic encryption schemes. Inf. Sci. 526, 166–179 (2020a)


Li, Q., Wen, Z., He, B.: Practical federated gradient boosting decision trees. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 4642–4649 (2020b)
Liew, B.X., Rügamer, D., Abichandani, D., et al.: Classifying individuals with and without patellofemoral pain syndrome using ground force profiles—development of a method using functional data boosting. Gait Posture 80, 90–95 (2020)
Litwin, W., Mark, L., Roussopoulos, N.: Interoperability of multiple autonomous databases. ACM Comput. Surv. (CSUR) 22(3), 267–293 (1990)
Lu, C.L., Wang, S., Ji, Z., et al.: Webdisco: a web service for distributed cox model learning without patient-level data sharing. J. Am. Med. Inform. Assoc. 22(6), 1212–1219 (2015)
Luo, C., Islam, M., Sheils, N.E., et al.: DLMM as a lossless one-shot algorithm for collaborative multi-site distributed linear mixed models. Nat. Commun. 13(1), 1–10 (2022)
McCullagh, P., Nelder, J.A.: Generalized Linear Models, 2nd edn. Routledge, Milton Park (1989)
McMahan, B., Moore, E., Ramage, D., et al.: Communication-efficient learning of deep networks from decentralized data. In: Artificial Intelligence and Statistics, PMLR, pp. 1273–1282 (2017)
Mirza, B., Wang, W., Wang, J., et al.: Machine learning and integrative analysis of biomedical big data. Genes 10(2), 87 (2019)
Mohassel, P., Zhang, Y.: SecureML: a system for scalable privacy-preserving machine learning. In: 2017 IEEE Symposium on Security and Privacy (SP), IEEE, pp. 19–38 (2017)
Naehrig, M., Lauter, K., Vaikuntanathan, V.: Can homomorphic encryption be practical? In: Proceedings of the 3rd ACM Workshop on Cloud Computing Security Workshop, pp. 113–124 (2011)
Paillier, P.: Public-key cryptosystems based on composite degree residuosity classes. In: International Conference on the Theory and Applications of Cryptographic Techniques. Springer, pp. 223–238 (1999)
R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/ (2021)
Rosenbaum, P.R., Rubin, D.B.: The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), 41–55 (1983)
Rügamer, D., Brockhaus, S., Gentsch, K., et al.: Boosting factor-specific functional historical models for the detection of synchronization in bioelectrical signals. J. R. Stat. Soc.: Ser. C (Appl. Stat.) 67(3), 621–642 (2018). https://doi.org/10.1111/rssc.12241
Saintigny, P., Zhang, L., Fan, Y.H., et al.: Gene expression profiling predicts the development of oral cancer. Cancer Prev. Res. 4(2), 218–229 (2011)
Samarati, P., Sweeney, L.: Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical Report, http://www.csl.sri.com/papers/sritr-98-04/ (1998)
Schalk, D., Hoffmann, V.S., Bischl, B., et al.: Distributed non-disclosive validation of predictive models by a modified ROC-GLM. arXiv preprint arXiv:2203.10828 (2022)
Schalk, D., Bischl, B., Rügamer, D.: Accelerated component-wise gradient boosting using efficient data representation and momentum-based optimization. J. Comput. Graph. Stat. 32(2), 631–641 (2023). https://doi.org/10.1080/10618600.2022.2116446
Schmid, M., Hothorn, T.: Boosting additive models using component-wise P-splines. Comput. Stat. Data Anal. 53(2), 298–311 (2008)
Sweeney, L.: k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(05), 557–570 (2002)
Umlauf, N., Seiler, J., Wetscher, M., et al.: Scalable estimation for structured additive distributional regression. arXiv preprint arXiv:2301.05593 (2023)
Ünal, A.B., Pfeifer, N., Akgün, M.: ppAURORA: privacy preserving area under receiver operating characteristic and precision-recall curves with secure 3-party computation. arXiv 2102 (2021)
Wood, S.N.: Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC, Boca Raton (2017)
Wood, S.N., Pya, N., Säfken, B.: Smoothing parameter and model selection for general smooth models. J. Am. Stat. Assoc. 111(516), 1548–1563 (2016). https://doi.org/10.1080/01621459.2016.1180986
Wood, S.N., Li, Z., Shaddick, G., et al.: Generalized additive models for gigadata: modeling the U.K. black smoke network daily data. J. Am. Stat. Assoc. 112(519), 1199–1210 (2017). https://doi.org/10.1080/01621459.2016.1195744
Wu, C.J., Hamada, M.S.: Experiments: Planning, Analysis, and Optimization. Wiley, Hoboken (2011)
Wu, Y., Jiang, X., Kim, J., et al.: Grid binary logistic regression (GLORE): building shared models without sharing data. J. Am. Med. Inform. Assoc. 19(5), 758–764 (2012)
Yan, Z., Zachrison, K.S., Schwamm, L.H., et al.: Fed-GLMM: a privacy-preserving and computation-efficient federated algorithm for generalized linear mixed models to analyze correlated electronic health records data. medRxiv (2022)
Zhu, R., Jiang, C., Wang, X., et al.: Privacy-preserving construction of generalized linear mixed model for biomedical computation. Bioinformatics 36(Supplement–1), i128–i135 (2020). https://doi.org/10.1093/bioinformatics/btaa478

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

