
Shrinkage Priors for Bayesian Penalized Regression

Sara van Erp¹, Daniel L. Oberski², and Joris Mulder¹

¹Tilburg University
²Utrecht University

This manuscript has been published in the Journal of Mathematical Psychology.


Please cite the published version:
Van Erp, S., Oberski, D. L., & Mulder, J. (2019). Shrinkage priors for Bayesian
penalized regression. Journal of Mathematical Psychology, 89, 31–50.
doi:10.1016/j.jmp.2018.12.004
Abstract

In linear regression problems with many predictors, penalized regression techniques are often used to guard against overfitting and to select variables relevant for predicting an outcome variable. Recently, Bayesian penalization
is becoming increasingly popular in which the prior distribution performs a
function similar to that of the penalty term in classical penalization. Specif-
ically, the so-called shrinkage priors in Bayesian penalization aim to shrink
small effects to zero while maintaining true large effects. Compared to clas-
sical penalization techniques, Bayesian penalization techniques perform sim-
ilarly or sometimes even better, and they offer additional advantages such as
readily available uncertainty estimates, automatic estimation of the penalty
parameter, and more flexibility in terms of penalties that can be considered.
However, many different shrinkage priors exist and the available, often quite
technical, literature primarily focuses on presenting one shrinkage prior and
often provides comparisons with only one or two other shrinkage priors. This
can make it difficult for researchers to navigate through the many prior op-
tions and choose a shrinkage prior for the problem at hand. Therefore, the
aim of this paper is to provide a comprehensive overview of the literature
on Bayesian penalization. We provide a theoretical and conceptual compari-
son of nine different shrinkage priors and parametrize the priors, if possible,
in terms of scale mixture of normal distributions to facilitate comparisons.
We illustrate different characteristics and behaviors of the shrinkage priors
and compare their performance in terms of prediction and variable selec-
tion in a simulation study. Additionally, we provide two empirical examples
to illustrate the application of Bayesian penalization. Finally, an R pack-
age bayesreg is available online (https://github.com/sara-vanerp/bayesreg)
which allows researchers to perform Bayesian penalized regression with novel
shrinkage priors in an easy manner.

Keywords: Bayesian, Shrinkage Priors, Penalization, Empirical Bayes, Regression



1 Introduction
Regression analysis is one of the main statistical techniques often used in
the field of psychology to determine the effect of a set of predictors on an
outcome variable. The number of predictors is often large, especially in the
current “Age of Big Data”. For example, the Kavli HUMAN project (Az-
mak et al., 2015) aims to collect longitudinal data on all aspects of human
life for 10,000 individuals. Measurements include psychological assessments
(e.g., personality, IQ), health assessments (e.g., genome sequencing, brain
activity scanning), social network assessment, and variables related to edu-
cation, employment, and financial status, resulting in an extremely large set
of variables. Furthermore, personal tracking devices allow the collection of
large amounts of data on various topics, including for example mood, in a
longitudinal manner (Fawcett, 2015). The problem with regular regression
techniques such as ordinary least squares (OLS) is that they quickly lead to
overfitting as the ratio of predictor variables to observations increases (see
for example, McNeish, 2015, for an overview of the problems with OLS).
Penalized regression is a statistical technique widely used to guard against
overfitting in the case of many predictors. Penalized regression techniques
have the ability to select variables out of a large set of variables that are rele-
vant for predicting some outcome. Therefore, a popular setting for penalized
regression is in high-dimensional data, where the number of predictors p is
larger than the sample size n. Furthermore, in settings where the number
of predictors p is smaller than the sample size n (but still relatively large),
penalized regression can offer advantages in terms of avoiding overfitting and
achieving model parsimony compared to traditional variable selection meth-
ods such as null-hypothesis testing or stepwise selection methods (Derksen
and Keselman, 1992; Tibshirani, 1996). The central idea of penalized regres-
sion approaches is to add a penalty term to the minimization of the sum of
squared residuals, with the goal of shrinking small coefficients towards zero
while leaving large coefficients large, i.e.,

\[
\underset{\beta_0,\ \boldsymbol{\beta}}{\text{minimize}} \left\{ \frac{1}{2n} \|\mathbf{y} - \beta_0 \mathbf{1} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda_c \|\boldsymbol{\beta}\|_q \right\}, \qquad (1)
\]
\[
\text{where } \|\boldsymbol{\beta}\|_q = \left( \sum_{j=1}^{p} |\beta_j|^q \right)^{1/q},
\]

where y = (y1, . . . , yn)′ is an n-dimensional vector containing the observations on the outcome variable, β0 reflects the intercept, 1 is an n-dimensional vector of ones, X is an (n × p) matrix of the observed scores on the p predictor variables, and β = (β1, . . . , βp)′ is a p-dimensional parameter vector of
regression coefficients. λc reflects the penalty parameter, with large values
resulting in more shrinkage towards zero while λc = 0 leads to the ordinary
least squares solution. The choice of q determines the type of penalty in-
duced, for example, q = 1 results in the well-known least absolute shrinkage
and selection operator (lasso; Tibshirani, 1996) solution and q = 2 results
in the ridge solution (Hoerl and Kennard, 1970). We refer to Hastie et al.
(2015) for a comprehensive introduction and overview of various penalized
regression methods in a frequentist framework.
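As a point of reference for the classical approach, the following minimal sketch shows how the penalty parameter is typically chosen by 10-fold cross-validation with the R package glmnet (also used for comparison later in this paper); X and y denote a hypothetical predictor matrix and outcome vector.

    library(glmnet)

    # Classical penalized regression with 10-fold cross-validation for lambda;
    # alpha = 1 gives the lasso (q = 1), alpha = 0 gives the ridge (q = 2).
    cv_lasso <- cv.glmnet(X, y, alpha = 1, nfolds = 10)
    cv_ridge <- cv.glmnet(X, y, alpha = 0, nfolds = 10)

    coef(cv_lasso, s = "lambda.min")  # coefficients at the CV-selected penalty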
It is well known that many solutions to the penalized minimization prob-
lem in Equation (1) can also be obtained in the Bayesian framework by using
a specific prior combined with the posterior mode estimate; these Bayesian solutions have been shown to perform similarly to or better than their classical counterparts (Hans, 2009; Kyung et al., 2010; Li and Lin, 2010).[1] Adopting a Bayesian perspec-
tive on penalized regression offers several advantages. First, penalization fits
naturally in a Bayesian framework since a prior distribution is needed any-
way and shrinkage towards zero can be straightforwardly achieved by choos-
ing a specific parametric form for the prior. Second, parameter uncertainty
and standard errors follow naturally from the posterior standard deviations.
As shown by Kyung et al. (2010) classical penalized regression procedures
can result in estimated standard errors that suffer from multiple problems,
such as variances estimated to be 0 (in the case of sandwich estimates), and
unstable or poorly performing variance estimates (in the case of bootstrap
estimates). Third, with Bayesian penalization it is possible to estimate the
penalty parameter(s) λ simultaneously with the model parameters in a sin-
gle step. This is especially advantageous when there are multiple penalty
parameters (e.g., in the elastic net; Zou and Hastie, 2005), since sequential
cross-validation procedures to determine multiple penalty parameters induce
too much shrinkage (i.e., the double shrinkage problem; see e.g., Zou and
Hastie, 2005). Fourth, Bayesian penalization relies on Markov Chain Monte
Carlo (MCMC) sampling rather than optimization, which provides more flex-
ibility in the sense that priors that would correspond to non-convex penalties
(i.e., q < 1 in (1)) are easier to implement. Non-convex penalties would result
in multiple modes, making them difficult to implement in an optimization
framework. The cost of the flexibility of MCMC, however, is that it requires
more computation time compared to standard optimization procedures. Fi-
nally, Bayesian estimates have an intuitive interpretation. For example, a 95% Bayesian credibility interval can simply be interpreted as the interval in which the true value lies with 95% probability (e.g., Berger, 2006).

[1] Note that from a Bayesian perspective, however, there is no theoretical justification for reporting the posterior mode estimate (Tibshirani, 2011).
Due to these advantages, Bayesian penalization is becoming increasingly
popular in the literature (see e.g., Alhamzawi et al., 2012; Andersen et al.,
2017; Armagan et al., 2013; Bae and Mallick, 2004; Bhadra et al., 2016;
Bhattacharya et al., 2012; Bornn et al., 2010; Caron and Doucet, 2008; Car-
valho et al., 2010; Feng et al., 2017; Griffin and Brown, 2017; Hans, 2009;
Ishwaran and Rao, 2005; Lu et al., 2016; Peltola et al., 2014; Polson and
Scott, 2011; Roy and Chakraborty, 2016; Zhao et al., 2016). An active area
of research investigates theoretical properties of priors for Bayesian penal-
ization, such as the Bayesian lasso prior (for a recent overview, see Bhadra
et al., 2017). In addition to the Bayesian counterparts of classical penal-
ized regression solutions, many other priors have been proposed that have
desirable properties in terms of prediction and variable selection. However,
the extensive (and often technical) literature and subtle differences between
the priors can make it difficult for researchers to navigate the options and
make sensible choices for the problem at hand. Therefore, the aim of this
paper is to provide a comprehensive overview of the priors that have been
proposed for penalization in (sparse) regression. We use the term shrinkage
priors to emphasize that these priors aim to shrink small effects towards
zero. We place the shrinkage priors in a general framework of scale mixtures
of normal distributions to emphasize the similarities and differences between
the priors. By providing insight in the characteristics and behaviors of the
priors, we aid researchers in choosing a prior for their specific problem. Ad-
ditionally, we present a straightforward method to obtain empirical Bayes
(EB) priors for Bayesian penalization. We conduct a simulation study to
compare the performance of the priors in terms of prediction and variable
selection in a linear regression model, and provide two empirical examples to
further illustrate the Bayesian penalization methods. Finally, the shrinkage
priors have been implemented in the R package bayesreg, available from
https://github.com/sara-vanerp/bayesreg, to allow general utilization.
The remainder of this paper is organized as follows: Section 2 introduces
Bayesian penalized regression. A theoretical overview of the different shrink-
age priors can be found in Section 3. Further insight into the priors is provided through illustrations in Section 4, and the priors are compared in a
simulation study in Section 5. Section 6 presents the empirical applications,
followed by a discussion in Section 7.

2 Bayesian penalized regression


The likelihood for the linear regression model is given by:

\[
y_i \mid \beta_0, \mathbf{x}_i, \boldsymbol{\beta}, \sigma^2 \sim \text{Normal}\!\left( \beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j,\ \sigma^2 \right), \qquad (2)
\]

where β0 represents the intercept, βj the regression coefficient for predictor j, and σ² is the residual variance.
In a Bayesian analysis, a prior distribution is specified for each parameter in the model, e.g., p(β0, β, σ², λ) = p(β0) p(β | σ², λ) p(σ²) p(λ). Note that the
prior for β is conditioned on the residual variance σ 2 , as well as on λ. The
conditioning on σ 2 is necessary in certain cases to obtain a unimodal posterior
(Park and Casella, 2008). In Bayesian penalized regression, λ is a parameter
in the prior (i.e., a hyperparameter) but has a similar role as the penalty
parameter in classical penalized regression. Since this penalty parameter λ
is used to penalize the regression coefficient, it only appears in the prior for
β . Throughout this paper we will focus on priors for the regression coeffi-
cients β1 , . . . , βj and we will assume noninformative improper priors for the
nuisance parameters, specifically, p(β0 ) = 1 and a uniform prior on log(σ 2 ),
i.e., p(σ 2 ) = σ −2 . Please note that these priors are chosen as noninformative
choices for the linear regression model considered in this paper. However,
other choices (including informative priors when prior information is avail-
able) are possible and might be preferred in other applications. We refer the
reader to van Erp et al. (2018) for general recommendations on specifying
prior distributions. We generally assume that the priors for the regression
coefficients are independent, unless stated otherwise.
The prior distribution is then multiplied by the likelihood of the data to
obtain the posterior distribution, i.e.,
\[
p(\beta_0, \boldsymbol{\beta}, \sigma^2, \lambda \mid \mathbf{y}, \mathbf{X}) \propto p(\mathbf{y} \mid \mathbf{X}, \beta_0, \boldsymbol{\beta}, \sigma^2)\, p(\beta_0)\, p(\boldsymbol{\beta} \mid \sigma^2, \lambda)\, p(\sigma^2)\, p(\lambda). \qquad (3)
\]
Here, the normalizing constant is not included such that the right-hand side
is proportional to the posterior. The only difference with the unpenalized
problem (e.g., Bayesian linear regression) is the introduction of the penalty
parameter λ. As a result of the shrinkage prior, the posterior in (3) is generally more concentrated around zero, or “shrunk towards” zero, in comparison to the likelihood of the model.
An important choice is how to specify the penalty parameter λ. There
are different possibilities for this.
1. Full Bayes. Treat λ as an unknown model parameter for which a prior
needs to be specified. Typically, a vague prior p(λ) is specified for λ. Due to its similarity with multilevel (or hierarchical) modeling, full Bayes (FB) is also known as “hierarchical Bayes” (see e.g., Wolpert and Strauss, 1996). This results in a fully Bayesian solution that incorporates the uncertainty about λ. The advantage of this approach is that
the model can be estimated in one step. Throughout this paper, we
will consider the half-Cauchy prior on λ, i.e., λ ∼ half-Cauchy(0, 1),
which is a robust alternative and a popular prior distribution in the
Bayesian literature (see e.g., Gelman, 2006; Mulder and Pericchi, 2018;
Polson and Scott, 2012).

2. Empirical Bayes. Empirical Bayes (EB) methods, also known as the


“evidence” procedure (see e.g., Wolpert and Strauss, 1996), first es-
timate the penalty parameter λ from the data and then plug in this
EB estimate for λ in the model (see van de Wiel et al., 2017, for an
overview of EB methodology in high-dimensional data). The resulting
prior is called an EB prior. Since an EB estimate is used for λ, the EB
approach does not require the specification of a prior p(λ) as in the FB
approach. Since the exact choice of this prior can sometimes have a
serious effect on the Bayesian estimates (Roy and Chakraborty, 2016),
the EB approach would avoid sensitivity of the results to the exact
choice of the prior p(λ), while keeping the advantages of the Bayesian
approach.
Empirical Bayes is a two-step approach: first, the empirical Bayes
choice for λ needs to be determined; second, the model is fitted us-
ing the EB prior. In order to obtain an EB estimate for λ, we need to
find the solution that maximizes the marginal likelihood [2], i.e.,

\[
\lambda_{EB} = \underset{\lambda}{\arg\max}\; p(\mathbf{y} \mid \lambda). \qquad (4)
\]

To obtain λEB, first note that the marginal likelihood is the product of the likelihood and prior integrated over the model parameters, i.e.,

\[
p(\mathbf{y} \mid \lambda) = \iiint p(\mathbf{y} \mid \mathbf{X}, \beta_0, \boldsymbol{\beta}, \sigma^2)\, p(\beta_0)\, p(\boldsymbol{\beta} \mid \sigma^2, \lambda)\, p(\sigma^2)\; d\beta_0\, d\boldsymbol{\beta}\, d\sigma^2. \qquad (5)
\]

Instead of directly optimizing, we achieve (4) by sampling from the posterior with a noninformative prior for λ.[3] The EB estimate λEB is the mode of the marginal posterior for λ, i.e., p(λ | y). This corresponds to the maximum of the marginal likelihood p(y | λ) because of the noninformative prior for λ.

[2] The marginal likelihood quantifies the probability of observing the data given the model. Therefore, plugging in the EB estimate for λ will result in a prior that predicts the observed data best.

[3] Specifically, we use λ ∼ half-Cauchy(0, 10000) to ensure a stable MCMC sampler.

3. Cross-validation. For cross-validation (CV), the data is split into a


training, validation, and test set. The goal is to find a value for λ
which results in a model that is accurate in predicting new data, i.e., a
generalizable model that captures the signal in the data, but does not
overfit (Hastie et al., 2015). To find this value for λ, a range of values
is considered, using the training data y train to fit all models with the
different λ values. Next, each resulting model is used to predict the
responses in the validation set y val . The value for λ that minimizes
some loss function is selected, i.e.,

\[
\lambda_{CV} = \underset{\lambda}{\arg\min}\; L(\mathbf{y}_{\text{train}}, \mathbf{y}_{\text{val}}). \qquad (6)
\]

When the loss function is the negative log-likelihood, this is equivalent to

\[
\lambda_{CV} = \underset{\lambda}{\arg\max}\; p(\mathbf{y}_{\text{val}} \mid \mathbf{y}_{\text{train}}, \lambda). \qquad (7)
\]

Finally, λCV is used to fit the model on the test set. Generally, the
prediction mean squared error (PMSE) is used to determine λCV , which
corresponds to a quadratic loss function.
In practice k-fold cross-validation is often used. k-fold cross-validation
is a specific implementation of cross-validation in which the data is split into only a training and a test set. The training set is split into K
parts (usually K = 5 or K = 10) and the range of λ values is applied
K times on K − 1 parts of the training set, each time with a different
part as validation set. The K estimates of the PMSE are then averaged
and a standard error is computed.
Frequentist penalization approaches often rely on cross-validation. In the
Bayesian literature, full and empirical Bayes are often employed, although
cross-validation is also possible in a Bayesian approach (see for example the
loo package in R; Vehtari et al., 2018). The intuition behind empirical Bayes
and cross-validation is similar: empirical Bayes aims to choose the value for
λ that is best in predicting the full data set, while cross-validation aims to
choose the value for λ that is best in predicting the validation set given a

training set. A possible disadvantage of empirical Bayes and cross-validation


is that the (marginal) likelihood can be flat or multimodal when there are
multiple penalty parameters (van de Wiel et al., 2017).[4]
Throughout this paper, we will focus on the full and empirical Bayes ap-
proach to determine λ, and only consider cross-validation for the frequentist
penalization methods we will compare the priors to.
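As a minimal illustration of the empirical Bayes approach described above, the sketch below approximates λEB as the mode of the marginal posterior of λ, assuming `fit` is a hypothetical stanfit object in which λ was given a vague prior; this sketches the general idea and is not the authors' implementation.

    library(rstan)

    # Empirical Bayes estimate of lambda: the mode of the marginal posterior
    # p(lambda | y), approximated by a kernel density estimate of the draws.
    posterior_mode <- function(draws) {
      d <- density(draws, from = 0)   # lambda is restricted to be positive
      d$x[which.max(d$y)]
    }
    # lambda_EB <- posterior_mode(extract(fit, pars = "lambda")$lambda)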

3 Overview shrinkage priors


In this section we will give a general overview of shrinkage priors that have
been proposed in the literature. Given the extensive number of shrinkage
priors that have been investigated, we will limit the overview to priors that
are related to well-known classical penalization methods and shrinkage priors
that are popular in the Bayesian literature. Given that most shrinkage priors
fall into these categories, the resulting overview, while not exhaustive, is
intended to be comprehensive and will help researchers to navigate through
this literature. In total, we will discuss nine different shrinkage priors.
Many continuous, unimodal, and symmetric distributions can be parametrized
as a scale mixture of normals meaning that the distribution is rewritten as a
normal distribution (i.e., Normal(µ, σ 2 )) where the scale parameter is given a
mixing density h(σ 2 ) (see e.g., West, 1987). Where possible, we will present
the different priors in a common framework by providing the scale mixture
of normals formulation for each prior. Using this formulation, the theoretical
differences and similarities between the priors become more clear and, ad-
ditionally, the scale mixture of normals formulation can be computationally
more efficient.
We will now describe each prior in turn. The densities for several of the
shrinkage priors are presented in Table 1 and plotted in Figure 1. We will
consider a full and empirical Bayes approach to obtain the penalty parameter
λ (see Section 2) for all shrinkage priors, unless stated otherwise. For the
full Bayesian approach, we will consider standard half-Cauchy priors for the
penalty parameters as a robust default prior choice. We have included this
choice for the prior on λ in the descriptions below, but note that other choices
are possible as well.
[4] In the initial empirical Bayes approach we used a uniform prior for λ and this problem became evident through non-convergence of the sampler or extreme estimates for λEB. The problem was solved by using the half-Cauchy prior instead.

Shrinkage prior | Conditional prior density p(\beta_j \mid \lambda, \ldots) | Reference
Ridge | p(\beta_j \mid \sigma^2, \lambda) = \sqrt{\frac{\lambda}{2\pi\sigma^2}} \exp\left\{-\frac{\lambda\beta_j^2}{2\sigma^2}\right\} | Hsiang (1975)
Local Student's t | non-standardized Student's t with \nu degrees of freedom, location 0, and scale parameter \sigma^2/\lambda (see Equation 10) | Griffin and Brown (2005); Meuwissen et al. (2001)
Lasso | p(\beta_j \mid \sigma^2, \lambda) = \frac{\lambda}{2\sqrt{\sigma^2}} \exp\left\{-\frac{\lambda|\beta_j|}{\sqrt{\sigma^2}}\right\} | Park and Casella (2008)
Elastic net | p(\beta_j \mid \sigma^2, \lambda_1, \lambda_2) = C \exp\left\{-\frac{1}{2\sigma^2}\left(\lambda_1|\beta_j| + \lambda_2\beta_j^2\right)\right\} | Li and Lin (2010)
Group lasso | p(\beta_j \mid \sigma, \lambda) = C \exp\left\{-\frac{\lambda}{\sigma} \sum_{g=1}^{G} \|\boldsymbol{\beta}_g\|\right\} | Kyung et al. (2010)
Hyperlasso | p(\beta_j \mid \lambda) = \frac{\lambda}{(2\pi)^{1/2}} \left[1 - \frac{\lambda|\beta_j|\{1 - \Phi(\lambda|\beta_j|)\}}{\phi(\lambda|\beta_j|)}\right] | Griffin and Brown (2011)
Horseshoe | not analytically tractable | Carvalho et al. (2010)
Discrete normal mixture | p(\beta_j \mid \gamma_j, \phi_j^2) = (1 - \gamma_j) \frac{1}{\sqrt{2\pi\phi_j^2}} \exp\left\{-\frac{\beta_j^2}{2\phi_j^2}\right\} + \gamma_j \frac{1}{\pi(1 + \beta_j^2)} | George and McCulloch (1993); Mitchell and Beauchamp (1988)

Note. C denotes a normalization constant. Φ(·) and ϕ(·) in the hyperlasso are the cumulative distribution function and the probability density function of the standard normal distribution.

Table 1: Conditional prior densities for the regression coefficients β implied by the various shrinkage priors and references for each shrinkage prior.

Figure 1: Densities of the shrinkage priors (panels: ridge, local Student's t, lasso, elastic net, group lasso, hyperlasso, horseshoe, and normal mixture).

3.1 Ridge
The ridge prior corresponds to normal priors centered around 0 on the re-
gression coefficients, i.e., (see e.g., Hsiang, 1975)

\[
\beta_j \mid \lambda, \sigma^2 \sim \text{Normal}\!\left(0, \frac{\sigma^2}{\lambda}\right), \quad \text{for } j = 1, \ldots, p, \qquad (8)
\]
\[
\lambda \sim \text{half-Cauchy}(0, 1).
\]

The posterior mean estimates under this prior will correspond to estimates
obtained using the ridge penalty or l2 norm, i.e., q = 2 in Equation (1) (Hoerl
and Kennard, 1970). The penalty parameter λ determines the amount of
shrinkage, with larger values resulting in smaller prior variation and thus
more shrinkage of the coefficients towards zero.
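To make the specification concrete, the following is a minimal Stan sketch of the full Bayes ridge model written out directly (rather than through the bayesreg package), combining the prior in Equation (8) with p(β0) = 1 and p(σ²) = σ⁻² from Section 2.

    library(rstan)

    ridge_model <- "
    data {
      int<lower=1> n;
      int<lower=1> p;
      matrix[n, p] X;
      vector[n] y;
    }
    parameters {
      real beta0;
      vector[p] beta;
      real<lower=0> sigma2;
      real<lower=0> lambda;
    }
    model {
      lambda ~ cauchy(0, 1);                      // half-Cauchy via the lower bound
      target += -log(sigma2);                     // p(sigma2) proportional to 1/sigma2
      beta ~ normal(0, sqrt(sigma2 / lambda));    // ridge prior, Equation (8)
      y ~ normal(beta0 + X * beta, sqrt(sigma2)); // linear regression likelihood
    }
    "
    # fit <- stan(model_code = ridge_model, data = list(n = n, p = p, X = X, y = y))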

3.2 Local Student’s t


We can extend the ridge prior in Equation 8 by making the prior variances
predictor-specific, thereby allowing for more variation, i.e.,

\[
\beta_j \mid \tau_j^2 \sim \text{Normal}(0, \sigma^2 \tau_j^2) \qquad (9)
\]
\[
\tau_j^2 \mid \nu, \lambda \sim \text{Inverse-Gamma}\!\left(\frac{\nu}{2}, \frac{\nu}{2\lambda}\right), \quad \text{for } j = 1, \ldots, p,
\]
\[
\lambda \sim \text{half-Cauchy}(0, 1).
\]

When integrating τj2 out, the following conditional prior distribution for the
regression coefficients is obtained:

\[
\beta_j \mid \nu, \lambda, \sigma^2 \sim \text{Student}\!\left(\nu, 0, \frac{\sigma^2}{\lambda}\right), \qquad (10)
\]

where Student(ν, 0, σ²/λ) denotes a non-standardized Student's t distribution centered around 0 with ν degrees of freedom and scale parameter σ²/λ. A smaller value for ν results in a distribution with heavier tails, with ν = 1 implying a Cauchy prior for βj. Larger (smaller) values for λ result in more (less) shrinkage towards zero. This prior has been considered, among others,
by Griffin and Brown (2005) and Meuwissen et al. (2001). Compared to the
ridge prior in (8), the local Student’s t prior has heavier tails. Throughout
this paper, we will consider ν = 1, such that the prior has Cauchy-like tails.
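The scale mixture representation in Equations (9) and (10) can be checked by simulation; the sketch below uses illustrative values for λ and σ, draws βj both ways, and compares the quantiles.

    set.seed(1)
    nu <- 1; lambda <- 2; sigma <- 1; n_draws <- 1e5

    # Scale mixture of normals: inverse-gamma mixing density on tau_j^2.
    tau2     <- 1 / rgamma(n_draws, shape = nu / 2, rate = nu / (2 * lambda))
    beta_mix <- rnorm(n_draws, 0, sigma * sqrt(tau2))

    # Direct draws from a Student's t with nu df and scale parameter sigma^2 / lambda.
    beta_t <- (sigma / sqrt(lambda)) * rt(n_draws, df = nu)

    quantile(beta_mix, c(.05, .25, .5, .75, .95))
    quantile(beta_t,   c(.05, .25, .5, .75, .95))  # should agree closely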

3.3 Lasso
The Bayesian counterpart of the lasso penalty was first proposed by Park
and Casella (2008). The Bayesian lasso can be obtained as a scale mixture
of normals with an exponential mixing density, i.e.,

\[
\beta_j \mid \tau_j^2, \sigma^2 \sim \text{Normal}(0, \sigma^2 \tau_j^2) \qquad (11)
\]
\[
\tau_j^2 \mid \lambda^2 \sim \text{Exponential}\!\left(\frac{\lambda^2}{2}\right), \quad \text{for } j = 1, \ldots, p,
\]
\[
\lambda \sim \text{half-Cauchy}(0, 1).
\]

Integrating τj2 out results in double-exponential or Laplace priors on the


regression coefficients, i.e.,

\[
\beta_j \mid \lambda, \sigma \sim \text{Double-exponential}\!\left(0, \frac{\sigma}{\lambda}\right), \quad \text{for } j = 1, \ldots, p. \qquad (12)
\]
With this prior, the posterior mode estimates are similar to estimates
obtained under the lasso penalty or l1 norm, i.e., q = 1 in Equation (1)
(Tibshirani, 1996). In addition to the overall shrinkage parameter λ, the lasso
prior has an additional predictor-specific shrinkage parameter τj . Therefore,
the lasso prior is more flexible than the ridge prior which only relies on the
overall shrinkage parameter in (8). Figure 1 clearly shows that the lasso prior
has a sharper peak around zero compared to the ridge prior.
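Analogously, the exponential mixing density in Equation (11) can be checked by simulation; the sketch below compares the scale mixture draws with direct double-exponential draws of scale σ/λ (generated as a difference of two exponentials), again with illustrative values for λ and σ.

    set.seed(1)
    lambda <- 2; sigma <- 1; n_draws <- 1e5

    tau2     <- rexp(n_draws, rate = lambda^2 / 2)    # exponential mixing density
    beta_mix <- rnorm(n_draws, 0, sigma * sqrt(tau2))

    beta_lap <- rexp(n_draws, rate = lambda / sigma) -
                rexp(n_draws, rate = lambda / sigma)  # Laplace(0, sigma / lambda)

    quantile(beta_mix, c(.05, .5, .95)); quantile(beta_lap, c(.05, .5, .95))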

Disadvantages of the lasso


The popularity of the classical lasso lies in its ability to shrink coefficients
to zero, thereby automatically performing variable selection. However, there
are several disadvantages to the classical lasso. Specifically, (i) it cannot se-
lect more predictors than observations, which is problematic when p > n; (ii)
when a group of predictors is correlated, the lasso generally selects only one
predictor of that group; (iii) the prediction error is higher for the lasso com-
pared to the ridge when n > p and the predictors are highly correlated; (iv)
it can lead to overshrinkage of large coefficients (see e.g., Polson and Scott,
2011); and (v) it does not always have the oracle property, which implies it
does not always perform as well in terms of variable selection as if the true
underlying model has been given (Fan and Li, 2001). The lasso only enjoys
the oracle property under specific and stringent conditions (Fan and Li, 2001;
Zou, 2006). These disadvantages have sparked the development of several
generalizations of the lasso. We will now discuss the Bayesian counterparts

of several of these generalizations, including the elastic net, group lasso, and
hyperlasso. Note that for the Bayesian lasso, coefficients cannot become
exactly zero and thus a criterion is needed to select the relevant variables.
Depending on the criterion used, more predictors than observations could be
selected. However, the Bayesian lasso does not allow a grouping structure to
be included, it overshrinks large coefficients, and it does not have the oracle
property since the tails for the prior on βj are not heavier than exponential
tails (Polson et al., 2014).

3.4 Elastic net


The most popular generalization of the lasso is the elastic net (Zou and
Hastie, 2005). The elastic net can be seen as a combination of the ridge
and lasso. The elastic net resolves issues (i), (ii), and (iii) of the ordinary
lasso. The elastic net prior can be obtained as the following scale mixture of
normals (Li and Lin, 2010):

\[
\beta_j \mid \lambda_2, \tau_j, \sigma^2 \sim \text{Normal}\!\left(0, \left(\frac{\lambda_2}{\sigma^2}\,\frac{\tau_j}{\tau_j - 1}\right)^{-1}\right) \qquad (13)
\]
\[
\tau_j \mid \lambda_2, \lambda_1, \sigma^2 \sim \text{Truncated-Gamma}\!\left(\frac{1}{2}, \frac{8\lambda_2\sigma^2}{\lambda_1^2}\right), \quad \text{for } j = 1, \ldots, p,
\]
\[
\lambda_1 \sim \text{half-Cauchy}(0, 1), \qquad \lambda_2 \sim \text{half-Cauchy}(0, 1),
\]

where the truncated Gamma density has support (1, ∞). This implies the
following conditional prior distributions for the regression coefficients:

\[
p(\beta_j \mid \sigma^2, \lambda_1, \lambda_2) = C(\lambda_1, \lambda_2, \sigma^2) \exp\left\{ -\frac{1}{2\sigma^2} \left( \lambda_1 |\beta_j| + \lambda_2 \beta_j^2 \right) \right\}, \quad \text{for } j = 1, \ldots, p, \qquad (14)
\]

where C(λ1 , λ2 , σ 2 ) denotes the normalizing constant. The corresponding


posterior modes for βj are equivalent to the estimates from the classical
elastic net penalty. Expression (14) illustrates how the elastic net prior offers
a combination of the double-exponential prior, i.e., the lasso penalty λ|βj |,
and the normal prior, i.e., the ridge penalty λβj2 . Specifically, the two penalty
parameters λ1 and λ2 determine the relative influence of the lasso and ridge
penalty, respectively. This can also be seen in Figure 1: the elastic net is not
as sharply peaked as the lasso prior, but it is sharper than the ridge prior.

As mentioned in the Introduction, a disadvantage of the classical elastic


net is that the sequential cross-validation procedure used to determine the
penalty parameters results in overshrinkage of the coefficients. This problem
is resolved in the Bayesian approach by estimating both penalty parameters
simultaneously through a full or empirical Bayes approach.
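The way λ1 and λ2 trade off the lasso and ridge components in Equation (14) can be visualized with a small sketch of the unnormalized log prior density (the constant log C is omitted).

    log_enet_prior <- function(beta, lambda1, lambda2, sigma2 = 1) {
      -(lambda1 * abs(beta) + lambda2 * beta^2) / (2 * sigma2)
    }

    beta_grid <- seq(-3, 3, length.out = 200)
    plot(beta_grid, log_enet_prior(beta_grid, lambda1 = 2, lambda2 = 0), type = "l")  # lasso-only
    lines(beta_grid, log_enet_prior(beta_grid, lambda1 = 0, lambda2 = 2), lty = 2)    # ridge-only
    lines(beta_grid, log_enet_prior(beta_grid, lambda1 = 1, lambda2 = 1), lty = 3)    # elastic net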

3.5 Group lasso


The group lasso (Yuan and Lin, 2006) is a generalization of the lasso primarily
aimed at improving performance when predictors are grouped in some way,
for example when qualitative predictors are coded as dummy or one-hot
variables (as is often implicitly done in ANOVA, for instance). Similarly to
the elastic net, the penalty function induced by the group lasso lies between
the l1 penalty of the lasso in (12) and the l2 penalty of the ridge in (8).
To apply the group lasso, the vector of regression coefficients β is split in
G vectors β g , where each vector represents the coefficients of predictors in
that group. Denote by mg the dimension of each vector β g . The group lasso
corresponds to the following scale mixture of normals (Kyung et al., 2010):

\[
\boldsymbol{\beta}_g \mid \tau_g^2, \sigma^2 \sim \text{MVN}(\mathbf{0}, \sigma^2 \tau_g^2 I_{m_g}) \qquad (15)
\]
\[
\tau_g^2 \mid \lambda^2 \sim \text{Gamma}\!\left(\frac{m_g + 1}{2}, \frac{\lambda^2}{2}\right), \quad \text{for } g = 1, \ldots, G,
\]
\[
\lambda \sim \text{half-Cauchy}(0, 1),
\]

where MVN denotes the multivariate normal distribution with dimension mg


and Img denotes an (mg ×mg ) identity matrix. Note that, contrary to the pri-
ors considered thus far, the group lasso prior does not consist of independent
priors on the regression coefficients βj , but rather independent priors on the
groups of regression coefficients β g . If there is no grouping structure, mg = 1
and the Bayesian group lasso in (15) reduces to the Bayesian lasso in (11).
The scale mixture of normals in (15) leads to the following conditional prior
for the regression coefficients (Kyung et al., 2010):

\[
p(\beta_j \mid \sigma^2, \lambda) = C \exp\left\{ -\frac{\lambda}{\sqrt{\sigma^2}} \sum_{g=1}^{G} \|\boldsymbol{\beta}_g\| \right\}, \quad \text{for } g = 1, \ldots, G, \text{ and } j = 1, \ldots, p, \qquad (16)
\]

where \(\|\boldsymbol{\beta}_g\| = (\boldsymbol{\beta}_g'\boldsymbol{\beta}_g)^{1/2}\) and C denotes the normalizing constant. Due to the simultaneous penalization of all coefficients in one group, all estimated regression coefficients in one group will be either zero or nonzero, depending on the value for λ.
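For the classical group lasso used as a comparison in the simulation study, the grouping structure is supplied as a vector of group labels; the sketch below uses the grpreg package with a hypothetical grouping of six predictors into three groups (the Bayesian group lasso requires the same kind of grouping information).

    library(grpreg)

    group  <- c(1, 1, 2, 2, 3, 3)   # hypothetical grouping: three groups of two predictors
    cv_fit <- cv.grpreg(X, y, group = group, penalty = "grLasso", nfolds = 10)
    cv_fit$lambda.min               # cross-validated penalty parameter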

3.6 Hyperlasso
Zou (2006) proposes the adaptive lasso as a generalization of the lasso that
enjoys the oracle property (limitation (v) of the lasso), i.e., it performs as
well as if the true underlying model has been given. The central idea of the
adaptive lasso is to separately weigh the penalty for each coefficient based
on the observed data. A Bayesian adaptive lasso has been proposed, among
others, by Alhamzawi et al. (2012) and Feng et al. (2015). However, as noted
by Griffin and Brown (2011), the weights included in the adaptive lasso place
great demands on the data, which can lead to poor performance in terms of
prediction and variable selection when the sample size is small. Therefore,
Griffin and Brown (2011) propose the hyperlasso as a Bayesian alternative
to the adaptive lasso, which is obtained through the following mixture of
normals:

\[
\beta_j \mid \phi_j^2 \sim \text{Normal}(0, \phi_j^2) \qquad (17)
\]
\[
\phi_j^2 \mid \tau_j \sim \text{Exponential}(\tau_j)
\]
\[
\tau_j \mid \nu, \lambda^2 \sim \text{Gamma}\!\left(\nu, \frac{1}{\lambda^2}\right), \quad \text{for } j = 1, \ldots, p,
\]
\[
\lambda \sim \text{half-Cauchy}(0, 1).
\]
This is equivalent to placing a Gamma mixing density on the hyperparameter
of the double-exponential prior:

\[
\beta_j \mid \tau_j \sim \text{Double-exponential}\!\left(0, (2\tau_j)^{1/2}\right) \qquad (18)
\]
\[
\tau_j \mid \nu, \lambda^2 \sim \text{Gamma}\!\left(\nu, \frac{1}{\lambda^2}\right), \quad \text{for } j = 1, \ldots, p.
\]
Note that the density of the hyperlasso prior strongly resembles the density
of the lasso prior (Figure 1), the main difference being that the hyperlasso
has heavier tails than the lasso. Contrary to the priors considered thus far,
this prior corresponds to a penalty that is non-convex implying that multiple
posterior modes can exist. Therefore, care must be taken to ensure that
the complete posterior distribution is explored. In addition, the hyperlasso
prior for β is not conditioned on the error variance σ 2 . Following Griffin and
Brown (2011), we will consider the specific case of ν = 0.5. However, whereas
Griffin and Brown (2011) use cross-validation to choose λ, we will rely on a
full and empirical Bayes approach.

3.7 Horseshoe
A popular shrinkage prior in the Bayesian literature is the horseshoe prior
(Carvalho et al., 2010):

βj |τj2 ∼ Normal(0, τj2 ) (19)


τj |λ ∼ Half-Cauchy(0, λ), for j = 1, . . . , p
λ|σ ∼ Half-Cauchy(0, σ).

Note that Carvalho et al. (2010) explicitly include the half-Cauchy prior for
λ in their specification, thereby implying a full Bayes approach. This formu-
lation results in a horseshoe prior that is automatically scaled by the error
standard deviation σ. The half-Cauchy prior can be written as a mixture
of inverse Gamma and Gamma densities, so that the horseshoe prior in (19)
can be equivalently specified as:

\[
\beta_j \mid \tau_j^2 \sim \text{Normal}(0, \tau_j^2) \qquad (20)
\]
\[
\tau_j^2 \mid \omega \sim \text{inverse-Gamma}\!\left(\tfrac{1}{2}, \omega\right)
\]
\[
\omega \mid \lambda^2 \sim \text{Gamma}\!\left(\tfrac{1}{2}, \lambda^2\right)
\]
\[
\lambda^2 \mid \gamma \sim \text{inverse-Gamma}\!\left(\tfrac{1}{2}, \gamma\right)
\]
\[
\gamma \mid \sigma^2 \sim \text{Gamma}\!\left(\tfrac{1}{2}, \sigma^2\right).
\]

An expression for the marginal prior of the regression coefficients βj is not


analytically tractable. The name “horseshoe” prior arises from the fact that
for fixed values λ = σ = 1, the implied prior for the shrinkage coefficient \(\kappa_j = \frac{1}{1 + \tau_j^2}\) is similar to a horseshoe-shaped Beta(0.5, 0.5) prior. Large coefficients
will lead to a shrinkage coefficient κj that is close to zero such that there
is practically no shrinkage, whereas small coefficients will have a κj close to
1 and will be shrunken heavily. Note that the horseshoe prior is the only
prior with an asymptote at zero (Figure 1). Combined with the heavy tails,
this ensures that small coefficients are heavily shrunken towards zero while
large coefficients remain large. The horseshoe prior has also been termed a
global-local shrinkage prior (e.g., Polson and Scott, 2011) because it has a
predictor-specific local shrinkage component τj as well as a global shrinkage
component λ. The basic intuition is that the global shrinkage parameter λ
performs shrinkage on all coefficients and the local shrinkage parameters τj

loosen the amount of shrinkage for truly large coefficients. Many global-local
shrinkage priors (including the horseshoe and hyperlasso) are special cases
of the general class of hypergeometric inverted-beta distributions (Polson
and Scott, 2012). In addition to the full Bayes approach implied by the
specification in (19), we will also consider an empirical Bayes approach to
determine λ.
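The horseshoe shape of the implied shrinkage coefficients can be reproduced with a short simulation for the case λ = σ = 1 described above.

    set.seed(1)
    tau   <- abs(rcauchy(1e5, location = 0, scale = 1))  # half-Cauchy(0, 1) local scales
    kappa <- 1 / (1 + tau^2)                              # shrinkage coefficients

    hist(kappa, breaks = 50, freq = FALSE)
    curve(dbeta(x, 0.5, 0.5), add = TRUE)                 # horseshoe-shaped Beta(0.5, 0.5)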

3.8 Regularized horseshoe


The horseshoe prior in Subsection 3.7 has the characteristic that large coef-
ficients will not be shrunken towards zero too heavily. Indeed, this is one of
the advertised qualities of the horseshoe prior (Carvalho et al., 2010). Al-
though this property is desirable in theory, it can be problematic in practice,
especially when parameters are weakly identified. In this situation, the pos-
terior means of the regression coefficients might not exist and even if they
do, the horseshoe prior can result in an unstable MCMC sampler (Ghosh
et al., 2017). To solve these problems Piironen and Vehtari (2017) propose
the regularized horseshoe, which is defined as follows:

\[
\beta_j \mid \tilde{\tau}_j^2, \lambda \sim \text{Normal}(0, \tilde{\tau}_j^2 \lambda^2), \quad \text{with } \tilde{\tau}_j^2 = \frac{c^2 \tau_j^2}{c^2 + \lambda^2 \tau_j^2} \qquad (21)
\]
\[
\lambda \mid \lambda_0^2 \sim \text{half-Cauchy}(0, \lambda_0^2), \quad \text{with } \lambda_0 = \frac{p_0}{p - p_0} \frac{\sigma}{\sqrt{n}}
\]
\[
\tau_j \sim \text{half-Cauchy}(0, 1)
\]
\[
c^2 \mid \nu, s^2 \sim \text{inverse-Gamma}(\nu/2, \nu s^2/2),
\]

where p0 represents a prior guess of the number of relevant variables. The


resulting prior will shrink small coefficients in the same way as the horseshoe [5],
but unlike the horseshoe, large coefficients will be shrunken towards zero by
a Student’s t distribution with ν degrees of freedom and scale s2 . Piironen
and Vehtari (2017) use a Student’s t distribution with ν = 4 and s2 = 2 and
we use the same hyperparameters, although other choices are possible. As
long as the degrees of freedom ν are small enough, the tails will be heavy
enough to ensure a robust shrinkage pattern for large coefficients.
It is possible to specify a half-Cauchy prior for the global shrinkage
parameter with a scale equal to 1 or the error standard deviation, i.e.,
λ ∼ half-Cauchy(0, 1) or λ ∼ half-Cauchy(0, σ), for example when no prior
information is available regarding the number of relevant variables. However, as noted by Piironen and Vehtari (2017), the scale based on the a priori number of relevant variables will generally be much smaller than 1 or σ. In addition, even if the prior guess for the number of relevant parameters p0 is incorrect, the results are robust to this choice as long as a half-Cauchy prior is used. Following the recommendations of Piironen and Vehtari (2017), we will only consider a full Bayes approach to determine λ in the regularized horseshoe.

[5] Because of the similarity to the horseshoe, the density and contour plots for the regularized horseshoe are not substantially different from those of the horseshoe and are therefore not included in Figure 1 and Figure 3.

3.9 Discrete normal mixture


The normal mixture prior is a discrete mixture of a peaked prior around zero
(the spike) and a vague proper prior (the slab); it is therefore also termed a
spike-and-slab prior. It is substantially different from the priors considered
thus far, which are all continuous mixtures of normal densities. Based on
the data, regression coefficients close to zero will be assigned to the spike,
resulting in shrinkage towards 0, while coefficients that deviate substantially
from zero will be assigned to the slab, resulting in (almost) no shrinkage.
Early proposals of mixture priors can be found in George and McCulloch
(1993) and Mitchell and Beauchamp (1988), and a scale mixture of normals
formulation can be found in Ishwaran and Rao (2005). We will consider the
following specification of the mixture prior:

βj |γj , τj2 , ϕ2j ∼ (γj )Normal(0, τj2 ) + (1 − γj )Normal(0, ϕ2j ) (22)


τj2 ∼ inverse Gamma(0.5, 0.5), for j = 1, . . . , p,

where τj is given a vague prior so that the variance of the slab is estimated
based on the data and ϕ2j is fixed to a small number, say ϕ2j = 0.001, to
create the spike. By assigning an inverse Gamma(0.5, 0.5) prior on τj2 , the
resulting marginal distribution of the slab component of the mixture is a
Cauchy distribution.
There are several options for the prior on the mixing parameter γj . In
this paper, we will consider the following two options: 1) γj as a Bernoulli
distributed variable taking on the value 0 or 1 with probability 0.5, i.e.,
γj ∼ Bernoulli(0.5); and 2) γj uniformly distributed between 0 and 1, i.e.,
γj ∼ Uniform(0, 1). In the first option, which we label the Bernoulli mixture,
each coefficient βj is given either the slab or the spike as prior. The second
option, labelled the uniform mixture, is more flexible in that each coefficient
is given a prior consisting of a mixture of the spike and slab, with each
component weighted by the uniform probabilities γj.

The density of the normal mixture prior is presented in Figure 1, which


clearly shows the prior is a combination of two densities. The representation
in Figure 1 is based on a normal mixture with equal mixing probabilities,
rather than a Bernoulli or uniform prior on the mixing probabilities. Note
that the mixture prior is not conditioned on the error variance σ 2 . We will
only consider a full Bayesian approach for the mixture priors.
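A minimal sketch of drawing regression coefficients from the Bernoulli version of the mixture prior in Equation (22), with the spike variance fixed at ϕj² = 0.001, illustrates the two components.

    set.seed(1)
    n_draws <- 1e5
    gamma_j <- rbinom(n_draws, size = 1, prob = 0.5)        # slab (1) or spike (0) indicator
    tau2    <- 1 / rgamma(n_draws, shape = 0.5, rate = 0.5) # inverse-Gamma(0.5, 0.5) slab variance
    beta    <- ifelse(gamma_j == 1,
                      rnorm(n_draws, 0, sqrt(tau2)),        # slab: marginally Cauchy
                      rnorm(n_draws, 0, sqrt(0.001)))       # spike: tightly concentrated at zero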

4 Illustrating the behavior of the shrinkage priors
4.1 Contour plots
Contour plots provide an insightful way to illustrate the behavior of classical
penalties and Bayesian shrinkage priors. First, consider Figure 2 which shows
the frequentist and Bayesian contour plots for the lasso. In both plots, the
green elliptical lines represent the contours of the sum of squared residuals,
centered around the regular OLS estimate β̂OLS . The solid black diamond
in the left plot represents the constraint region for the classical lasso penalty
function for two predictors β1 and β2 . The classical penalized regression
solution β̂LASSO is the point where the contour of the sum of squared residuals
meets the constraint region. This point corresponds to the minimum of the
penalized regression equation in (1). In the right plot, the diamond shaped
contours reflect the shape of the lasso prior (Section 3.3). The contour of the
Bayesian posterior distribution based on the lasso prior is shown in blue. As
can be seen, the posterior distribution is located between the sum of squared
residuals contour and the prior contour. The Bayesian posterior median
estimate β̂BAY ES is added in blue and shrunken towards zero compared to
the OLS estimate β̂OLS . Note that the posterior mode would correspond to
the classical penalized regression solution, if the same value for the penalty
parameter λ is used.

Figure 2: Contour plot representing the sum of squared residuals, classical lasso constraint region (left), bivariate lasso prior and posterior distribution (right), and the classical and Bayesian penalized point estimates.

Figure 3 shows the contour plots of the different shrinkage priors for two
predictors β1 and β2 , while Figure 4 shows the contour plots for the lasso
and group lasso for three predictors. From a classical penalization perspec-
tive, the lasso and elastic net penalties have sharp corners at β1 = β2 = 0.
As a result, the contour of the sum of squared residuals will meet the con-
tours of these penalties more easily at a point where one of the coefficients
equals zero, which explains why these penalties can shrink coefficients to
exactly zero. The ridge penalty, on the other hand, does not show these
sharp corners and can therefore not shrink coefficients to exactly zero. From
a Bayesian penalization perspective, the bivariate prior contour plots illus-
trate the shrinkage behavior of the priors. For example, the hyperlasso and
horseshoe have a lot of prior mass where at least one element is close to
zero, while the ridge has most prior mass where both elements are close to
zero. Figure 3 also shows that the ridge, local Student’s t, lasso, and elastic
net are convex. This can be seen when drawing a straight line from one
point to another point on a contour. For a convex distribution, the line lies
completely within the contour. The hyperlasso and horseshoe prior are non-
convex, which can be seen from the starlike shape of the contour. Frequentist
penalization has generally focused on convex penalties, due to their computational convenience for optimization procedures. In the Bayesian framework,


which relies on sampling (MCMC) techniques, the use of convex and non-
convex priors is computationally similar. It is recommendable, however, to
use multiple starting values in the case of non-convex priors due to possible
multimodality of the posterior distribution (Griffin and Brown, 2011).

Figure 3: Contour plots representing the bivariate prior distribution of the shrinkage priors.

Figure 4: Contour plots of the lasso (left) and group lasso (right) in R³, with β1 and β2 belonging to group 1 and β3 belonging to group 2. For the group lasso, if we consider only β1 and β2, which belong to the same group, the contour resembles that of the ridge, with most prior mass where both β1 and β2 are close to zero. On the other hand, if we consider β1 and β3, which belong to different groups, the contour is similar to that of the lasso, which has more prior mass where only one element is close to zero. This illustrates how the group lasso simultaneously shrinks elements belonging to the same group.

4.2 Shrinkage behavior

Figure 5: Difference between the estimated and true effect for the shrinkage priors in a simple normal model with the penalty parameter λ fixed to 1.

Prior shrinkage of small effects towards zero is important to obtain sparse


solutions. Figure 5 illustrates the shrinkage behavior of the priors in a simple
normal model: y ∼ Normal(β, 1). We estimate β based on a single obser-
vation y, which is varied from 0 to 50. Using only a single observation is
possible because the variance is known. The penalty parameter λ for each
shrinkage prior is fixed to 1. The resulting difference between the posterior
mean estimates and true means is shown in Figure 5. The behavior of the
priors varies greatly. Specifically, for the ridge and elastic net priors, the
difference between the estimated and true effect increases as the true mean
increases. For the lasso prior, the difference increases for small effects and
then remains constant. Note how the difference for the elastic net lies be-
tween the difference obtained under the ridge and lasso priors, illustrating
that the elastic net is a combination of the ridge and lasso priors. The other

shrinkage priors all show some differences between estimated and true means
for small effects, indicating shrinkage of these effects towards zero, but the
difference is practically zero for large effects. The right column of Figure 5
provides the same figure, but zoomed in on the small effects. Note how the
regularized horseshoe shrinks large effects more than the horseshoe prior, but
goes to zero eventually.
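For the ridge prior with a fixed penalty, the linear growth of this bias can also be derived in closed form: with a Normal(0, σ²/λ) prior, known variance σ² = 1, a single observation y, and λ = 1, the conjugate posterior mean equals y/(1 + λ). A minimal sketch under these assumptions:

    lambda    <- 1
    y_true    <- seq(0, 50, by = 1)
    post_mean <- y_true / (1 + lambda)            # conjugate normal-normal posterior mean

    plot(y_true, post_mean - y_true, type = "l",
         xlab = "True value", ylab = "Posterior mean - observation")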

Figure 6: Difference between the estimated and true effect for the shrinkage priors in a simple normal model with a half-Cauchy hyperprior specified for the penalty parameter λ.

A similar illustration is presented in Figure 6, but based on a full Bayes


approach where λ is freely estimated. Thus, instead of fixing λ to a specific
value, it is given a standard half-Cauchy prior distribution and estimated si-
multaneously with the other parameters in the model. Overall, all shrinkage
priors show differences between true and estimated means for small effects,
which decrease towards zero as the effect grows. Note that the difference is
negative, indicating that the estimated mean is smaller than the true mean.
Thus, all shrinkage priors heavily pull small effects towards zero, while exerting almost no influence on larger effects, although some shrinkage still occurs even when the true mean equals 50. The mixture priors result in the
largest differences between true and estimated small effects, indicating the
most shrinkage, and the local Student’s t prior shows the smallest difference
for small effects. As the effect grows, the regularized horseshoe prior results
in estimates farthest from the true effects, indicating the most shrinkage for
large effects.
These illustrations indicate that when the penalty parameter is fixed, only
the local Student’s t, hyperlasso, and (regularized) horseshoe priors allow for
shrinkage of small effects while estimating large effects correctly. However,
if a prior is specified for the penalty parameter, so that the uncertainty in
this parameter is taken into account, all shrinkage priors show this desirable
behavior.

5 Simulation study
5.1 Conditions
We conduct a Monte Carlo simulation study to compare the performance of
the shrinkage priors and several frequentist penalization methods. We simu-
late data from the linear regression model, given by: y = β01 + Xβ + ϵ , with
ϵi ∼ Normal(0, σ 2 ). We consider six simulation conditions. Conditions (1)-
(5) are equal to the conditions considered in Li and Lin (2010). In addition,
condition (1) and (2) have also been considered in Kyung et al. (2010); Roy
and Chakraborty (2016); Tibshirani (1996); Zou and Hastie (2005). Con-
dition (6) has been included to investigate a setting in which p > N . The
conditions are as follows [6]:

1. β = (3, 1.5, 0, 0, 2, 0, 0, 0)′ ; σ 2 = 9; X generated from a multivariate


normal distribution with mean vector 0 , variances equal to 1, and
pairwise correlations between predictors equal to 0.5. The number
of observations is n = 240, with 40 observations for training and 200
observations for testing the model.

2. β = (0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85)′ ; the other settings are
equal to those in condition (1).
[6] We have also considered two additional conditions in which p > n and the predictors are not highly correlated. Unfortunately, most shrinkage priors resulted in too much non-convergence to trust the results. A description of these additional conditions and the available results for the priors that did obtain enough convergence is available at https://osf.io/nveh3/. Additionally, we would like to refer to Kaseva (2018), where a more sparse, modified version of condition 1 is considered.

3. β = (3, . . . , 3, 0, . . . , 0)′, consisting of 15 coefficients equal to 3 followed by 15 equal to 0; σ² = 225; x_j = Z1 + ω_j, for j = 1, . . . , 5; x_j = Z2 + ω_j, for j = 6, . . . , 10; x_j = Z3 + ω_j, for j = 11, . . . , 15; and x_j ∼ Normal(0, 1), for j = 16, . . . , 30. Here, Z1, Z2, and Z3 are independent standard normal variables and ω_j ∼ Normal(0, 0.01). The number of observations is n = 600, with 200 observations for training and 400 observations for testing the model.

4. The number of observations is n = 800, with 400 observations for


training and 400 observations for testing the model; the other settings
are equal to those in condition (3).

5. β = (3, . . . , 3, 0, . . . , 0, 3, . . . , 3)′, consisting of 10 coefficients equal to 3, followed by 10 equal to 0 and another 10 equal to 3; the number of observations is n = 440, with 40 observations for training and 400 observations for testing the model; the other settings are equal to those in condition (3).

6. The number of observations is n = 55, with 25 observations for training


and 30 observations for testing the model; the other settings are equal
to those in condition (5).

We simulate 500 data sets per condition. All Bayesian methods have
been implemented in the software package Stan (Stan development team,
2017c), which we call from R using Rstan (Stan development team, 2017a).
We include the classical penalization methods available in the R-packages
glmnet (Friedman et al., 2010) and grpreg (Breheny and Huang, 2015),
i.e., the ridge, lasso, elastic net, and group lasso, for comparison. For the
classical penalization methods, the penalty parameter λ is selected based on
cross-validation using 10 folds. We also include classical forward selection
from the leaps (Lumley, 2017) package and we select the model based on
three different criteria: the adjusted R2 , Mallows’ Cp , and the BIC. For
both the Bayesian and the classical group lasso, a grouping structure should
be supplied for the analysis. We have used the grouping structure under
which the data was simulated. Thus, for conditions 3 to 6, we have four groups with the following regression coefficients belonging to each group: G1 =
β1 , . . . , β5 , G2 = β6 , . . . , β10 , G3 = β11 , . . . , β15 , and G4 = β16 , . . . , β30 . All
code for the simulation study is available at https://osf.io/bf5up/.
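As an illustration of the data-generating mechanism, the sketch below generates a single data set under condition (1), assuming an intercept of zero.

    library(MASS)
    set.seed(1)

    p     <- 8
    n     <- 240
    Sigma <- matrix(0.5, p, p); diag(Sigma) <- 1      # pairwise correlations of 0.5
    X     <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
    beta  <- c(3, 1.5, 0, 0, 2, 0, 0, 0)
    y     <- as.vector(X %*% beta) + rnorm(n, sd = 3) # sigma^2 = 9; intercept assumed 0

    train <- 1:40                                     # 40 training, 200 test observations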

5.2 Outcomes
The two main goals of regression analysis are: (1) to select variables that
are relevant for predicting the outcome, and (2) to accurately predict the

outcome. Therefore, we will focus on the performance of the shrinkage priors


in terms of variable selection and prediction accuracy. Unlike frequentist
penalization methods, Bayesian penalization methods do not automatically
shrink regression coefficients to be exactly zero. A criterion is thus needed
to select the relevant variables, for which we will use the credibility interval
criterion.[7] Using the credibility interval criterion, a predictor is excluded
when the credibility interval for βj covers 0, and it is included when 0 is
not contained in the credibility interval. This criterion thus depends on the
percentage of posterior probability mass included in the credibility interval.
We will investigate credibility intervals ranging from 0 to 100%, with steps of
10%. The optimal credibility interval is selected using the distance criterion
(see e.g., Perkins and Schisterman, 2006), i.e.,

distance = (1 − correct inclusion rate)2 + (false inclusion rate)2 , (23)

The credibility interval with the lowest distance is optimal in terms of the
highest correct inclusion rate and lowest false inclusion rate. For the selected
credibility interval, we will report Matthews’ correlation coefficient (MCC;
Matthews, 1975), which is a measure indicating the quality of the classifica-
tion. MCC ranges between -1 and +1 with MCC = -1 indicating complete
disagreement between the observed and predicted classifications and MCC
= +1 indicating complete agreement.
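A minimal sketch of the credibility interval criterion and the distance measure in Equation (23), assuming `draws` is a hypothetical matrix of posterior draws (iterations by predictors) and `truth` a logical vector marking the truly nonzero coefficients:

    select_ci <- function(draws, level) {
      lower <- apply(draws, 2, quantile, probs = (1 - level) / 2)
      upper <- apply(draws, 2, quantile, probs = 1 - (1 - level) / 2)
      lower > 0 | upper < 0              # include a predictor when 0 is outside its interval
    }

    distance <- function(selected, truth) {
      cir <- mean(selected[truth])       # correct inclusion rate
      fir <- mean(selected[!truth])      # false inclusion rate
      (1 - cir)^2 + fir^2                # Equation (23)
    }

    # levels <- seq(0.1, 0.9, 0.1)
    # sapply(levels, function(l) distance(select_ci(draws, l), truth))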
To assess the prediction accuracy of the shrinkage priors, we will consider
the prediction mean squared error (PMSE) for each replication. To compute
the PMSE, we first estimate the regression coefficients β̂ on the training
data only. These estimates are then used to predict the responses on the
outcome variable of the test set, y gen , for which the actual responses, y , are
available. Prediction of y gen occurs within the “generated quantities” block
in Stan, meaning that for each MCMC draw, yigen is generated such that we
obtain the full posterior distribution for each yigen . The mean of this posterior
distribution is used as estimate for y_i^{gen}. The PMSE for each replication can then be computed as \(\frac{1}{N}\sum_{i=1}^{N}(y_i - y_i^{gen})^2\). For each condition, this will result in 500 PMSEs, one for each replication, of which we will compute the median.
Furthermore, to assess the uncertainty in the median PMSE estimate, we will bootstrap the standard error (SE) by resampling 500 PMSEs from the obtained PMSE values and computing the median. This process is repeated 500 times and the standard deviation of the 500 bootstrapped median PMSEs is used as SE of the median PMSE.

[7] We have also considered the scaled neighborhood criterion (Li and Lin, 2010) and a fixed cut-off value to select the predictors. The scaled neighborhood criterion excludes a predictor if the posterior probability contained in [−√var(βp|y), √var(βp|y)] exceeds a certain threshold. However, this criterion generally performed worse than the credibility interval criterion. For the fixed cut-off value we excluded predictors when the posterior estimate |β̂| ≤ 0.1 based on Feng et al. (2015). However, the choice of this threshold is rather arbitrary and resulted in very high false inclusion rates.
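A minimal sketch of the PMSE and its bootstrapped standard error as described above, assuming `y_test` and `y_pred` hold the observed and predicted test responses of one replication and `pmse` the vector of 500 replication-wise PMSE values:

    pmse_one <- function(y_test, y_pred) mean((y_test - y_pred)^2)  # PMSE for one replication

    boot_se_median <- function(pmse, B = 500) {
      meds <- replicate(B, median(sample(pmse, length(pmse), replace = TRUE)))
      sd(meds)                            # SD of bootstrapped medians = SE of the median PMSE
    }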

5.3 Convergence
Convergence will be assessed using split R̂, which is a version of the often
used potential scale reduction factor (PSRF; Gelman and Rubin, 1992) that
is implemented in Stan (Stan development team, 2017b, p. 370-373). Ad-
ditionally, Stan reports the number of divergent transitions. A divergent
transition indicates that the approximation error in the algorithm accumu-
lates (Betancourt, 2017; Monnahan et al., 2016), which can be caused by a
too large step size, or because of strong curvature in the posterior distribu-
tion. As a result, it can be necessary to adapt the settings of the algorithm
or to reparametrize the model. For the simulation, we initially employed a very small step size (0.001) and a high target acceptance rate (0.999); however, these settings result in much slower sampling. Therefore, in the later conditions we used the default step size (1) and a lower target acceptance rate (0.85), and reran only the non-converged replications with the stricter settings (i.e., smaller step size and higher target acceptance rate). Only if all
parameters had a PSRF < 1.1 and there were no divergent transitions, did
we consider a replication as converged.8 We have only included those condi-
tions in the results with at least 50% convergence (i.e., at least 250 converged
replications). The convergence rates are available at https://osf.io/nveh3/.
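In rstan, these diagnostics can be obtained roughly as follows; the compiled model object `model` and data list `standata` are hypothetical, and the control values correspond to the strict settings mentioned above.

```r
library(rstan)

## Hypothetical compiled Stan model ('model') and data list ('standata')
fit <- sampling(model, data = standata, chains = 4, iter = 2000,
                control = list(adapt_delta = 0.999, stepsize = 0.001))

## Split R-hat (PSRF) for all parameters: flag a replication if any value >= 1.1
rhat      <- summary(fit)$summary[, "Rhat"]
converged <- all(rhat < 1.1, na.rm = TRUE)

## Count divergent transitions across all post-warmup draws
sp          <- get_sampler_params(fit, inc_warmup = FALSE)
n_divergent <- sum(sapply(sp, function(x) sum(x[, "divergent__"])))
converged   <- converged && n_divergent == 0
```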

5.4 Prediction accuracy


Table 2 shows the median PMSE per condition for the shrinkage priors and
classical penalization methods. For the regularized horseshoe, the prior guess
for the number of relevant variables p0 was based on the data-generating
model; however, the results were comparable when no prior guess or an in-
correct prior guess was used. For all methods, the median PMSE increases
as the condition becomes more complex. The smallest median PMSE per
8 For the horseshoe prior, all replications in all conditions resulted in one or more diver-
gent transitions, despite reparametrization of the model. The regularized horseshoe also
resulted in divergent transitions for most replications, although the percentage of diver-
gent transitions was on average much lower for the regularized horseshoe compared to the
horseshoe. The percentages divergent transitions are available at https://osf.io/nveh3/.
To be able to include these priors in the overview, we have only considered the PSRF to
assess convergence and manually checked the traceplots. However, see Kaseva (2018) for
a deeper investigation into the divergent transitions and alternative parametrizations of
the horseshoe prior.

Prior Condition 1 Condition 2 Condition 3 Condition 4 Condition 5 Condition 6


Full Bayes
Ridge 10.95 (0.08) 10.49 (0.09) 243.09 (0.86) 236.07 (0.95) 319.84 (1.6) 371.61 (7.14)
Local Student’s t 10.83 (0.09) 10.71 (0.08) 242.79 (0.8) 236.04 (0.89) 317.72 (2.31) 359.32 (6.67)
Lasso 10.78 (0.07) 10.53 (0.09) 238.89 (0.84) 234.29 (0.99) 316.43 (2.14) 360.37 (5.84)
Elastic net 10.93 (0.08) 10.62 (0.08) 243.23 (0.86) 236.1 (0.93) 323.24 (1.86) 387.47 (6.12)
Group lasso NA1 NA1 241.06 (0.84) 235.35 (0.87) 316.23 (2.14) 358.17 (7.14)
Hyperlasso 10.77 (0.09) 10.52 (0.08) 238.69 (0.78) 234.24 (0.97) 316.29 (2.31) 356.45 (5.28)
Horseshoe 10.68 (0.07) 10.88 (0.09) 231.97 (0.87) 230.20 (0.82) 316.56 (2.43) 355.69 (4.18)
Regularized horseshoe true p0 2 10.69 (0.07) 10.58 (0.08) 233.42 (0.97) 230.43 (0.84) 316.51 (2.03) 356.93 (4.37)
Bernoulli mixture 10.57 (0.10) 11.26 (0.09) 230.34 (0.78) 229.12 (0.94) 322.35 (2.27) 357.62 (4.89)
Uniform mixture 10.57 (0.10) 11.24 (0.09) 230.42 (0.75) 229.36 (0.91) 322.40 (2.03) 359.25 (4.76)
Empirical Bayes
Ridge 10.96 (0.08) 10.29 (0.1) 242.71 (0.9) 235.91 (0.92) 317.85 (2.04) 421.86 (11.93)
Local Student’s t 10.97 (0.08) 10.30 (0.08) 242.85 (0.89) 236.05 (0.88) 317.33 (2.12) 375.99 (6.85)
Lasso 10.78 (0.09) 10.49 (0.08) 238.86 (0.77) 234.10 (0.95) 317.38 (2.02) 424.82 (12.56)
Elastic net 10.95 (0.09) 10.31 (0.09) 242.67 (0.9) 235.92 (0.88) 316.64 (1.79) 365.52 (5.91)
Group lasso NA1 NA1 240.89 (0.78) 235.29 (0.92) 315.05 (2.21) 369.00 (6)
Hyperlasso 10.79 (0.08) 10.43 (0.07) 238.62 (0.8) 234.02 (0.99) 314.44 (2.55) 436.24 (12.02)
Horseshoe 10.67 (0.08) 10.87 (0.08) 231.55 (0.89) 229.79 (0.82) 320.73 (3.08) 354.69 (4.13)
Classical penalization
Ridge 10.96 (0.07) 10.11 (0.06) 241.64 (1.13) 235.25 (1.04) 318.44 (2.48) 494.50 (7.29)
Lasso 10.70 (0.09) 11.06 (0.07) 235.76 (0.90) 231.23 (1.07) 339.09 (3.44) 410.84 (7.33)
Elastic net 10.72 (0.08) 10.89 (0.07) 235.96 (0.88) 231.54 (0.93) 335.50 (3.28) 394.76 (7.78)
Group lasso NA1 NA1 233.11 (0.70) 229.76 (0.98) 343.06 (3.14) 407.14 (7.47)
Forward selection
BIC 11.01 (0.07) 13.08 (0.11) 681.09 (3.95) 679.33 (2.85) 379.59 (4.48) 417.35 (8.34)
Mallows’ Cp 11.06 (0.10) 12.45 (0.11) 464.94 (3.81) 466.25 (3.09) 478.00 (33.33) 687.46 (12.26)
Adjusted R2 11.31 (0.12) 11.83 (0.10) 248.51 (2.35) 241.03 (1.98) 357.79 (3.20) 402.58 (8.73)
Note.
1 No results are available for the group lasso in conditions 1 and 2, since no grouping structure is present in these conditions.
2 p0 denotes the prior guess for the number of relevant variables, which was set to the true number of relevant variables, except in condition 2 where all eight variables are relevant, so we set p0 = 7.
3 The smallest median PMSE per condition across methods is shown in bold and the smallest median PMSE per condition for the Bayesian methods is shown in italics.

Table 2: Median prediction mean squared error (PMSE) with bootstrapped standard errors in brackets for the shrinkage priors.

condition across methods is shown in bold and the smallest median PMSE
per condition for the Bayesian methods is shown in italics. In conditions 1, 3, and 4 the full Bayes Bernoulli mixture prior performs best; in condition 2, the classical ridge performs best; in condition 5, the empirical Bayes hyperlasso performs best; and in condition 6, the empirical Bayes horseshoe performs best. However, the differences between the methods are relatively small. Only in condition 6, where the number of predictors is larger than the number of observations, do the differences between the methods in terms of PMSE become more pronounced. As expected, forward selection performs
the worst, especially when Mallows’ Cp or the BIC is used to select the
best model. This illustrates the advantage of using penalization, even when
p < n. Overall, we can conclude that in terms of prediction accuracy the
penalization methods perform quite similarly, except when p > n.9
9 We have also computed the PMSE for a large test set with 1,000,000 observations as an approximation to the theoretical prediction error. In general, the theoretical PMSEs did not differ substantially from the PMSE in Table 2, except in condition 6 where the theoretical PMSE was generally larger. The theoretical PMSEs are available online at https://osf.io/nveh3/.

5.5 Variable selection accuracy

Prior Condition 1 Condition 3 Condition 4 Condition 5 Condition 6
For each condition, the columns give: Selected CI (%), MCC, Correct inclusion, False inclusion.
Full Bayes
Ridge 90 0.78 0.852 0.087 60 0.66 0.993 0.385 60 0.67 1.000 0.387 40 0.57 0.826 0.259 30 0.50 0.829 0.341
Local Student’s t 80 0.76 0.884 0.132 60 0.67 0.998 0.381 70 0.60 0.875 0.291 40 0.58 0.827 0.246 30 0.51 0.820 0.318
Lasso 80 0.77 0.887 0.126 50 0.64 0.997 0.421 50 0.63 1.000 0.438 30 0.59 0.863 0.283 30 0.50 0.788 0.284
Elastic net 90 0.77 0.851 0.094 60 0.63 0.973 0.386 60 0.67 0.999 0.387 40 0.55 0.824 0.272 30 0.49 0.836 0.361
Group lasso NA1 NA1 NA1 NA1 60 0.67 0.987 0.364 60 0.68 1.000 0.375 40 0.57 0.822 0.246 30 0.51 0.824 0.320
Hyperlasso 80 0.77 0.880 0.117 50 0.64 0.999 0.419 50 0.63 1.000 0.434 40 0.56 0.773 0.202 30 0.50 0.764 0.253
Horseshoe 70 0.78 0.886 0.116 20 0.49 0.999 0.609 20 0.49 1.000 0.603 20 0.56 0.858 0.311 20 0.50 0.795 0.294
Regularized horseshoe true p0 2 70 0.78 0.889 0.118 40 0.64 0.949 0.351 30 0.61 1.000 0.453 40 0.59 0.794 0.193 30 0.51 0.771 0.247
Bernoulli mixture 50 0.80 0.893 0.099 20 0.66 0.992 0.381 20 0.66 0.993 0.388 20 0.55 0.735 0.159 20 0.48 0.627 0.127
Uniform mixture 50 0.80 0.889 0.100 20 0.66 0.989 0.381 20 0.65 0.990 0.390 20 0.55 0.733 0.160 20 0.48 0.628 0.123
Empirical Bayes
Ridge 90 0.78 0.847 0.081 70 0.52 0.781 0.276 70 0.72 0.982 0.289 40 0.57 0.819 0.240 30 0.28 0.690 0.222
Local Student’s t 90 0.79 0.845 0.080 60 0.67 0.999 0.377 70 0.70 0.949 0.291 40 0.58 0.821 0.241 30 0.49 0.764 0.279
Lasso 80 0.77 0.885 0.118 50 0.64 0.999 0.417 50 0.63 1.000 0.433 30 0.57 0.831 0.270 20 0.28 0.591 0.273
Elastic net 90 0.78 0.848 0.085 60 0.67 0.999 0.377 70 0.71 0.968 0.290 40 0.58 0.834 0.251 30 0.50 0.810 0.314
Group lasso NA1 NA1 NA1 NA1 60 0.68 0.998 0.361 70 0.60 0.880 0.278 40 0.58 0.820 0.240 30 0.46 0.728 0.277
Hyperlasso 80 0.78 0.883 0.109 50 0.65 0.999 0.414 50 0.63 1.000 0.431 40 0.53 0.767 0.199 20 0.32 0.575 0.268
Horseshoe 70 0.78 0.876 0.108 20 0.51 0.997 0.578 20 0.51 0.997 0.576 20 0.53 0.840 0.308 20 0.48 0.756 0.266
Classical penalization
Lasso NA3 0.72 0.923 0.210 NA3 0.66 0.642 0.023 NA3 0.67 0.632 0.008 NA3 0.33 0.418 0.108 NA3 0.27 0.368 0.116
Elastic net NA3 0.62 0.971 0.361 NA3 0.97 1.000 0.031 NA3 0.99 1.000 0.013 NA3 0.47 0.645 0.161 NA3 0.39 0.548 0.153
Group lasso NA1 NA1 NA1 NA1 NA3 0.52 1.000 0.482 NA3 0.50 1.000 0.500 NA3 0.34 0.893 0.603 NA3 0.37 0.787 0.462
Forward selection
BIC NA3 0.77 0.843 0.093 NA3 -0.087 0.086 0.139 NA3 -0.092 0.081 0.137 NA3 -0.056 0.171 0.213 NA3 -0.0056 0.252 0.249
Mallows’ Cp NA3 0.72 0.895 0.176 NA3 0.023 0.188 0.164 NA3 0.023 0.186 0.164 NA3 -0.071 0.122 0.172 NA3 -0.074 0.024 0.052
Adjusted R2 NA3 0.60 0.938 0.343 NA3 0.14 0.332 0.201 NA3 0.15 0.329 0.197 NA3 0.0085 0.338 0.328 NA3 0.053 0.378 0.323
Note.
1 No results are available for the group lasso in condition 1, since no grouping structure is present in this condition.
2 p0 denotes the prior guess for the number of relevant variables, which was set to the true number of relevant variables, except in condition 2 where all eight variables are relevant, so we set p0 = 7.
3 For the classical penalization methods, the lasso, elastic net, and forward selection automatically shrink some coefficients to exactly zero, so that no criterion such as a credibility interval is needed for variable selection. The ridge is not included, since it does not automatically shrink coefficients to zero and therefore always has correct and false inclusion rates of 1.
4 The highest MCC and correct inclusion rate and the lowest false inclusion rate per condition across methods are shown in bold; the highest MCC and best rates per condition for the Bayesian methods are shown in italics.

Table 3: Matthews’ correlation coefficient (MCC) and correct and false inclusion rates based on the optimal credibility
intervals (CIs) selected using the distance criterion.

Table 3 shows MCC and the correct and false inclusion rates for the
optimal CIs for the shrinkage priors and MCC and the inclusion rates for the
classical penalization methods, which automatically select predictors. The
bold values indicate the best inclusion rates across all methods, whereas the
italic values indicate the best inclusion rates across the Bayesian methods.
Again, for the regularized horseshoe the results were comparable regardless of
whether a correct, incorrect, or no prior guess was used. In the first condition,
the classical penalization methods outperform the Bayesian methods in terms
of correct inclusion rates, but at the cost of higher false inclusion rates. This
is a well-known problem of the lasso and elastic net when cross-validation is used to select the penalty parameter λ. A solution to this problem is to
use stability selection to determine λ (Meinshausen and Bühlmann, 2010).
The optimal Bayesian methods in the first condition based on the highest
value for MCC are the mixture priors, both of which have reasonable correct
and false inclusion rates. Note that, generally, the differences with the other
Bayesian methods are relatively small in condition 1. In condition 3 and 4,
the correct inclusion rates are generally high and the false inclusion rates are
increased as well. As a result, the optimal Bayesian methods in condition 3
show a trade-off between correct and false inclusion rates, with the empirical
Bayes group lasso having the highest value for MCC. However, the differences
in MCC between most Bayesian methods are small and MCC is generally
lower compared to condition 1 due to the increased false inclusion rates. In
condition 4, multiple methods show a correct inclusion rate of 1, combined
with a high false inclusion rate. In terms of MCC, the empirical Bayes ridge
prior performs best. In condition 5, both rates and thus the MCC values are
slightly lower across all methods, which is a result of the optimal CI being
smaller. The full Bayes lasso and regularized horseshoe perform best in terms
of MCC, although the other shrinkage priors show comparable MCC values.
Condition 6 shows the most pronounced differences between the methods and
the greatest trade-off between correct and false inclusion rates. None of the
Bayesian methods attain a value for the MCC greater than 0.51, and some
shrinkage priors (i.e., the empirical Bayes ridge and lasso) result in a MCC
value of only 0.28. In conclusion, although there exist differences between
the methods in terms of variable selection accuracy, there is not one method
that performs substantially better than the other methods in terms of both
correct and false inclusion rates.

6 Empirical applications
We will now illustrate the shrinkage priors on two empirical data sets. An R
package bayesreg is available online (https://github.com/sara-vanerp/bayesreg)
that can be used to apply the shrinkage priors. The first illustration (math
performance) shows the benefits of using shrinkage priors in a situation where
the number of predictors is smaller than the number of observations. In the
second illustration (communities and crime), the number of predictors is
larger than the number of observations, and it is necessary to use some form
of regularization in order to fit the model.

6.1 Math performance


In this illustration, we aim to predict the final math grade of 395 Portuguese students in secondary schools (Cortez and Silva, 2008), obtained from the UCI machine learning repository10 (Lichman, 2013). The data set includes
30 predictors covering demographic, social and school related characteristics,
such as parents’ education and the time spent studying. The continuous pre-
dictors were standardized and dummy variables were used for the categorical
predictors, resulting in a total of 39 predictors. We split the data into an
approximately equal training (n = 197) and test (n = 198) set.
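The preprocessing described above can be done in a few lines of base R. The sketch below assumes the UCI student data have been read into a data frame `math`, with the final grade in column `G3` and the intermediate grades `G1` and `G2` dropped so that 30 predictors remain; these column names follow the UCI documentation and should be checked against the downloaded file.

```r
set.seed(123)

## Hypothetical data frame 'math' read from the UCI student performance data
y <- math$G3
X <- math[, setdiff(names(math), c("G1", "G2", "G3"))]

## Dummy-code the categorical predictors and standardize the continuous ones
X <- model.matrix(~ ., data = X)[, -1]                  # drop the intercept column
cont <- apply(X, 2, function(x) length(unique(x)) > 2)  # heuristic: non-dummy columns
X[, cont] <- scale(X[, cont])

## Split into approximately equal training and test sets (197 and 198 for n = 395)
train   <- sample(seq_len(nrow(X)), size = floor(nrow(X) / 2))
X_train <- X[train, ];  y_train <- y[train]
X_test  <- X[-train, ]; y_test  <- y[-train]
```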
Table 4 presents the computation time in seconds for each method, the
prediction mean squared error, and the number of included predictors. We
have not included the results for the horseshoe prior because this prior resulted in divergent transitions, which in turn led to unstable results (specifically, the PMSE varied greatly when rerunning the analysis).
It is clear that the Bayesian methods are computationally much more
intensive than the classical penalization methods, especially the regularized
horseshoe and mixture priors. The advantage brought by this increased com-
putation time, however, is the more straightforward interpretation of results
such as credibility intervals and the automatic computation of uncertainty
estimates. This can be seen in Figure 7 which shows the posterior density
for one regression coefficient β1 using the lasso prior and its 95% credibility
interval (i.e., the shaded dark blue area). The bootstrapped 95% confidence
interval obtained using the HDCI package (Liu et al., 2017) in R is shown
by the dashed grey lines and can be seen to underestimate the uncertainty.
This problem is often observed using classical lasso estimation (Kyung et al.,
2010). The PMSE clearly illustrates the advantage of penalization, even
10 The data is available at https://archive.ics.uci.edu/ml/datasets/Student+Performance

Shrinkage prior Computation time (seconds) PMSE Number of included predictors


Full Bayes
Ridge 179 19.53 22
Local Student’s t 361 19.44 22
Lasso 219 19.25 22
Elastic net 354 19.53 23
Group lasso 342 19.36 22
Hyperlasso 199 19.18 19
Regularized horseshoe with p0 1474 19.12 17
Bernoulli mixture 24524 19.31 9
Uniform mixture 4370 19.28 9
Empirical Bayes
Ridge 341 19.42 22
Local Student’s t 443 19.47 22
Lasso 444 19.26 22
Elastic net 603 19.41 22
Group lasso 534 19.50 22
Hyperlasso 387 19.11 19
Classical penalization
Ordinary least squares (OLS) 0.013 22.56 39
Ridge 0.118 18.95 38
Lasso 0.072 19.11 20
Elastic net 0.053 19.25 20
Group lasso 0.187 19.43 21
Forward selection
BIC 0.006 21.13 4
Mallows’ Cp 0.006 21.42 13
Adjusted R2 0.006 22.44 25

Table 4: Computation time in seconds (with a 2.8 GHz Intel Core i7 pro-
cessor), prediction mean squared error (PMSE), and number of included
predictors for the different methods for the math performance application

though the number of predictors is not greater than the sample size. Com-
pared to regression using OLS, all penalization methods show lower PMSEs.
Moreover, all penalization methods outperform forward selection in terms of
PMSE. Between the different penalization methods, differences in PMSE are
small.
Figure 7: Posterior density for β1 in the math performance application using the Bayesian lasso. The dark blue line depicts the posterior median, the shaded dark blue area depicts the 95% credibility interval. The black dashed lines depict the bootstrapped 95% confidence interval of the classical lasso.
The last column in Table 4 reports the number of included predictors for each method. Figure 8 shows which predictors are included for each method. Each point indicates an included predictor, based on the optimal CI from condition 5 in the simulation study. OLS does not exclude any predictors; neither, in general, does the classical ridge, although in this data set one ridge coefficient was estimated to be zero. Most shrinkage priors included 22 predictors, with the hyperlasso and regularized horseshoe resulting in a slightly sparser solution. The mixture priors selected far fewer predictors (9) than the other methods. The number of included predictors for the forward selection method ranged from 4 to 25, depending on the criterion used to select the best model.

Figure 8: Overview of the included predictors for each method in the math performance application. Points indicate that a predictor is included based on the optimal credibility interval (CI) from condition 5 in the simulation study. The methods on the x-axis are ordered such that the method that includes the fewest predictors is on the left and the method that includes the most predictors is on the right. The predictors on the y-axis are ordered with the predictor included least often at the top and the predictor included most often at the bottom.
Based on the prediction errors and the number of included predictors, we conclude that essentially all Bayesian and classical penalization methods performed comparably well. The computation time for the Bayesian meth-
ods was considerably larger than for the classical methods. However, this
increased computation time results in automatic availability of uncertainty
estimates which were generally larger compared to classical bootstrapped
confidence intervals.

6.2 Communities and crime


We illustrate the shrinkage priors on a data set containing 125 predictors of
the number of violent crimes per 100,000 residents in different communities
in the US (Redmond and Baveja, 2002) obtained from the UCI machine
learning repository11 (Lichman, 2013). The predictor variables include com-
munity characteristics, such as the median family income and the percentage
of housing that is occupied, as well as law enforcement characteristics, such
as the number of police officers and the police operating budget. We created
dummy variables for the two nominal predictors in the data set, resulting
in a total of 172 predictors. For the group lasso, all dummy variables corre-
sponding to one predictor make up a group. The number of observations is
319, after removing all cases with at least one missing value on any of the
predictors.12 We split the data into approximately equal training (n = 159)
and test (n = 160) sets. All predictors were normalized to have zero mean
and unit variance and the outcome variable was log transformed.
Table 5 reports the computation time in seconds for each method, as
11 We used the unnormalized data, available at https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized
12 Although the Bayesian framework allows for straightforward imputation of missing
values, we removed all cases with missing values to provide an illustration of the shrinkage
methods in a sparse data set.
well as the PMSE and the number of selected variables. Again, the horse-
shoe prior resulted in divergent transitions and is therefore excluded from
the results. The posterior density using the lasso prior for β15 is shown in
Figure 9, with the dark blue shaded area depicting the 95% credibility inter-
vals and the dashed black lines depicting the bootstrapped 95% confidence
interval of the classical lasso. Again, the bootstrapped confidence interval is
much smaller than the Bayesian credibility interval and located far from the
posterior median estimate (i.e., the dark blue line).

Shrinkage prior Computation time (seconds) PMSE Number of included predictors


Full Bayes
Ridge 677 0.217 61
Local Student’s t 1973 0.216 60
Lasso 2068 0.216 46
Elastic net 242 0.216 62
Group lasso 3044 0.216 61
Hyperlasso 1066 0.215 46
Regularized horseshoe with p0 15803 0.226 31
Bernoulli mixture 60006 1.706 54
Uniform mixture 26080 1.683 54
Empirical Bayes
Ridge 1195 0.218 60
Local Student’s t 1912 0.216 57
Lasso 4207 0.215 46
Elastic net 417 0.217 57
Group lasso 3992 0.217 62
Hyperlasso 2016 0.215 46
Classical penalization
Ridge 0.376 0.258 160
Lasso 0.200 0.508 33
Elastic net 0.164 0.460 26
Group lasso 0.408 0.663 55
Forward selection
BIC 0.023 1.500 17
Mallows’ Cp 0.023 0.276 1
Adjusted R2 0.023 4.093 141

Table 5: Computation time in seconds (with a 2.8 GHz Intel Core i7 pro-
cessor), prediction mean squared error (PMSE), and number of included
predictors for the different methods for the crime application

In addition, most Bayesian methods resulted in a lower PMSE than the classical methods, except for the mixture priors. The forward selection method resulted in much larger PMSEs, except when Mallows' Cp was used to find the best model; however, this model retained only one predictor. On
the other hand, using the Adjusted R2 criterion led to a model that included
141 predictors. This illustrates the arbitrariness of using forward selection.
Figure 10 shows which predictors are included for each method. Each point
indicates an included predictor, based on the optimal CI from condition 6 in
the simulation study. Apart from the forward selection methods, the classical
elastic net excludes most predictors. Interestingly, the Bayesian elastic net
and lasso retain many more predictors than the classical elastic net and lasso.
However, not all predictors that are retained by the classical lasso and elas-
tic net are also retained by the Bayesian lasso and elastic net. Specifically,
the predictors included by the classical methods but not by the Bayesian
methods all correspond to dummy variables for State. The hyperlasso and
lasso methods all include 46 predictors, whereas the ridge, local Student’s t,
elastic net, and group lasso priors all retain around 60 predictors. The mix-
ture priors both include 54 predictors. The regularized horseshoe retains the fewest predictors of all Bayesian methods, only 31. The classical ridge retains almost all predictors, but estimates some coefficients to be exactly zero in this data set.
Based on this illustration, we conclude that the Bayesian penalization
methods outperform the classical penalization methods in terms of predic-
tion error. The prediction errors of the Bayesian penalization methods do
not differ substantially, except for the mixture priors which showed larger
PMSEs. The shrinkage priors differ in how much shrinkage they perform
and thus in the number of predictors that are selected.

Figure 9: Posterior density for β15 in the crime application using the Bayesian
lasso. The dark blue line depicts the posterior median, the shaded dark blue
area depicts the 95% credibility interval. The black dashed lines depict the
bootstrapped 95% confidence interval of the classical lasso.

Figure 10: Overview of the included predictors for each method in the crime
application. Points indicate that a predictor is included based on the optimal
credibility interval (CI) from condition 6 in the simulation study. The meth-
ods on the x-axis are ordered such that the method that includes the least
predictors is on the left and the method that includes the most predictors is
on the right.

7 Discussion
The aim of this paper was to provide insights about the different shrinkage
priors that have been proposed for Bayesian penalization to avoid overfitting
of regression models in the case of many predictors. We have reviewed the
literature on shrinkage priors and presented them in a general framework
of scale mixtures of normal distributions to enable theoretical comparisons
between the priors. To model the penalty parameter λ, which is a central
part of the penalized regression model, a full Bayes and an empirical Bayes
approach were employed.
Although the various prior distributions differ substantially from each
other, e.g., regarding their tails or convexity, the priors performed very sim-
ilarly in the simulation study in those conditions where p < n. Overall,
the performance was comparable to the classical penalization approaches.
The math performance example clearly showed the advantage of using pe-
nalization to avoid overfitting when p < n. As in the simulation study, the
prediction errors in the math example were comparable across penalization
methods, although the number of included predictors varied across methods.
Finally, although classical penalization is much faster than Bayesian penal-
ization, it does not automatically provide accurate uncertainty estimates and
the bootstrapped confidence intervals obtained for the classical methods were
generally much smaller compared to the Bayesian credibility intervals.
The differences between the methods became more pronounced when
p > n. In condition 6 of the simulation study, the (regularized) horseshoe
and hyperlasso priors performed substantially better than most of the other
shrinkage priors in terms of PMSE. This is most likely due to the fact that the
hyperlasso and (regularized) horseshoe are non-convex global-local shrinkage
priors and are therefore particularly adept at keeping large coefficients large,
while shrinking the small coefficients enough towards zero. Future research
should consider various high-dimensional simulation conditions to further ex-
plore the performance of the shrinkage priors in such settings, for example
by varying the correlations between the predictors. The crime example il-
lustrated the use of the penalization methods further in a p > n situation.
In this example, most Bayesian approaches resulted in smaller prediction er-
rors than the classical approaches (except for the mixture priors). Also in
terms of the predictors that were included there were considerable differences
between the various approaches.
An important goal of the shrinkage methods discussed in this paper is
the ultimate selection of relevant variables. Throughout this paper, we have
focused on the use of marginal credibility intervals to do so. However, the use
of marginal credibility intervals to perform variable selection can be prob-
lematic, since the marginal intervals can behave differently compared to joint
credibility intervals. This is especially the case for global shrinkage priors,
such as the (regularized) horseshoe prior since these priors induce shrink-
age on all variables jointly (Piironen et al., 2017). Future research should
investigate whether the variable selection accuracy can be further improved
by using methods that jointly select relevant variables (for example, projec-
tion predictive variable selection; Piironen and Vehtari, 2016, or decoupled
shrinkage and selection; Hahn and Carvalho, 2015).
Throughout this paper, we focused on the linear regression model. Hope-
fully, the results presented in this paper and the corresponding R package
bayesreg available at https://github.com/sara-vanerp/bayesreg will lead to
an increased use of penalization methods in psychology, because of the im-
proved performance in terms of prediction error and variable selection accu-
racy compared to forward subset selection. The shrinkage priors investigated
here can be applied in more complex models in a straightforward manner.
For example, in generalized linear regression models such as logistic and
Poisson regression models, the only necessary adaptation is to incorporate
a link function in the model. Although not currently available in the R-
package, the available Stan model files can be easily adapted to generalized
linear models (GLMs). Additionally, packages such as brms (Bürkner, 2017)
and rstanarm (Stan Development Team, 2016) include several of the shrink-
age priors described here, or allow the user to specify them manually. Both
packages support (multilevel) GLMs, although rstanarm relies on precom-
piled models and is therefore less flexible than brms. Currently, an active
area of research employs Bayesian penalization in latent variable models,
such as factor models (see e.g., Lu et al., 2016; Jacobucci and Grimm, 2018)
and quantile structural equation models (see e.g., Feng et al., 2017). The
characteristics and behaviors of the shrinkage priors presented in this paper
can be a useful first step in solving these more challenging problems.
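As an illustration of such an extension, a generalized linear model with a (regularized) horseshoe prior on the coefficients can be specified in brms along the following lines; the data frame `dat` with binary outcome `y` is hypothetical, and the exact prior arguments should be checked against the installed brms version.

```r
library(brms)

## Hypothetical data frame 'dat' with a binary outcome 'y' and many predictors.
## The horseshoe() prior is placed on all regression coefficients (class "b");
## par_ratio encodes a prior guess of the ratio of nonzero to zero coefficients.
fit <- brm(y ~ ., data = dat, family = bernoulli(),
           prior = set_prior(horseshoe(df = 1, par_ratio = 0.1), class = "b"),
           chains = 4, cores = 4)
summary(fit)
```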

Acknowledgements
This research was supported by a Research Talent Grant from the Nether-
lands Organisation for Scientific Research. We would like to thank Aki Ve-
htari, Carlos Carvalho, Tuomas Kaseva, and Charlie Strauss for providing
helpful comments on an earlier version of this manuscript and pointing out
relevant references.

References
Alhamzawi, R., Yu, K., and Benoit, D. F. (2012). Bayesian adaptive lasso
quantile regression. Statistical Modelling, 12(3):279–297.

Andersen, M. R., Vehtari, A., Winther, O., and Hansen, L. K. (2017).


Bayesian inference for spatio-temporal spike-and-slab priors. Journal of
Machine Learning Research, 18(139):1–58.

Armagan, A., Dunson, D. B., and Lee, J. (2013). Generalized double pareto
shrinkage. Statistica Sinica.

Azmak, O., Bayer, H., Caplin, A., Chun, M., Glimcher, P., Koonin, S., and
Patrinos, A. (2015). Using big data to understand the human condition:
The kavli HUMAN project. Big Data, 3(3):173–188.

Bae, K. and Mallick, B. K. (2004). Gene selection using a two-level hierar-


chical bayesian model. Bioinformatics, 20(18):3423–3430.

Berger, J. O. (2006). The case for objective bayesian analysis. Bayesian


Analysis, 3:385–402.

Betancourt, M. (2017). A conceptual introduction to hamiltonian monte


carlo. arXiv preprint arXiv:1701.02434.

Bhadra, A., Datta, J., Polson, N. G., and Willard, B. (2016). The horseshoe+
estimator of ultra-sparse signals. Bayesian Analysis.

Bhadra, A., Datta, J., Polson, N. G., and Willard, B. T. (2017). Lasso meets
horseshoe: A survey. arXiv preprint arXiv:1706.10179.

Bhattacharya, A., Pati, D., Pillai, N. S., and Dunson, D. B. (2012). Bayesian
shrinkage. arXiv preprint arXiv:1212.6088.

Bornn, L., Gottardo, R., and Doucet, A. (2010). Grouping priors and the
Bayesian elastic net. arXiv preprint arXiv:1001.4083.

Breheny, P. and Huang, J. (2015). Group descent algorithms for nonconvex


penalized linear and logistic regression models with grouped predictors.
Statistics and Computing, 25:173–187.

Bürkner, P.-C. (2017). brms: An R package for Bayesian multilevel models


using Stan. Journal of Statistical Software, 80(1):1–28.

Caron, F. and Doucet, A. (2008). Sparse bayesian nonparametric regression.


In Proceedings of the 25th international conference on Machine learning -
ICML ’08. Association for Computing Machinery (ACM).

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe


estimator for sparse signals. Biometrika, 97(2):465–480.

Cortez, P. and Silva, A. M. G. (2008). Using data mining to predict sec-


ondary school student performance. In In A. Brito and J. Teixeira Eds.,
Proceedings of 5th FUture BUsiness TEChnology Conference, pages 5–12.

Derksen, S. and Keselman, H. J. (1992). Backward, forward and stepwise


automated subset selection algorithms: Frequency of obtaining authen-
tic and noise variables. British Journal of Mathematical and Statistical
Psychology, 45(2):265–282.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likeli-
hood and its oracle properties. Journal of the American Statistical Asso-
ciation, 96(456):1348–1360.

Fawcett, T. (2015). Mining the quantified self: Personal knowledge discovery


as a challenge for data science. Big Data, 3(4):249–266.

Feng, X.-N., Wang, Y., Lu, B., and Song, X.-Y. (2017). Bayesian regularized
quantile structural equation models. Journal of Multivariate Analysis,
154:234–248.

Feng, X.-N., Wu, H.-T., and Song, X.-Y. (2015). Bayesian adaptive lasso for
ordinal regression with latent variables. Sociological Methods & Research.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths


for generalized linear models via coordinate descent. Journal of Statistical
Software, 33(1):1–22.

Gelman, A. (2006). Prior distributions for variance parameters in hierarchical


models (comment on article by browne and draper). Bayesian Analysis,
1(3):515–534.

Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation


using multiple sequences. Statistical Science, 7(4):457–472.

George, E. I. and McCulloch, R. E. (1993). Variable selection via gibbs


sampling. Journal of the American Statistical Association, 88(423):881.

Ghosh, J., Li, Y., and Mitra, R. (2017). On the use of cauchy prior distribu-
tions for bayesian logistic regression. Bayesian Analysis.

Griffin, J. and Brown, P. (2017). Hierarchical shrinkage priors for regression


models. Bayesian Analysis, 12(1):135–159.

Griffin, J. E. and Brown, P. J. (2005). Alternative prior distributions for vari-


able selection with very many more variables than observations. University
of Warwick. Centre for Research in Statistical Methodology.

Griffin, J. E. and Brown, P. J. (2011). Bayesian hyper-lassos with non-convex


penalization. Australian & New Zealand Journal of Statistics, 53(4):423–
442.

Hahn, P. R. and Carvalho, C. M. (2015). Decoupling shrinkage and selection


in bayesian linear models: A posterior summary perspective. Journal of
the American Statistical Association, 110(509):435–448.

Hans, C. (2009). Bayesian lasso regression. Biometrika, 96(4):835–845.

Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical learning


with sparsity. CRC press.

Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation


for nonorthogonal problems. Technometrics, 12(1):55–67.

Hsiang, T. C. (1975). A bayesian view on ridge regression. The Statistician,


24(4):267.

Ishwaran, H. and Rao, J. S. (2005). Spike and slab variable selection: Fre-
quentist and bayesian strategies. The Annals of Statistics, 33(2):730–773.

Jacobucci, R. and Grimm, K. J. (2018). Comparison of frequentist and


bayesian regularization in structural equation modeling. Structural Equa-
tion Modeling: A Multidisciplinary Journal, pages 1–11.

Kaseva, T. (2018). Convergence diagnosis and comparison of shrinkage pri-


ors. Github repository.

Kyung, M., Gill, J., Ghosh, M., and Casella, G. (2010). Penalized regression,
standard errors, and bayesian lassos. Bayesian Analysis, 5(2):369–411.

Li, Q. and Lin, N. (2010). The bayesian elastic net. Bayesian Analysis,
5(1):151–170.

Lichman, M. (2013). UCI machine learning repository.

Liu, H., Xu, X., and Li, J. J. (2017). HDCI: High Dimensional Confidence
Interval Based on Lasso and Bootstrap. R package version 1.0-2.

Lu, Z.-H., Chow, S.-M., and Loken, E. (2016). Bayesian factor analysis as
a variable-selection problem: Alternative priors and consequences. Multi-
variate Behavioral Research, 51(4):519–539.

Lumley, T. (2017). leaps: Regression Subset Selection. R package version


3.0.

Matthews, B. (1975). Comparison of the predicted and observed secondary


structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA) -
Protein Structure, 405(2):442–451.

McNeish, D. M. (2015). Using lasso for predictor selection and to assuage


overfitting: A method long overlooked in behavioral sciences. Multivariate
Behavioral Research, 50(5):471–484.

Meinshausen, N. and Bühlmann, P. (2010). Stability selection. Journal of


the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–
473.

Meuwissen, T. H., Hayes, B. J., and Goddard, M. E. (2001). Prediction


of total genetic value using genome-wide dense marker maps. Genetics,
157(4):1819–1829.

Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection


in linear regression. Journal of the American Statistical Association,
83(404):1023–1032.

Monnahan, C. C., Thorson, J. T., and Branch, T. A. (2016). Faster estima-


tion of bayesian models in ecology using hamiltonian monte carlo. Methods
in Ecology and Evolution, 8(3):339–348.

Mulder, J. and Pericchi, L. R. (2018). The matrix-f prior for estimating and
testing covariance matrices. Bayesian Analysis, 13(4):1189–1210.

Park, T. and Casella, G. (2008). The bayesian lasso. Journal of the American
Statistical Association, 103(482):681–686.

Peltola, T., Havulinna, A. S., Salomaa, V., and Vehtari, A. (2014). Hi-
erarchical bayesian survival analysis and projective covariate selection in
cardiovascular event risk prediction. In Proceedings of the Eleventh UAI
Conference on Bayesian Modeling Applications Workshop-Volume 1218,
pages 79–88. CEUR-WS. org.

Perkins, N. J. and Schisterman, E. F. (2006). The inconsistency of “opti-


mal” cutpoints obtained using two criteria based on the receiver operating
characteristic curve. American Journal of Epidemiology, 163(7):670–675.

Piironen, J., Betancourt, M., Simpson, D., and Vehtari, A. (2017). Con-
tributed comment on article by van der Pas, Szabó, and van der Vaart.
Bayesian Analysis, 12(4):1264–1266.

Piironen, J. and Vehtari, A. (2016). Comparison of bayesian predictive meth-


ods for model selection. Statistics and Computing, 27(3):711–735.

Piironen, J. and Vehtari, A. (2017). Sparsity information and regulariza-


tion in the horseshoe and other shrinkage priors. Electronic Journal of
Statistics, 11(2):5018–5051.

Polson, N. G. and Scott, J. G. (2011). Shrink globally, act locally: Sparse


bayesian regularization and prediction. In Bayesian Statistics 9, pages
501–538. Oxford University Press (OUP).

Polson, N. G. and Scott, J. G. (2012). On the half-cauchy prior for a global


scale parameter. Bayesian Analysis, 7(4):887–902.

Polson, N. G., Scott, J. G., and Windle, J. (2014). The bayesian bridge.
Journal of the Royal Statistical Society: Series B (Statistical Methodology),
76(4):713–733.

Redmond, M. and Baveja, A. (2002). A data-driven software tool for en-


abling cooperative information sharing among police departments. Euro-
pean Journal of Operational Research, 141(3):660–678.

Roy, V. and Chakraborty, S. (2016). Selection of tuning parameters, solution


paths and standard errors for bayesian lassos. Bayesian Analysis.

Stan Development Team (2016). rstanarm: Bayesian applied regression mod-


eling via Stan. R package version 2.13.1.

Stan development team (2017a). RStan: The R interface to Stan, R package


version 2.16.2.

Stan development team (2017b). Stan Modeling Language Users Guide and
Reference Manual, version 2.17.0.

Stan development team (2017c). The Stan Core Library, version 2.16.0.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Jour-
nal of the Royal Statistical Society. Series B (Methodological), pages 267–
288.

Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: a


retrospective. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 73(3):273–282.

van de Wiel, M. A., Beest, D. E. t., and Münch, M. (2017). Learning


from a lot: Empirical bayes in high-dimensional prediction settings. arXiv
preprint arXiv:1709.04192.

van Erp, S., Mulder, J., and Oberski, D. L. (2018). Prior sensitivity analysis
in default bayesian structural equation modeling. Psychological Methods,
23(2):363–388.

Vehtari, A., Gabry, J., Yao, Y., and Gelman, A. (2018). loo: Efficient leave-
one-out cross-validation and waic for bayesian models. R package version
2.0.0.

West, M. (1987). On scale mixtures of normal distributions. Biometrika,


pages 646–648.

Wolpert, D. H. and Strauss, C. E. M. (1996). What bayes has to say about


the evidence procedure. In Maximum Entropy and Bayesian Methods,
pages 61–78. Springer Netherlands.

Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression


with grouped variables. Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 68(1):49–67.

Zhao, S., Gao, C., Mukherjee, S., and Engelhardt, B. E. (2016). Bayesian
group factor analysis with structured sparsity. Journal of Machine Learn-
ing Research, 17(196):1–47.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the
American Statistical Association, 101(476):1418–1429.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the
elastic net. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 67(2):301–320.
