Shrinkage Priors For Bayesian Penalized Regression
1 Introduction
Regression analysis is one of the main statistical techniques used in psychology to determine the effect of a set of predictors on an
outcome variable. The number of predictors is often large, especially in the
current “Age of Big Data”. For example, the Kavli HUMAN project (Az-
mak et al., 2015) aims to collect longitudinal data on all aspects of human
life for 10,000 individuals. Measurements include psychological assessments
(e.g., personality, IQ), health assessments (e.g., genome sequencing, brain
activity scanning), social network assessment, and variables related to edu-
cation, employment, and financial status, resulting in an extremely large set
of variables. Furthermore, personal tracking devices allow the collection of
large amounts of data on various topics, including for example mood, in a
longitudinal manner (Fawcett, 2015). The problem with regular regression
techniques such as ordinary least squares (OLS) is that they quickly lead to
overfitting as the ratio of predictor variables to observations increases (see
for example, McNeish, 2015, for an overview of the problems with OLS).
Penalized regression is a statistical technique widely used to guard against
overfitting in the case of many predictors. Penalized regression techniques
have the ability to select variables out of a large set of variables that are rele-
vant for predicting some outcome. Therefore, a popular setting for penalized
regression is in high-dimensional data, where the number of predictors p is
larger than the sample size n. Furthermore, in settings where the number
of predictors p is smaller than the sample size n (but still relatively large),
penalized regression can offer advantages in terms of avoiding overfitting and
achieving model parsimony compared to traditional variable selection meth-
ods such as null-hypothesis testing or stepwise selection methods (Derksen
and Keselman, 1992; Tibshirani, 1996). The central idea of penalized regres-
sion approaches is to add a penalty term to the minimization of the sum of
squared residuals, with the goal of shrinking small coefficients towards zero
while leaving large coefficients large, i.e.,
\[
\underset{\beta_0,\, \boldsymbol{\beta}}{\text{minimize}} \left\{ \frac{1}{2n} \left\|\mathbf{y} - \beta_0\mathbf{1} - X\boldsymbol{\beta}\right\|_2^2 + \lambda \left\|\boldsymbol{\beta}\right\|_q \right\}, \tag{1}
\]
where \(\left\|\boldsymbol{\beta}\right\|_q = \left(\sum_{j=1}^{p} |\beta_j|^q\right)^{1/q}\),
To obtain λEB, first note that the marginal likelihood is the product
of the likelihood and prior integrated over the model parameters, i.e.,
\[
p(\mathbf{y} \mid \lambda) = \iiint p(\mathbf{y} \mid X, \beta_0, \boldsymbol{\beta}, \sigma^2)\, p(\beta_0)\, p(\boldsymbol{\beta} \mid \sigma^2, \lambda)\, p(\sigma^2)\, d\beta_0\, d\boldsymbol{\beta}\, d\sigma^2. \tag{5}
\]
The empirical Bayes estimate λEB is the mode of the marginal posterior for λ, i.e., p(λ | y). This corresponds
to the maximum of the marginal likelihood p(y | λ) because of
the noninformative prior for λ.
Given that the loss function is the negative of the log likelihood, this
is equivalent to:
Finally, λCV is used to fit the model on the test set. Generally, the
prediction mean squared error (PMSE) is used to determine λCV , which
corresponds to a quadratic loss function.
In practice, K-fold cross-validation is often used. K-fold cross-validation
is a specific implementation of cross-validation in which the data are
split into only a training and a test set. The training set is then split into K
parts (usually K = 5 or K = 10), and the range of λ values is applied
K times to K − 1 parts of the training set, each time with a different
part serving as the validation set. The K estimates of the PMSE are then averaged
and a standard error is computed.
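The K-fold procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the data, the grid of λ values, and the use of a closed-form ridge fit are all assumptions made for the example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def kfold_cv_lambda(X, y, lambdas, K=5, seed=1):
    """Return the lambda minimizing the PMSE averaged over K validation folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, K)
    avg_pmse = []
    for lam in lambdas:
        errs = []
        for k in range(K):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            beta = ridge_fit(X[train], y[train], lam)
            errs.append(np.mean((y[test] - X[test] @ beta) ** 2))
        # average PMSE across the K validation sets; a standard error
        # across folds could be computed here as well
        avg_pmse.append(np.mean(errs))
    return lambdas[int(np.argmin(avg_pmse))]
```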
Frequentist penalization approaches often rely on cross-validation. In the
Bayesian literature, full and empirical Bayes are often employed, although
cross-validation is also possible in a Bayesian approach (see for example the
loo package in R; Vehtari et al., 2018). The intuition behind empirical Bayes
and cross-validation is similar: empirical Bayes aims to choose the value for
λ that is best in predicting the full data set, while cross-validation aims to
choose the value for λ that is best in predicting the validation set given a training set.
Bayesian Penalization
[Figure 1: Densities of the shrinkage priors; the panels include, among others, the horseshoe and the normal mixture priors.]
3.1 Ridge
The ridge prior corresponds to normal priors centered around 0 on the regression coefficients, i.e. (see e.g., Hsiang, 1975),
\[
\beta_j \mid \lambda, \sigma^2 \sim \text{Normal}\!\left(0, \frac{\sigma^2}{\lambda}\right), \quad \text{for } j = 1, \ldots, p, \tag{8}
\]
\[
\lambda \sim \text{half-Cauchy}(0, 1).
\]
The posterior mean estimates under this prior will correspond to estimates
obtained using the ridge penalty or l2 norm, i.e., q = 2 in Equation (1) (Hoerl
and Kennard, 1970). The penalty parameter λ determines the amount of
shrinkage, with larger values resulting in smaller prior variation and thus
more shrinkage of the coefficients towards zero.
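With a normal likelihood, the conditional posterior mean of the coefficients under this prior has a closed form that coincides with the classical ridge estimator. A minimal numerical sketch (the data and the fixed value of λ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)

# With beta_j ~ Normal(0, sigma^2 / lam) and y | beta ~ Normal(X beta, sigma^2 I),
# the conditional posterior mean is the classical ridge estimator:
def ridge_posterior_mean(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = ridge_posterior_mean(X, y, 0.0)     # lam = 0 recovers OLS
beta_ridge = ridge_posterior_mean(X, y, 10.0)  # larger lam -> more shrinkage
# The overall size of the coefficient vector shrinks as lam grows.
```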
When integrating τj2 out, the following conditional prior distribution for the
regression coefficients is obtained:
\[
\beta_j \mid \nu, \lambda, \sigma^2 \sim \text{Student-t}\!\left(\nu, 0, \frac{\sigma^2}{\lambda}\right), \tag{10}
\]
where \(\text{Student-t}(\nu, 0, \frac{\sigma^2}{\lambda})\) denotes a non-standardized Student's t distribution
centered around 0 with \(\nu\) degrees of freedom and scale parameter \(\frac{\sigma^2}{\lambda}\). A
smaller value for ν results in a distribution with heavier tails, with ν = 1
implying a Cauchy prior for βj . Larger (smaller) values for λ result in more
(less) shrinkage towards zero. This prior has been considered, among others,
by Griffin and Brown (2005) and Meuwissen et al. (2001). Compared to the
ridge prior in (8), the local Student’s t prior has heavier tails. Throughout
this paper, we will consider ν = 1, such that the prior has Cauchy-like tails.
3.3 Lasso
The Bayesian counterpart of the lasso penalty was first proposed by Park
and Casella (2008). The Bayesian lasso can be obtained as a scale mixture
of normals with an exponential mixing density, i.e.,
\[
\beta_j \mid \lambda, \sigma \sim \text{Double-exponential}\!\left(0, \frac{\sigma}{\lambda}\right), \quad \text{for } j = 1, \ldots, p. \tag{12}
\]
With this prior, the posterior mode estimates are similar to estimates
obtained under the lasso penalty or l1 norm, i.e., q = 1 in Equation (1)
(Tibshirani, 1996). In addition to the overall shrinkage parameter λ, the lasso
prior has an additional predictor-specific shrinkage parameter τj . Therefore,
the lasso prior is more flexible than the ridge prior which only relies on the
overall shrinkage parameter in (8). Figure 1 clearly shows that the lasso prior
has a sharper peak around zero compared to the ridge prior.
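The scale-mixture representation of Park and Casella (2008) can be checked numerically. The following sketch draws β from the hierarchy (τj² exponential, βj conditionally normal) and relies on the fact that the marginal is a double-exponential with scale σ/λ; the chosen values of σ and λ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma, lam = 1.0, 1.0
n_draws = 200_000

# tau_j^2 ~ Exponential(rate = lam^2 / 2), i.e. scale = 2 / lam^2
tau2 = rng.exponential(scale=2.0 / lam**2, size=n_draws)
# beta_j | tau_j^2 ~ Normal(0, sigma^2 * tau_j^2)
beta = rng.normal(0.0, sigma * np.sqrt(tau2))

# Marginally, beta_j ~ Double-exponential(0, sigma / lam),
# whose variance is 2 * (sigma / lam)^2 = 2 here.
```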
of several of these generalizations, including the elastic net, group lasso, and
hyperlasso. Note that for the Bayesian lasso, coefficients cannot become
exactly zero and thus a criterion is needed to select the relevant variables.
Depending on the criterion used, more predictors than observations could be
selected. However, the Bayesian lasso does not allow a grouping structure to
be included, it overshrinks large coefficients, and it does not have the oracle
property since the tails for the prior on βj are not heavier than exponential
tails (Polson et al., 2014).
\[
\beta_j \mid \lambda_2, \tau_j, \sigma^2 \sim \text{Normal}\!\left(0, \left(\frac{\lambda_2}{\sigma^2}\,\frac{\tau_j}{\tau_j - 1}\right)^{-1}\right), \tag{13}
\]
\[
\tau_j \mid \lambda_2, \lambda_1, \sigma^2 \sim \text{Truncated-Gamma}\!\left(\frac{1}{2},\, \frac{8\lambda_2\sigma^2}{\lambda_1^2}\right), \quad \text{for } j = 1, \ldots, p,
\]
\[
\lambda_1 \sim \text{half-Cauchy}(0, 1), \qquad \lambda_2 \sim \text{half-Cauchy}(0, 1),
\]
where the truncated Gamma density has support (1, ∞). This implies the
following conditional prior distributions for the regression coefficients:
\[
p(\beta_j \mid \sigma^2, \lambda_1, \lambda_2) = C(\lambda_1, \lambda_2, \sigma^2)\exp\left\{-\frac{1}{2\sigma^2}\left(\lambda_1|\beta_j| + \lambda_2\beta_j^2\right)\right\}, \quad \text{for } j = 1, \ldots, p. \tag{14}
\]
\[
p(\boldsymbol{\beta} \mid \sigma^2, \lambda) = C \exp\left\{-\frac{\lambda}{\sqrt{\sigma^2}}\sum_{g=1}^{G}\|\boldsymbol{\beta}_g\|\right\}, \quad \text{for } g = 1, \ldots, G, \tag{16}
\]
where \(\|\boldsymbol{\beta}_g\| = (\boldsymbol{\beta}_g'\boldsymbol{\beta}_g)^{\frac{1}{2}}\) and \(C\) denotes the normalizing constant. Due to
the simultaneous penalization of all coefficients in one group, all estimated
3.6 Hyperlasso
Zou (2006) proposes the adaptive lasso as a generalization of the lasso that
enjoys the oracle property (limitation (v) of the lasso), i.e., it performs as
well as if the true underlying model has been given. The central idea of the
adaptive lasso is to separately weigh the penalty for each coefficient based
on the observed data. A Bayesian adaptive lasso has been proposed, among
others, by Alhamzawi et al. (2012) and Feng et al. (2015). However, as noted
by Griffin and Brown (2011), the weights included in the adaptive lasso place
great demands on the data, which can lead to poor performance in terms of
prediction and variable selection when the sample size is small. Therefore,
Griffin and Brown (2011) propose the hyperlasso as a Bayesian alternative
to the adaptive lasso, which is obtained through the following mixture of
normals:
3.7 Horseshoe
A popular shrinkage prior in the Bayesian literature is the horseshoe prior
(Carvalho et al., 2010):
Note that Carvalho et al. (2010) explicitly include the half-Cauchy prior for
λ in their specification, thereby implying a full Bayes approach. This formu-
lation results in a horseshoe prior that is automatically scaled by the error
standard deviation σ. The half-Cauchy prior can be written as a mixture
of inverse Gamma and Gamma densities, so that the horseshoe prior in (19)
can be equivalently specified as:
loosen the amount of shrinkage for truly large coefficients. Many global-local
shrinkage priors (including the horseshoe and hyperlasso) are special cases
of the general class of hypergeometric inverted-beta distributions (Polson
and Scott, 2012). In addition to the full Bayes approach implied by the
specification in (19), we will also consider an empirical Bayes approach to
determine λ.
\[
\beta_j \mid \tilde{\tau}_j^2, \lambda \sim \text{Normal}(0, \tilde{\tau}_j^2\lambda^2), \quad \text{with } \tilde{\tau}_j^2 = \frac{c^2\tau_j^2}{c^2 + \lambda^2\tau_j^2}, \tag{21}
\]
\[
\lambda \mid \lambda_0^2 \sim \text{half-Cauchy}(0, \lambda_0^2), \quad \text{with } \lambda_0 = \frac{p_0}{p - p_0}\,\frac{\sigma}{\sqrt{n}},
\]
\[
\tau_j \sim \text{half-Cauchy}(0, 1),
\]
\[
c^2 \mid \nu, s^2 \sim \text{inverse-Gamma}(\nu/2,\, \nu s^2/2),
\]
where τj2 is given a vague prior so that the variance of the slab is estimated
based on the data and ϕ2j is fixed to a small number, say ϕ2j = 0.001, to
create the spike. By assigning an inverse Gamma(0.5, 0.5) prior on τj2 , the
resulting marginal distribution of the slab component of the mixture is a
Cauchy distribution.
There are several options for the prior on the mixing parameter γj . In
this paper, we will consider the following two options: 1) γj as a Bernoulli
distributed variable taking on the value 0 or 1 with probability 0.5, i.e.,
γj ∼ Bernoulli(0.5); and 2) γj uniformly distributed between 0 and 1, i.e.,
γj ∼ Uniform(0, 1). In the first option, which we label the Bernoulli mixture,
each coefficient βj is given either the slab or the spike as prior. The second
option, labelled the uniform mixture, is more flexible in that each coefficient
is given a prior consisting of a mixture of the spike and slab, with each
component weighted by the uniform probabilities γj.
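To make the two mixing options concrete, the following sketch draws from both prior specifications. The slab variance τj² is fixed at 1 here rather than estimated via its inverse Gamma prior, which is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n_draws = 100_000
phi2 = 0.001   # fixed spike variance, as in the text
tau2 = 1.0     # slab variance; fixed here instead of estimated from the data

# Option 1 (Bernoulli mixture): gamma_j ~ Bernoulli(0.5), so each draw comes
# entirely from either the slab (gamma_j = 1) or the spike (gamma_j = 0).
gamma_b = rng.binomial(1, 0.5, size=n_draws)
beta_bernoulli = rng.normal(0.0, np.sqrt(np.where(gamma_b == 1, tau2, phi2)))

# Option 2 (uniform mixture): gamma_j ~ Uniform(0, 1) weights the two mixture
# components, so the slab component is used with probability gamma_j.
gamma_u = rng.uniform(0.0, 1.0, size=n_draws)
use_slab = rng.uniform(size=n_draws) < gamma_u
beta_uniform = rng.normal(0.0, np.sqrt(np.where(use_slab, tau2, phi2)))
```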
[Figure: Geometric illustration of penalization for two coefficients: the left panel shows β̂OLS and β̂LASSO, the right panel β̂OLS and β̂BAYES, in the (β1, β2) plane.]
Figure 3 shows the contour plots of the different shrinkage priors for two
predictors β1 and β2 , while Figure 4 shows the contour plots for the lasso
and group lasso for three predictors. From a classical penalization perspec-
tive, the lasso and elastic net penalties have sharp corners at β1 = β2 = 0.
As a result, the contour of the sum of squared residuals will meet the con-
tours of these penalties more easily at a point where one of the coefficients
equals zero, which explains why these penalties can shrink coefficients to
exactly zero. The ridge penalty, on the other hand, does not show these
sharp corners and can therefore not shrink coefficients to exactly zero. From
a Bayesian penalization perspective, the bivariate prior contour plots illus-
trate the shrinkage behavior of the priors. For example, the hyperlasso and
horseshoe have a lot of prior mass where at least one element is close to
zero, while the ridge has most prior mass where both elements are close to
zero. Figure 3 also shows that the ridge, local Student’s t, lasso, and elastic
net are convex. This can be seen when drawing a straight line from one
point to another point on a contour. For a convex distribution, the line lies
completely within the contour. The hyperlasso and horseshoe prior are non-
convex, which can be seen from the starlike shape of the contour. Frequentist
penalization has generally focused on convex penalties, due to their computational advantages.
Figure 4: Contour plots of the lasso (left) and group lasso (right) in R3 , with
β1 and β2 belonging to group 1 and β3 belonging to group 2. For the group
lasso, if we consider only β1 and β2 , which belong to the same group, the
contour resembles that of the ridge with most prior mass if both β1 and β2
are close to zero. On the other hand, if we consider β1 and β3 , which belong
to different groups, the contour is similar to that of the lasso, which has more
prior mass where only one element is close to zero. This illustrates how the
group lasso simultaneously shrinks elements belonging to the same group.
Figure 5: Difference between the estimated and true effect for the shrinkage
priors in a simple normal model with the penalty parameter λ fixed to 1.
The shrinkage priors all show some differences between estimated and true means
for small effects, indicating shrinkage of these effects towards zero, but the
difference is practically zero for large effects. The right column of Figure 5
provides the same figure, but zoomed in on the small effects. Note how the
regularized horseshoe shrinks large effects more than the horseshoe prior, but
this difference eventually goes to zero.
Figure 6: Difference between the estimated and true effect for the shrinkage
priors in a simple normal model with a half-Cauchy hyperprior specified for
the penalty parameter λ.
occurs even when the true mean equals 50. The mixture priors result in the
largest differences between true and estimated small effects, indicating the
most shrinkage, and the local Student’s t prior shows the smallest difference
for small effects. As the effect grows, the regularized horseshoe prior results
in estimates farthest from the true effects, indicating the most shrinkage for
large effects.
These illustrations indicate that when the penalty parameter is fixed, only
the local Student’s t, hyperlasso, and (regularized) horseshoe priors allow for
shrinkage of small effects while estimating large effects correctly. However,
if a prior is specified for the penalty parameter, so that the uncertainty in
this parameter is taken into account, all shrinkage priors show this desirable
behavior.
5 Simulation study
5.1 Conditions
We conduct a Monte Carlo simulation study to compare the performance of
the shrinkage priors and several frequentist penalization methods. We simu-
late data from the linear regression model, given by \(\mathbf{y} = \beta_0\mathbf{1} + X\boldsymbol{\beta} + \boldsymbol{\epsilon}\), with
\(\epsilon_i \sim \text{Normal}(0, \sigma^2)\). We consider six simulation conditions. Conditions (1)-
(5) are equal to the conditions considered in Li and Lin (2010). In addition,
condition (1) and (2) have also been considered in Kyung et al. (2010); Roy
and Chakraborty (2016); Tibshirani (1996); Zou and Hastie (2005). Con-
dition (6) has been included to investigate a setting in which p > n. The
conditions are as follows:
2. β = (0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85)′ ; the other settings are
equal to those in condition (1).
6 We have also considered two additional conditions in which p > n and the predictors
are not highly correlated. Unfortunately, most shrinkage priors resulted in too much
non-convergence to trust the results. A description of these additional conditions and
the available results for the priors that did obtain enough convergence is available at
https://osf.io/nveh3/. Additionally, we would like to refer to Kaseva (2018) where a more
sparse, modified version of condition 1 is considered.
We simulate 500 data sets per condition. All Bayesian methods have
been implemented in the software package Stan (Stan development team,
2017c), which we call from R using Rstan (Stan development team, 2017a).
We include the classical penalization methods available in the R-packages
glmnet (Friedman et al., 2010) and grpreg (Breheny and Huang, 2015),
i.e., the ridge, lasso, elastic net, and group lasso, for comparison. For the
classical penalization methods, the penalty parameter λ is selected based on
cross-validation using 10 folds. We also include classical forward selection
from the leaps (Lumley, 2017) package and we select the model based on
three different criteria: the adjusted R2 , Mallows’ Cp , and the BIC. For
both the Bayesian and the classical group lasso, a grouping structure should
be supplied for the analysis. We have used the grouping structure under
which the data were simulated. Thus, for conditions 3 to 6, we have four
groups with the following regression coefficients belonging to each group: G1 =
β1 , . . . , β5 , G2 = β6 , . . . , β10 , G3 = β11 , . . . , β15 , and G4 = β16 , . . . , β30 . All
code for the simulation study is available at https://osf.io/bf5up/.
5.2 Outcomes
The two main goals of regression analysis are: (1) to select variables that
are relevant for predicting the outcome, and (2) to accurately predict the
The credibility interval with the lowest distance is optimal in terms of the
highest correct inclusion rate and lowest false inclusion rate. For the selected
credibility interval, we will report Matthews’ correlation coefficient (MCC;
Matthews, 1975), which is a measure indicating the quality of the classifica-
tion. MCC ranges between -1 and +1 with MCC = -1 indicating complete
disagreement between the observed and predicted classifications and MCC
= +1 indicating complete agreement.
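MCC can be computed directly from the classification counts. A small helper sketch (the encoding of the truly relevant and selected variables as 0/1 indicators is an assumption for illustration):

```python
def mcc(true_nonzero, selected):
    """Matthews' correlation coefficient for variable selection, comparing
    the truly nonzero coefficients with the selected set (0/1 indicators)."""
    tp = sum(t and s for t, s in zip(true_nonzero, selected))
    tn = sum(not t and not s for t, s in zip(true_nonzero, selected))
    fp = sum(not t and s for t, s in zip(true_nonzero, selected))
    fn = sum(t and not s for t, s in zip(true_nonzero, selected))
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    # convention: return 0 when any marginal count is zero
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0
```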
To assess the prediction accuracy of the shrinkage priors, we will consider
the prediction mean squared error (PMSE) for each replication. To compute
the PMSE, we first estimate the regression coefficients β̂ on the training
data only. These estimates are then used to predict the responses on the
outcome variable of the test set, y^gen, for which the actual responses, y, are
available. Prediction of y^gen occurs within the “generated quantities” block
in Stan, meaning that for each MCMC draw, y_i^gen is generated such that we
obtain the full posterior distribution for each y_i^gen. The mean of this posterior
distribution is used as the estimate for y_i^gen. The PMSE for each replication can
then be computed as \(\frac{1}{N}\sum_{i=1}^{N}(y_i - y_i^{gen})^2\). For each condition, this will result in
500 PMSEs, one for each replication, of which we will compute the median.
Furthermore, to assess the uncertainty in the median PMSE estimate, we
will bootstrap the standard error (SE) by resampling 500 PMSEs from the
7 We have also considered the scaled neighborhood criterion (Li and Lin, 2010) and a
fixed cut-off value to select the predictors. The scaled neighborhood criterion excludes a
predictor if the posterior probability contained in \([-\sqrt{\text{var}(\beta_p \mid \mathbf{y})}, \sqrt{\text{var}(\beta_p \mid \mathbf{y})}]\) exceeds a
certain threshold. However, this criterion generally performed worse than the credibility
interval criterion. For the fixed cut-off value, we excluded predictors when the posterior
estimate \(|\hat{\beta}| \leq 0.1\), based on Feng et al. (2015). However, the choice of this threshold is
rather arbitrary and resulted in very high false inclusion rates.
obtained PMSE values and computing the median. This process is repeated
500 times and the standard deviation of the 500 bootstrapped median PMSEs
is used as SE of the median PMSE.
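The bootstrap procedure for the SE of the median PMSE can be sketched as follows (the number of bootstrap resamples matches the 500 used in the text; the input PMSEs are illustrative):

```python
import numpy as np

def bootstrap_se_median(pmses, n_boot=500, seed=0):
    """SE of the median PMSE: resample the PMSEs with replacement,
    take the median of each resample, and return the SD of those medians."""
    rng = np.random.default_rng(seed)
    medians = [np.median(rng.choice(pmses, size=len(pmses), replace=True))
               for _ in range(n_boot)]
    return np.std(medians)
```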
5.3 Convergence
Convergence will be assessed using split R̂, a version of the often-used
potential scale reduction factor (PSRF; Gelman and Rubin, 1992) that
is implemented in Stan (Stan development team, 2017b, pp. 370-373). Additionally, Stan reports the number of divergent transitions. A divergent
transition indicates that the approximation error in the algorithm accumu-
lates (Betancourt, 2017; Monnahan et al., 2016), which can be caused by
too large a step size or by strong curvature in the posterior distribution.
As a result, it can be necessary to adapt the settings of the algorithm
or to reparametrize the model. For the simulation, we initially employed a
very small step size (0.001) and a high target acceptance rate (0.999); however,
these settings resulted in much slower sampling. Therefore, in the later conditions
we used the default step size (1) and a lower target acceptance rate
(0.85), and only reran the replications that did not converge with the stricter
settings (i.e., smaller step size and higher target acceptance rate). Only if all
parameters had a PSRF < 1.1 and there were no divergent transitions, did
we consider a replication as converged.8 We have only included those condi-
tions in the results with at least 50% convergence (i.e., at least 250 converged
replications). The convergence rates are available at https://osf.io/nveh3/.
condition across methods is shown in bold and the smallest median PMSE
per condition for the Bayesian methods is shown in italics. In condition 1, 3,
and 4 the full Bayesian Bernoulli mixture prior performs best; in condition 2,
the classical ridge performs best; in condition 5, the empirical Bayesian hy-
perlasso performs best; and in condition 6, the empirical Bayesian horseshoe
performs best. However, the differences between the methods are relatively
small. Only in condition 6, where the number of predictors is larger than
the number of observations, do the differences between the methods in terms of
PMSE become more pronounced. As expected, forward selection performs
the worst, especially when Mallows’ Cp or the BIC is used to select the
best model. This illustrates the advantage of using penalization, even when
p < n. Overall, we can conclude that in terms of prediction accuracy the
penalization methods perform quite similarly, except when p > n.9
9 We have also computed the PMSE for a large test set with 1,000,000 observations as
an approximation to the theoretical prediction error. In general, the theoretical PMSEs
did not differ substantially from the PMSEs in Table 2, except in condition 6, where the
theoretical PMSE was generally larger. The theoretical PMSEs are available online at
https://osf.io/nveh3/
Prior — for each of Conditions 1, 3, 4, 5, and 6, the columns give: Selected CI (%), MCC, Correct inclusion rate, False inclusion rate.
Full Bayes
Ridge 90 0.78 0.852 0.087 60 0.66 0.993 0.385 60 0.67 1.000 0.387 40 0.57 0.826 0.259 30 0.50 0.829 0.341
Local Student’s t 80 0.76 0.884 0.132 60 0.67 0.998 0.381 70 0.60 0.875 0.291 40 0.58 0.827 0.246 30 0.51 0.820 0.318
Lasso 80 0.77 0.887 0.126 50 0.64 0.997 0.421 50 0.63 1.000 0.438 30 0.59 0.863 0.283 30 0.50 0.788 0.284
Elastic net 90 0.77 0.851 0.094 60 0.63 0.973 0.386 60 0.67 0.999 0.387 40 0.55 0.824 0.272 30 0.49 0.836 0.361
Group lasso NA1 NA1 NA1 NA1 60 0.67 0.987 0.364 60 0.68 1.000 0.375 40 0.57 0.822 0.246 30 0.51 0.824 0.320
Hyperlasso 80 0.77 0.880 0.117 50 0.64 0.999 0.419 50 0.63 1.000 0.434 40 0.56 0.773 0.202 30 0.50 0.764 0.253
Horseshoe 70 0.78 0.886 0.116 20 0.49 0.999 0.609 20 0.49 1.000 0.603 20 0.56 0.858 0.311 20 0.50 0.795 0.294
Regularized horseshoe (true p0 2) 70 0.78 0.889 0.118 40 0.64 0.949 0.351 30 0.61 1.000 0.453 40 0.59 0.794 0.193 30 0.51 0.771 0.247
Bernoulli mixture 50 0.80 0.893 0.099 20 0.66 0.992 0.381 20 0.66 0.993 0.388 20 0.55 0.735 0.159 20 0.48 0.627 0.127
Uniform mixture 50 0.80 0.889 0.100 20 0.66 0.989 0.381 20 0.65 0.990 0.390 20 0.55 0.733 0.160 20 0.48 0.628 0.123
Empirical Bayes
Ridge 90 0.78 0.847 0.081 70 0.52 0.781 0.276 70 0.72 0.982 0.289 40 0.57 0.819 0.240 30 0.28 0.690 0.222
Local Student’s t 90 0.79 0.845 0.080 60 0.67 0.999 0.377 70 0.70 0.949 0.291 40 0.58 0.821 0.241 30 0.49 0.764 0.279
Lasso 80 0.77 0.885 0.118 50 0.64 0.999 0.417 50 0.63 1.000 0.433 30 0.57 0.831 0.270 20 0.28 0.591 0.273
Elastic net 90 0.78 0.848 0.085 60 0.67 0.999 0.377 70 0.71 0.968 0.290 40 0.58 0.834 0.251 30 0.50 0.810 0.314
Group lasso NA1 NA1 NA1 NA1 60 0.68 0.998 0.361 70 0.60 0.880 0.278 40 0.58 0.820 0.240 30 0.46 0.728 0.277
Hyperlasso 80 0.78 0.883 0.109 50 0.65 0.999 0.414 50 0.63 1.000 0.431 40 0.53 0.767 0.199 20 0.32 0.575 0.268
Horseshoe 70 0.78 0.876 0.108 20 0.51 0.997 0.578 20 0.51 0.997 0.576 20 0.53 0.840 0.308 20 0.48 0.756 0.266
Classical penalization
Lasso NA3 0.72 0.923 0.210 NA3 0.66 0.642 0.023 NA3 0.67 0.632 0.008 NA3 0.33 0.418 0.108 NA3 0.27 0.368 0.116
Elastic net NA3 0.62 0.971 0.361 NA3 0.97 1.000 0.031 NA3 0.99 1.000 0.013 NA3 0.47 0.645 0.161 NA3 0.39 0.548 0.153
Group lasso NA1 NA1 NA1 NA1 NA3 0.52 1.000 0.482 NA3 0.50 1.000 0.500 NA3 0.34 0.893 0.603 NA3 0.37 0.787 0.462
Forward selection
BIC NA3 0.77 0.843 0.093 NA3 -0.087 0.086 0.139 NA3 -0.092 0.081 0.137 NA3 -0.056 0.171 0.213 NA3 -0.0056 0.252 0.249
Mallows’ Cp NA3 0.72 0.895 0.176 NA3 0.023 0.188 0.164 NA3 0.023 0.186 0.164 NA3 -0.071 0.122 0.172 NA3 -0.074 0.024 0.052
Adjusted R2 NA3 0.60 0.938 0.343 NA3 0.14 0.332 0.201 NA3 0.15 0.329 0.197 NA3 0.0085 0.338 0.328 NA3 0.053 0.378 0.323
Note.
1 No results are available for the group lasso in condition 1, since no grouping structure is present in this condition.
2 p0 denotes the prior guess for the number of relevant variables, which was set to the true number of relevant variables, except in condition 2 where all eight variables are relevant, so we set p0 = 7.
3 For the classical penalization methods, the lasso, elastic net, and forward selection automatically shrink some coefficients to exactly zero, so that no criterion such as a confidence interval is needed for variable selection. The ridge is not included, since it does not automatically shrink coefficients to zero and therefore always has a correct and false inclusion rate of 1.
4 The highest MCC and correct inclusion rate and the lowest false inclusion rate per condition across methods are shown in bold; the highest MCC and best rates per condition for the Bayesian methods are shown in italics.
Table 3: Matthews’ correlation coefficient (MCC) and correct and false inclusion rates based on the optimal credibility
intervals (CIs) selected using the distance criterion.
Table 3 shows MCC and the correct and false inclusion rates for the
optimal CIs for the shrinkage priors and MCC and the inclusion rates for the
classical penalization methods, which automatically select predictors. The
bold values indicate the best inclusion rates across all methods, whereas the
italic values indicate the best inclusion rates across the Bayesian methods.
Again, for the regularized horseshoe the results were comparable regardless of
whether a correct, incorrect, or no prior guess was used. In the first condition,
the classical penalization methods outperform the Bayesian methods in terms
of correct inclusion rates, but at the cost of higher false inclusion rates. This
is a well-known problem of the lasso and elastic net when cross-validation
is used to select the penalty parameter λ. A solution to this problem is to
use stability selection to determine λ (Meinshausen and Bühlmann, 2010).
The optimal Bayesian methods in the first condition based on the highest
value for MCC are the mixture priors, both of which have reasonable correct
and false inclusion rates. Note that, generally, the differences with the other
Bayesian methods are relatively small in condition 1. In conditions 3 and 4,
the correct inclusion rates are generally high and the false inclusion rates are
increased as well. As a result, the optimal Bayesian methods in condition 3
show a trade-off between correct and false inclusion rates, with the empirical
Bayes group lasso having the highest value for MCC. However, the differences
in MCC between most Bayesian methods are small and MCC is generally
lower compared to condition 1 due to the increased false inclusion rates. In
condition 4, multiple methods show a correct inclusion rate of 1, combined
with a high false inclusion rate. In terms of MCC, the empirical Bayes ridge
prior performs best. In condition 5, both rates and thus the MCC values are
slightly lower across all methods, which is a result of the optimal CI being
smaller. The full Bayes lasso and regularized horseshoe perform best in terms
of MCC, although the other shrinkage priors show comparable MCC values.
Condition 6 shows the most pronounced differences between the methods and
the greatest trade-off between correct and false inclusion rates. None of the
Bayesian methods attain a value for the MCC greater than 0.51, and some
shrinkage priors (i.e., the empirical Bayes ridge and lasso) result in a MCC
value of only 0.28. In conclusion, although there exist differences between
the methods in terms of variable selection accuracy, there is not one method
that performs substantially better than the other methods in terms of both
correct and false inclusion rates.
6 Empirical applications
We will now illustrate the shrinkage priors on two empirical data sets. An R
package bayesreg is available online (https://github.com/sara-vanerp/bayesreg)
that can be used to apply the shrinkage priors. The first illustration (math
performance) shows the benefits of using shrinkage priors in a situation where
the number of predictors is smaller than the number of observations. In the
second illustration (communities and crime), the number of predictors is
larger than the number of observations, and it is necessary to use some form
of regularization in order to fit the model.
Table 4: Computation time in seconds (with a 2.8 GHz Intel Core i7 pro-
cessor), prediction mean squared error (PMSE), and number of included
predictors for the different methods for the math performance application
though the number of predictors is not greater than the sample size. Com-
pared to regression using OLS, all penalization methods show lower PMSEs.
Moreover, all penalization methods outperform forward selection in terms of
PMSE. Between the different penalization methods, differences in PMSE are
small.
Figure 8: Overview of the included predictors for each method in the math
performance application. Points indicate that a predictor is included based
on the optimal credibility interval (CI) from condition 5 in the simulation
study. The methods on the x-axis are ordered such that the method that
includes the least predictors is on the left and the method that includes the
most predictors is on the right. The predictors on the y-axis are ordered
with the predictor being included the least on top and the predictor being
included the most at the bottom.
well as the PMSE and the number of selected variables. Again, the horse-
shoe prior resulted in divergent transitions and is therefore excluded from
the results. The posterior density using the lasso prior for β15 is shown in
Figure 9, with the dark blue shaded area depicting the 95% credibility inter-
vals and the dashed black lines depicting the bootstrapped 95% confidence
interval of the classical lasso. Again, the bootstrapped confidence interval is
much narrower than the Bayesian credibility interval and located far from the
posterior median estimate (i.e., the dark blue line).
Table 5: Computation time in seconds (with a 2.8 GHz Intel Core i7 pro-
cessor), prediction mean squared error (PMSE), and number of included
predictors for the different methods for the crime application
Figure 9: Posterior density for β15 in the crime application using the Bayesian
lasso. The dark blue line depicts the posterior median, the shaded dark blue
area depicts the 95% credibility interval. The black dashed lines depict the
bootstrapped 95% confidence interval of the classical lasso.
Bayesian Penalization
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ●
170
● ● ●
● ● ●
165 ● ● ● ● ●
●
●
●
●
●
● ●
● ● ●
160 ● ● ●
●
● ● ● ● ●
●
● ● ● ● ● ●
● ● ● ● ● ● ●
● ● ●
155 ●
●
●
●
● ● ●
● ● ●
150 ● ● ● ● ● ● ●
● ● ●
● ● ●
● ● ● ● ● ●
145
● ● ●
● ●
● ● ● ● ●
140 ● ● ●
● ●
●
●
●
●
●
●
● ● ●
● ● ● ● ●
135 ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ●
130 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ●
125 ● ● ● ● ●
●
●
●
●
●
● ●
● ● ● ● ● ●
● ●
120 ● ● ● ● ● ● ●
●
●
●
● ● ● ● ● ● ● ● ● ●
●
●
●
●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
115 ● ● ● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
110 ● ● ● ● ● ● ● ● ● ● ● ● ● ●
●
●
●
●
● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ●
105 ●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
● ●
● ●
100 ● ● ●
●
●
●
● ● ● ●
● ● ● ●
● ●
95 ● ● ● ● ● ●
●
●
●
● ● ● ● ● ● ● ● ●
●
●
●
● ●
Predictor
●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
90 ● ●
● ●
●
● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ●
85 ● ● ● ● ● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ● ● ●
● ●
● ●
80 ●
●
●
● ●
● ●
● ● ● ●
75 ●
●
● ●
●
●
●
● ●
●
● ●
● ●
●
● ● ● ● ● ● ● ● ● ● ● ● ●
●
● ● ●
70 ● ● ● ● ● ● ● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ●
● ●
65 ● ● ● ● ● ● ● ● ● ● ● ●
●
●
●
●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ●
● ●
60 ●
●
●
●
● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ●
55 ● ● ● ● ● ● ● ● ● ● ●
●
●
●
● ● ● ● ● ● ● ● ● ●
● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ●
50 ● ●
●
●
●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ●
45 ● ●
●
●
●
●
●
●
●
●
●
●
●
● ● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
40 ● ● ● ●
● ●
● ● ● ● ● ● ● ●
●
●
●
●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ●
35 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
●
●
●
● ●
● ●
● ●
30 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
●
●
●
●
● ● ● ●
● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
25 ●
●
●
●
● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ●
20 ● ● ● ●
● ●
● ● ● ● ● ● ● ● ●
●
●
●
● ● ● ● ● ● ● ● ● ●
● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
15 ●
● ●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ●
10 ●
●
● ● ● ● ● ● ● ● ●
●
●
●
● ● ●
● ● ●
● ● ●
5 ●
● ●
●
● ●
●
●
●
●
●
●
●
● ● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●
● ● ●
● ● ● ●
C
B)
B)
e
ic)
B)
re
re
B)
R2
)
p
B)
B)
sic
EB
FB
sic
(FB
(FB
(FB
sic
ho
s' C
xtu
B
(E
(E
t (E
t (F
ss
ixt
as
as
dj.
as
o(
o(
o(
o(
(
es
ion
cla
mi
so
ge
et
ow
et
ge
so
im
(cl
(cl
na
(cl
ors
ss
ss
t's
ss
ss
t's
cn
cn
t
as
o(
Rid
as
Rid
all
lec
rm
rla
La
ull
en
rla
La
en
et
so
ge
tio
dh
pl
pl
sti
sti
ss
nM
ifo
cn
rno
Se
tud
pe
tud
pe
as
Rid
lec
ou
Ela
Ela
ou
La
ize
Un
Hy
pl
Hy
sti
tio
Be
lS
lS
Se
Gr
Gr
lar
Ela
ou
lec
ca
ca
gu
Gr
Lo
Se
Lo
Re
Method
Figure 10: Overview of the included predictors for each method in the crime
application. Points indicate that a predictor is included based on the optimal
credibility interval (CI) from condition 6 in the simulation study. The meth-
ods on the x-axis are ordered such that the method that includes the least
predictors is on the left and the method that includes the most predictors is
on the right.
7 Discussion
The aim of this paper was to provide insights about the different shrinkage
priors that have been proposed for Bayesian penalization to avoid overfitting
of regression models in the case of many predictors. We have reviewed the
literature on shrinkage priors and presented them in a general framework
of scale mixtures of normal distributions to enable theoretical comparisons
between the priors. To model the penalty parameter λ, which is a central
part of the penalized regression model, a full Bayes and an empirical Bayes
approach were employed.
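As an illustration of the scale-mixture framework, the Bayesian lasso prior (Park and Casella, 2008) arises by mixing a normal over an exponentially distributed local variance, which marginally yields a Laplace distribution. The Python sketch below (all names and the choice λ = 2 are illustrative) draws from this mixture and checks it against the known Laplace(0, 1/λ) moments:

```python
import numpy as np

def sample_lasso_prior(lam, n_draws, rng):
    """Draw from the Bayesian lasso prior via its scale-mixture form:
    tau2_j ~ Exponential(rate = lam^2 / 2), beta_j | tau2_j ~ N(0, tau2_j).
    Marginally this gives beta_j ~ Laplace(0, 1/lam)."""
    # numpy parameterizes the exponential by its scale (= 1/rate)
    tau2 = rng.exponential(scale=2.0 / lam**2, size=n_draws)
    return rng.normal(0.0, np.sqrt(tau2))

rng = np.random.default_rng(1)
lam = 2.0
draws = sample_lasso_prior(lam, 200_000, rng)

# Laplace(0, 1/lam) has E|beta| = 1/lam = 0.5 and Var = 2/lam^2 = 0.5
print(np.mean(np.abs(draws)))  # ≈ 0.5
print(np.var(draws))           # ≈ 0.5
```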
Although the various prior distributions differ substantially from each
other, e.g., regarding their tails or convexity, the priors performed very sim-
ilarly in the simulation study in those conditions where p < n. Overall,
the performance was comparable to the classical penalization approaches.
The math performance example clearly showed the advantage of using pe-
nalization to avoid overfitting when p < n. As in the simulation study, the
prediction errors in the math example were comparable across penalization
methods, although the number of included predictors varied across methods.
Finally, although classical penalization is much faster than Bayesian penalization, it does not automatically provide accurate uncertainty estimates, and the bootstrapped confidence intervals obtained for the classical methods were generally much narrower than the Bayesian credibility intervals.
The differences between the methods became more pronounced when
p > n. In condition 6 of the simulation study, the (regularized) horseshoe
and hyperlasso priors performed substantially better than most of the other
shrinkage priors in terms of PMSE. This is most likely due to the fact that the
hyperlasso and (regularized) horseshoe are non-convex global-local shrinkage
priors and are therefore particularly adept at keeping large coefficients large,
while shrinking the small coefficients enough towards zero. Future research
should consider various high-dimensional simulation conditions to further ex-
plore the performance of the shrinkage priors in such settings, for example
by varying the correlations between the predictors. The crime example il-
lustrated the use of the penalization methods further in a p > n situation.
In this example, most Bayesian approaches (the mixture priors being the exception) resulted in smaller prediction errors than the classical approaches. There were also considerable differences between the approaches in terms of which predictors were included.
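The global-local behaviour described above can be seen directly by sampling from the horseshoe prior, in which each coefficient is normal given a half-Cauchy local scale. The sketch below (global scale τ = 1 and all sample sizes are illustrative) shows the characteristic combination of a sharp peak at zero and very heavy tails relative to a standard normal:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

# Horseshoe prior with global scale tau = 1:
# lambda_j ~ Half-Cauchy(0, 1), beta_j | lambda_j ~ N(0, lambda_j^2)
lam = np.abs(rng.standard_cauchy(n))
beta_hs = rng.normal(0.0, 1.0, n) * lam
beta_norm = rng.normal(0.0, 1.0, n)  # standard normal for comparison

# More mass near zero than the normal ...
print(np.mean(np.abs(beta_hs) < 0.1), np.mean(np.abs(beta_norm) < 0.1))
# ... and far heavier tails.
print(np.quantile(np.abs(beta_hs), 0.999), np.quantile(np.abs(beta_norm), 0.999))
```

This is precisely what lets such priors shrink small coefficients strongly while leaving large coefficients nearly untouched.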
An important goal of the shrinkage methods discussed in this paper is
the ultimate selection of relevant variables. Throughout this paper, we have
focused on the use of marginal credibility intervals to do so. However, the use
of marginal credibility intervals to perform variable selection can be problematic, since the marginal intervals can behave differently compared to joint
credibility intervals. This is especially the case for global shrinkage priors,
such as the (regularized) horseshoe prior, since these priors induce shrink-
age on all variables jointly (Piironen et al., 2017). Future research should
investigate whether the variable selection accuracy can be further improved
by using methods that jointly select relevant variables (for example, projec-
tion predictive variable selection; Piironen and Vehtari, 2016, or decoupled
shrinkage and selection; Hahn and Carvalho, 2015).
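The marginal-interval selection rule used throughout is straightforward to state in code: a predictor is included when its equal-tailed posterior credibility interval excludes zero. A minimal Python sketch with simulated posterior draws (all names and numbers are illustrative):

```python
import numpy as np

def ci_select(samples, level=0.95):
    """Flag predictors whose equal-tailed posterior credibility interval
    excludes zero. `samples` has shape (n_draws, n_predictors)."""
    alpha = 1.0 - level
    lower = np.quantile(samples, alpha / 2, axis=0)
    upper = np.quantile(samples, 1 - alpha / 2, axis=0)
    return (lower > 0) | (upper < 0)

# Toy posterior: predictor 0 concentrated away from zero, predictor 1 straddles it.
rng = np.random.default_rng(0)
samples = np.column_stack([
    rng.normal(2.0, 0.3, 4000),
    rng.normal(0.0, 0.3, 4000),
])
print(ci_select(samples))  # → [ True False]
```

Joint selection methods differ exactly here: they act on the full posterior over all coefficients at once rather than one marginal interval at a time.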
Throughout this paper, we focused on the linear regression model. Hope-
fully, the results presented in this paper and the corresponding R package
bayesreg available at https://github.com/sara-vanerp/bayesreg will lead to
an increased use of penalization methods in psychology, because of the im-
proved performance in terms of prediction error and variable selection accu-
racy compared to forward subset selection. The shrinkage priors investigated
here can be applied in more complex models in a straightforward manner.
For example, in generalized linear regression models such as logistic and
Poisson regression models, the only necessary adaptation is to incorporate
a link function in the model. Although not currently available in the R
package, the available Stan model files can be easily adapted to generalized
linear models (GLMs). Additionally, packages such as brms (Bürkner, 2017)
and rstanarm (Stan Development Team, 2016) include several of the shrink-
age priors described here, or allow the user to specify them manually. Both
packages support (multilevel) GLMs, although rstanarm relies on precom-
piled models and is therefore less flexible than brms. Currently, an active
area of research employs Bayesian penalization in latent variable models,
such as factor models (see e.g., Lu et al., 2016; Jacobucci and Grimm, 2018)
and quantile structural equation models (see e.g., Feng et al., 2017). The
characteristics and behaviors of the shrinkage priors presented in this paper
can be a useful first step in solving these more challenging problems.
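As a sketch of the GLM extension mentioned above, for logistic regression only the likelihood changes; the scale-mixture shrinkage prior on the coefficients stays exactly as in the linear case:

```latex
y_i \mid \beta \sim \mathrm{Bernoulli}\!\left(\mathrm{logit}^{-1}(\mathbf{x}_i^{\top}\beta)\right),
\qquad
\beta_j \mid \tau_j^2 \sim \mathcal{N}(0, \tau_j^2),
\qquad
\tau_j^2 \sim \pi(\tau_j^2).
```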
Acknowledgements
This research was supported by a Research Talent Grant from the Nether-
lands Organisation for Scientific Research. We would like to thank Aki Ve-
htari, Carlos Carvalho, Tuomas Kaseva, and Charlie Strauss for providing
helpful comments on an earlier version of this manuscript and pointing out
relevant references.
References
Alhamzawi, R., Yu, K., and Benoit, D. F. (2012). Bayesian adaptive lasso
quantile regression. Statistical Modelling, 12(3):279–297.
Armagan, A., Dunson, D. B., and Lee, J. (2013). Generalized double Pareto
shrinkage. Statistica Sinica.
Azmak, O., Bayer, H., Caplin, A., Chun, M., Glimcher, P., Koonin, S., and
Patrinos, A. (2015). Using big data to understand the human condition:
The Kavli HUMAN project. Big Data, 3(3):173–188.
Bhadra, A., Datta, J., Polson, N. G., and Willard, B. (2016). The horseshoe+
estimator of ultra-sparse signals. Bayesian Analysis.
Bhadra, A., Datta, J., Polson, N. G., and Willard, B. T. (2017). Lasso meets
horseshoe: A survey. arXiv preprint arXiv:1706.10179.
Bhattacharya, A., Pati, D., Pillai, N. S., and Dunson, D. B. (2012). Bayesian
shrinkage. arXiv preprint arXiv:1212.6088.
Bornn, L., Gottardo, R., and Doucet, A. (2010). Grouping priors and the
Bayesian elastic net. arXiv preprint arXiv:1001.4083.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likeli-
hood and its oracle properties. Journal of the American Statistical Asso-
ciation, 96(456):1348–1360.
Feng, X.-N., Wang, Y., Lu, B., and Song, X.-Y. (2017). Bayesian regularized
quantile structural equation models. Journal of Multivariate Analysis,
154:234–248.
Feng, X.-N., Wu, H.-T., and Song, X.-Y. (2015). Bayesian adaptive lasso for
ordinal regression with latent variables. Sociological Methods & Research.
Ghosh, J., Li, Y., and Mitra, R. (2017). On the use of Cauchy prior distribu-
tions for Bayesian logistic regression. Bayesian Analysis.
Ishwaran, H. and Rao, J. S. (2005). Spike and slab variable selection: Fre-
quentist and Bayesian strategies. The Annals of Statistics, 33(2):730–773.
Kyung, M., Gill, J., Ghosh, M., and Casella, G. (2010). Penalized regression,
standard errors, and Bayesian lassos. Bayesian Analysis, 5(2):369–411.
Li, Q. and Lin, N. (2010). The Bayesian elastic net. Bayesian Analysis,
5(1):151–170.
Liu, H., Xu, X., and Li, J. J. (2017). HDCI: High Dimensional Confidence
Interval Based on Lasso and Bootstrap. R package version 1.0-2.
Lu, Z.-H., Chow, S.-M., and Loken, E. (2016). Bayesian factor analysis as
a variable-selection problem: Alternative priors and consequences. Multi-
variate Behavioral Research, 51(4):519–539.
Mulder, J. and Pericchi, L. R. (2018). The matrix-F prior for estimating and
testing covariance matrices. Bayesian Analysis, 13(4):1189–1210.
Park, T. and Casella, G. (2008). The Bayesian lasso. Journal of the American
Statistical Association, 103(482):681–686.
Peltola, T., Havulinna, A. S., Salomaa, V., and Vehtari, A. (2014). Hi-
erarchical Bayesian survival analysis and projective covariate selection in
cardiovascular event risk prediction. In Proceedings of the Eleventh UAI
Conference on Bayesian Modeling Applications Workshop-Volume 1218,
pages 79–88. CEUR-WS. org.
Piironen, J., Betancourt, M., Simpson, D., and Vehtari, A. (2017). Con-
tributed comment on article by van der Pas, Szabó, and van der Vaart.
Bayesian Analysis, 12(4):1264–1266.
Polson, N. G., Scott, J. G., and Windle, J. (2014). The Bayesian bridge.
Journal of the Royal Statistical Society: Series B (Statistical Methodology),
76(4):713–733.
Stan Development Team (2017b). Stan Modeling Language Users Guide and
Reference Manual, version 2.17.0.
Stan Development Team (2017c). The Stan Core Library, version 2.16.0.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Jour-
nal of the Royal Statistical Society. Series B (Methodological), pages 267–
288.
van Erp, S., Mulder, J., and Oberski, D. L. (2018). Prior sensitivity analysis
in default Bayesian structural equation modeling. Psychological Methods,
23(2):363–388.
Vehtari, A., Gabry, J., Yao, Y., and Gelman, A. (2018). loo: Efficient leave-
one-out cross-validation and WAIC for Bayesian models. R package version
2.0.0.
Zhao, S., Gao, C., Mukherjee, S., and Engelhardt, B. E. (2016). Bayesian
group factor analysis with structured sparsity. Journal of Machine Learn-
ing Research, 17(196):1–47.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the
American Statistical Association, 101(476):1418–1429.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the
elastic net. Journal of the Royal Statistical Society: Series B (Statistical
Methodology), 67(2):301–320.