
Journal of Statistical Software
March 2023, Volume 106, Issue 1. doi: 10.18637/jss.v106.i01

Elastic Net Regularization Paths for All Generalized Linear Models

J. Kenneth Tay (Stanford University)
Balasubramanian Narasimhan (Stanford University)
Trevor Hastie (Stanford University)

Abstract
The lasso and elastic net are popular regularized regression models for supervised
learning. Friedman, Hastie, and Tibshirani (2010) introduced a computationally efficient
algorithm for computing the elastic net regularization path for ordinary least squares
regression, logistic regression and multinomial logistic regression, while Simon, Friedman,
Hastie, and Tibshirani (2011) extended this work to Cox models for right-censored data.
We further extend the reach of the elastic net-regularized regression to all generalized
linear model families, Cox models with (start, stop] data and strata, and a simplified
version of the relaxed lasso. We also discuss convenient utility functions for measuring
the performance of these fitted models.

Keywords: lasso, elastic net, ℓ1 penalty, regularization path, coordinate descent, generalized
linear models, survival, Cox model.

1. Introduction
Consider the standard supervised learning framework. We have data of the form (x1, y1), . . . ,
(xn, yn), where yi ∈ R is the target and xi = (xi,1, . . . , xi,p)⊤ ∈ Rp is a vector of potential
predictors. The ordinary least squares (OLS) model assumes that the response can be modeled
as a linear combination of the covariates, i.e., yi = β0 + xi⊤β for some coefficient vector β ∈ Rp
and intercept β0 ∈ R. The parameters are estimated by minimizing the residual sum of squares
(RSS):

$$(\hat{\beta}_0, \hat{\beta}) = \underset{(\beta_0, \beta) \in \mathbb{R}^{p+1}}{\operatorname{argmin}} \; \frac{1}{2n} \sum_{i=1}^{n} \left(y_i - \beta_0 - x_i^\top \beta\right)^2.$$

There has been a lot of research on regularization methods in the last two decades. We
focus on the elastic net (Zou and Hastie 2005) which minimizes the sum of the RSS and a
regularization term which is a mixture of ℓ1 and ℓ2 penalties:

$$(\hat{\beta}_0, \hat{\beta}) = \underset{(\beta_0, \beta) \in \mathbb{R}^{p+1}}{\operatorname{argmin}} \; \frac{1}{2n} \sum_{i=1}^{n} \left(y_i - \beta_0 - x_i^\top \beta\right)^2 + \lambda \left[ \frac{1-\alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right]. \quad (1)$$

In the above, λ ≥ 0 is a tuning parameter and α ∈ [0, 1] is a higher level hyperparameter¹.


We always fit a path of models in λ, but set a value of α depending on the type of prediction
model we want. For example, if we want ridge regression (Hoerl and Kennard 1970) we set
α = 0 and if we want the lasso (Tibshirani 1996) we set α = 1. If we want a sparse model
but are worried about correlations between features, we might set α close to but not equal
to 1. The final value of λ is usually chosen via cross-validation: we select the coefficients
corresponding to the λ value giving smallest cross-validated error as the final model.
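As a concrete illustration of this workflow (a minimal sketch, not an excerpt from the paper;
it uses a dataset shipped with glmnet), one might write:

# Fit regularization paths for several values of alpha, then choose lambda by CV.
library(glmnet)
data("QuickStartExample", package = "glmnet")
x <- QuickStartExample$x
y <- QuickStartExample$y

fit_ridge <- glmnet(x, y, alpha = 0)     # ridge regression
fit_lasso <- glmnet(x, y, alpha = 1)     # lasso
fit_enet  <- glmnet(x, y, alpha = 0.95)  # sparse, but tolerant of correlated features

cvfit <- cv.glmnet(x, y, alpha = 0.95)   # cross-validate lambda for this alpha
coef(cvfit, s = "lambda.min")            # coefficients at the lambda with smallest CV error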
The elastic net can be extended easily to generalized linear models (GLMs, Nelder and Wed-
derburn 1972) and Cox proportional hazards models (Cox 1972). Instead of solving the
minimization problem (1), the RSS term in the objective function is replaced with a negative
log-likelihood term or a negative log partial likelihood term respectively.
The glmnet R package (Friedman et al. 2010) contains efficient functions for computing the
elastic net solution for an entire path of values λ1 > · · · > λm . The minimization problems are
solved via cyclic coordinate descent (Van der Kooij 2007), with the core routines programmed
in Fortran for computational efficiency. Earlier versions of the package contained specialized
Fortran subroutines for a handful of popular GLMs and the Cox model for right-censored
survival data. The package includes functions for performing K-fold cross-validation (CV),
plotting coefficient paths and CV errors, and predicting on future data. The package can
also accept the predictor matrix in sparse matrix format: this is especially useful in certain
applications where the predictor matrix is both large and sparse. In particular, this means
that we can fit unpenalized GLMs with sparse predictor matrices, something the glm function
in the stats package cannot do.
From version 4.1 and later, glmnet (Friedman, Hastie, Tibshirani, Narasimhan, Tay, Simon,
and Yang 2023) is able to compute the elastic net regularization path for all GLMs, Cox
models with (start, stop] data and strata, and a simplified version of the relaxed lasso (Hastie,
Tibshirani, and Tibshirani 2020). The aim of this paper is to give details on how glmnet fits
these new models, and how the user can leverage such functionality. (We note that this
paper builds on two earlier works: Friedman et al. (2010) which gives details on how the
glmnet package computes the elastic net solution for ordinary least squares regression, logistic
regression and multinomial logistic regression, and Simon et al. (2011) which explains how the
package fits regularized Cox models for right-censored data.) In Section 2, we give an overview
of alternative implementations of the elastic net. We then explain how the elastic net penalty
can be applied to all GLMs and how we implement it in software in Section 3. In Section 4, we
detail extensions to Cox models with (start, stop] data and strata. In Section 5, we describe an
implementation of the relaxed lasso implemented in the package, and in Section 6 we describe
the package’s functionality for assessing fitted models. We conclude with a summary and
discussion. For more code examples and greater detail on how to use functions in the glmnet
package, see the vignettes on the official glmnet website https://glmnet.stanford.edu/.

¹ If the square were removed from the ℓ2-norm penalty, it would be more natural to have 1 − α instead of
(1 − α)/2 as its mixing parameter. The factor of 1/2 compensates for the fact that a squared ℓ2-norm penalty
is used, in the sense that the gradient of the penalty with respect to β can be seen as a convex combination
of the ℓ1 and ℓ2 penalty terms. We note also that there is a one-to-one correspondence between these two
parameterizations for the penalty.

2. Related packages
Several other packages currently exist in R (R Core Team 2022) for computing the lasso and
elastic net regularization paths, although it is not always the main focus of the package.
The model families and penalties supported vary widely from package to package: see
Tables 1 and 2 for a summary. The glmnet package effectively supersedes the elasticnet
(Zou and Hastie 2020), glmpath (Park and Hastie 2018) and lars (Hastie and Efron 2022)
packages. The lasso2 package (Lokhorst, Venables, Turlach, and Maechler 2021) solves the
constrained version of the lasso for a single constraint value (as opposed to a path), and is
the only other package that supports GLMs other than the Gaussian, binomial and Poisson
GLMs with canonical links. The ncvreg package (Breheny and Huang 2011) can compute
the solution paths for the lasso penalty, minimax concave penalty (MCP, Zhang 2010) and
smoothly clipped absolute deviation (SCAD) penalty (Fan and Li 2001), but not for the elastic
net penalty. The package which is closest to glmnet in functionality is the penalized package
(Goeman, Meijer, and Chaturvedi 2022). It has the ability to use R’s formula syntax to define
the model and it can compute the fused lasso solution (Tibshirani, Saunders, Rosset, Zhu,
and Knight 2005) as well. However, it is more limited in the model families it supports. The
biglasso package (Zeng and Breheny 2021) specializes in fitting the elastic net regularization
path for big data that cannot be loaded into memory; this is something that glmnet is unable
to do. The bmrm package (Prados 2019) computes the solution for a wide variety of loss
functions with ℓ1 or ℓ2 regularization, but not both ℓ1 and ℓ2 regularization at the same time
and only for a single point on the regularization path. For survival analysis, the ahaz package
(Gorst-Rasmussen and Scheike 2012) can fit semiparametric additive hazards models (which
include the Cox model as a special case) with the lasso penalty but not the elastic net penalty.
It works with both right-censored and (start, stop] data, but not with tied survival times.
Finally, the relaxo package (Meinshausen 2012) computes the solution path for the original
relaxed lasso (Meinshausen 2007) for the OLS model only.

                          GLMs                                      Cox models
             Gaussian  Binomial  Poisson  All other   Right-censored  (start, stop]  Strata
                                          GLMs        data            data
glmnet          ✓         ✓         ✓        ✓              ✓               ✓           ✓
ahaz                                                        ✓               ✓
biglasso        ✓         ✓                                 ✓
bmrm            ✓         ✓
elasticnet      ✓
glmpath         ✓         ✓         ✓                       ✓
lars            ✓
lasso2          ✓         ✓         ✓        ✓
ncvreg          ✓         ✓         ✓                       ✓
penalized       ✓         ✓         ✓                       ✓
relaxo          ✓

Table 1: Table of model families that each R package can fit. Only the glmnet and lasso2
packages can fit GLMs other than the Gaussian, binomial and Poisson GLMs with canonical
links. Only the glmnet and ahaz packages can fit Cox models with (start, stop] data, and
only glmnet can fit stratified Cox models.

             Lasso (ℓ1)  Ridge (ℓ2-squared)  Elastic net  Other penalties  Formula syntax?
glmnet           ✓               ✓                ✓
ahaz             ✓                                         SCAD
biglasso         ✓               ✓                ✓
bmrm             ✓               ✓
elasticnet       ✓               ✓                ✓
glmpath          ✓
lars             ✓
lasso2           ✓                                                               ✓
ncvreg           ✓                                         MCP, SCAD
penalized        ✓               ✓                ✓        Fused lasso           ✓
relaxo           ✓

Table 2: Table of penalties that each R package can fit, as well as whether the package works
with R's formula syntax. MCP refers to the minimax concave penalty, and SCAD refers to
the smoothly clipped absolute deviation penalty. glmnet does not support formula syntax as
the primary use case for regularized models is when the design matrix X is "wide", i.e., many
more features than observations. Formula syntax also does not mix well with column-specific
function arguments such as exclude, penalty.factor, lower.limits and upper.limits.
Implementations of the lasso and elastic net regularization paths exist in other programming
languages as well: we mention some implementations in Python (Van Rossum et al. 2011)
and Julia (Bezanson, Edelman, Karpinski, and Shah 2017), two other languages popular with
statisticians and data scientists. In Python, the popular scikit-learn package (Pedregosa et al.
2011) can fit elastic net regularization paths for the linear and logistic regression models, but
not for generic GLMs. Building on top of scikit-learn, the scikit-survival package (Pölsterl
2020) can fit regularized Cox models for right-censored data, while the relaxed_lasso package
(Vial and Estermann 2020) can fit the relaxed lasso for the linear regression model. The
celer package (Massias, Gramfort, and Salmon 2018) can fit the lasso model for the linear
and logistic regression models (but not the elastic net model), and can also fit the group
lasso (Yuan and Lin 2006) and multi-task lasso (Obozinski, Taskar, and Jordan 2010). In
Julia, the Lasso.jl package (JuliaStats 2022) has the ability to fit an elastic net regularization
path for any GLM, and also has support for other penalties such as the fused lasso. The
MLJLinearModels.jl package (JuliaAI 2023) has the ability to fit models for a variety of loss
and penalty functions, but only for a single value of the λ hyperparameter each time. At the
time of writing, we do not know of any implementation of the relaxed lasso in Julia.
We note in summary that the glmnet R package has the most comprehensive coverage of
models with elastic net regularization, especially with regard to new functionality covered in
this paper. We also note that there are Python and Julia packages which are simply wrappers
to the glmnet R package, allowing users who are more familiar with these programming
languages than R to leverage glmnet for their projects.

3. Regularized generalized linear models

3.1. Overview of generalized linear models


Generalized linear models (GLMs, Nelder and Wedderburn 1972) are a simple but powerful
extension of OLS. A GLM consists of 3 parts:

• A linear predictor: ηi = xi⊤β,

• A link function: ηi = g(µi ), and

• A variance function as a function of the mean: V = V (µi ).

The user gets to specify the link function g and the variance function V . For one-dimensional
exponential families, the family determines the variance function, which, along with the link,
are sufficient to specify a GLM. More generally, modeling can proceed once the link and
variance functions are specified via a quasi-likelihood approach (see McCullagh and Nelder
(1983) for details); this is the approach taken by the quasi-binomial and quasi-Poisson models.
The OLS model is a special case, with link g(x) = x and constant variance function V(µ) = σ²
for some constant σ². More examples of GLMs are listed in Table 3.
The GLM parameter β is determined by maximum likelihood estimation. Unlike OLS, there
is no closed form solution for β̂. Rather, it is typically computed via an iteratively reweighted
least squares (IRLS) algorithm known as Fisher scoring. In each iteration of the algorithm
we make a quadratic approximation to the negative log-likelihood (NLL), reducing the min-
imization problem to a weighted least squares (WLS) problem. For GLMs with canonical
link functions, the negative log-likelihood is convex in β, Fisher scoring is equivalent to the
Newton-Raphson method and is guaranteed to converge to a global minimum. For GLMs
with non-canonical links, the negative log-likelihood is not guaranteed to be convex². Also,
Fisher scoring is no longer equivalent to the Newton-Raphson method and is only guaranteed
to converge to a local minimum.

² It is not true that the negative log-likelihood is always non-convex for non-canonical links. For example,
it can be shown via direct computation that the negative log-likelihood for probit regression is convex in β.

GLM family /         Response type                   Representation in R
Regression type
-------------------------------------------------------------------------------------
Gaussian             R                               gaussian()
Logistic             {0, 1}                          binomial()
Probit               {0, 1}                          binomial(link = "probit")
Quasi-Binomial       {0, 1}                          quasibinomial()
Poisson              N0 = {0, 1, . . . }             poisson()
Quasi-Poisson        N0                              quasipoisson()
Negative binomial    N0                              MASS::negative.binomial(theta = 3)
Gamma                R+ = [0, ∞)                     Gamma()
Inverse Gaussian     R+                              inverse.gaussian()
Tweedie              Depends on variance             statmod::tweedie()
                     power parameter

Table 3: Examples of generalized linear models (GLMs) and their representations in R.
It is easy to fit GLMs in R using the glm function from the stats package; the user can specify
the GLM to be fit using family objects. These objects capture details of the GLM such as
the link function and the variance function. For example, the code below shows the family
object associated with the probit regression model:

R> class(binomial(link = "probit"))

[1] "family"

R> str(binomial(link = "probit"))

List of 12
$ family : chr "binomial"
$ link : chr "probit"
$ linkfun : function (mu)
$ linkinv : function (eta)
$ variance : function (mu)
$ dev.resids: function (y, mu, wt)
$ aic : function (y, n, mu, wt, dev)
$ mu.eta : function (eta)
$ initialize: language { if (NCOL(y) == 1) { ...
$ validmu : function (mu)
$ valideta : function (eta)
$ simulate : function (object, nsim)
- attr(*, "class")= chr "family"

where initialize contains code to set up objects needed for the family. The linkfun,
linkinv, variance and mu.eta functions are used in fitting the GLM, and the dev.resids
function is used in computing the deviance of the resulting model. By passing a class ‘family’
object to the family argument of a glm call, glm has all the information it needs to fit the
model. Here is an example of how one can fit a probit regression model in R:

R> library("glmnet")
R> data("BinomialExample", package = "glmnet")
R> fit <- glm(y ~ x, data = BinomialExample,
+ family = binomial(link = "probit"))

3.2. Extending the elastic net to all GLM families


To extend the elastic net to GLMs, we replace the RSS term in (1) with an NLL term:
$$(\hat{\beta}_0, \hat{\beta}) = \underset{(\beta_0, \beta) \in \mathbb{R}^{p+1}}{\operatorname{argmin}} \; -\frac{1}{n} \sum_{i=1}^{n} \ell\left(y_i, \beta_0 + x_i^\top \beta\right) + \lambda \left[ \frac{1-\alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right], \quad (2)$$

where ℓ(yi, β0 + xi⊤β) is the log-likelihood term associated with observation i. We can apply
the same strategy as for GLMs to minimize this objective function. The key difference is that
instead of solving a WLS problem in each iteration, we solve a penalized WLS problem.
The algorithm for solving (2) for a path of λ values is described in Algorithm 1. Note that in
Step 2(a), we initialize the solution for λ = λk at the solution obtained for λ = λk−1. This is
known as a warm start: since we expect the solution at these two λ values to be similar, the
algorithm will likely require fewer iterations than if we initialized the solution at zero.

Algorithm 1 Fitting GLMs with elastic net penalty.

1. Select a value of α ∈ [0, 1] and a sequence of λ values λ1 > . . . > λm.

2. For k = 1, . . . , m:

   (a) Initialize (β̂_0^(0)(λk), β̂^(0)(λk)) = (β̂_0(λk−1), β̂(λk−1)). For k = 1, initialize
       (β̂_0^(0)(λk), β̂^(0)(λk)) = (0, 0). (Here, (β̂_0(λk), β̂(λk)) denotes the elastic net
       solution at λ = λk.)

   (b) For t = 0, 1, . . . until convergence:

       i. For i = 1, . . . , n, compute η_i^(t) = β̂_0^(t)(λk) + β̂^(t)(λk)⊤xi and µ_i^(t) = g^{-1}(η_i^(t)).

       ii. For i = 1, . . . , n, compute the working responses and weights

           $$z_i^{(t)} = \eta_i^{(t)} + \frac{y_i - \mu_i^{(t)}}{\left.\frac{d\mu_i}{d\eta_i}\right|_{\eta_i^{(t)}}}, \qquad w_i^{(t)} = \frac{\left(\left.\frac{d\mu_i}{d\eta_i}\right|_{\eta_i^{(t)}}\right)^2}{V\left(\mu_i^{(t)}\right)}. \quad (3)$$

       iii. Solve the penalized WLS problem

           $$\left(\hat{\beta}_0^{(t+1)}(\lambda_k), \hat{\beta}^{(t+1)}(\lambda_k)\right) = \underset{(\beta_0,\beta)\in\mathbb{R}^{p+1}}{\operatorname{argmin}} \; \frac{1}{2n} \sum_{i=1}^{n} w_i^{(t)} \left(z_i^{(t)} - \beta_0 - x_i^\top \beta\right)^2 + \lambda_k \left[\frac{1-\alpha}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1\right]. \quad (4)$$
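To make Steps 2(b)i-ii concrete, the sketch below computes the working responses and weights
of Equation (3) in R from a class ‘family’ object. This is a minimal illustration, not glmnet's
internal code; the data and the current estimates beta0 and beta are arbitrary placeholders.

# One quadratic-approximation step of Algorithm 1, written out by hand.
set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)
y <- rpois(100, lambda = 2)
fam <- poisson()                       # the GLM family object
beta0 <- 0; beta <- rep(0, 5)          # current estimates at iteration t
eta <- beta0 + drop(x %*% beta)        # eta_i = beta0 + x_i' beta
mu  <- fam$linkinv(eta)                # mu_i = g^{-1}(eta_i)
dmu <- fam$mu.eta(eta)                 # d mu / d eta evaluated at eta_i
z   <- eta + (y - mu) / dmu            # working responses, Equation (3)
w   <- dmu^2 / fam$variance(mu)        # working weights, Equation (3)
# glmnet would now solve the penalized WLS problem (4) with responses z and weights w.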

3.3. Implementation details


There are two main approaches we can take in implementing Algorithm 1. In the original
implementation of glmnet, the entire algorithm was implemented in Fortran for specific GLM
families. In version 4.0 and later, we added a second implementation which implemented just
the computational bottleneck, the penalized WLS problem in Step 2(b)iii, in Fortran, with
the rest of the algorithm implemented in R. Here are the relative merits and disadvantages
of the second approach compared to the first:
✓ Because the formulas for the working weights and responses in (3) are specific to each
GLM, the first approach requires a new Fortran subroutine for each GLM family. This
is tedious to manage, and also means that users cannot fit regularized models for their
bespoke GLM families. The second approach allows the user to pass a class ‘family’
object to glmnet: the working weights and responses can then be computed in R before
the Fortran subroutine solves the resulting penalized WLS problem.

✓ As written, Algorithm 1 is a proximal Newton algorithm with a constant step size
of 1, and hence it may not converge in certain cases. To ensure convergence, we can
implement step-size halving after Step 2(b)iii: as long as the objective function (2) is
not decreasing, set β̂^(t+1)(λk) ← β̂^(t)(λk) + [β̂^(t+1)(λk) − β̂^(t)(λk)]/2 (with a similar
formula for the intercept). Since the objective function involves a log-likelihood term,
the formula for the objective function differs across GLMs, and the first approach has
to maintain different subroutines for step-size halving. For the second approach, we
can write a single function that takes in the class ‘family’ object (along with other
necessary parameters) and returns the objective function value.

× It is computationally less efficient than the first approach because (i) R is generally slower
than Fortran, and (ii) there is overhead associated with constant switching between R
and Fortran. Some timing comparisons for Gaussian and logistic regression with the
default parameters are presented in Figure 1. The second approach is 10 to 15 times
slower than the first approach.

× Since each GLM family has its own set of Fortran subroutines, the first approach can
employ special computational tricks in each situation, which the second approach cannot.
For example, with family = "gaussian", the predictors can be centered once upfront to
have zero mean and Algorithm 1 can be run ignoring the intercept term.

We stress that both approaches have been implemented in glmnet. Users should use the first
implementation for the most popular GLM families including OLS (Gaussian regression),
logistic regression and Poisson regression (see glmnet’s documentation for the full list of such
families), and use the second implementation for all other GLM families. For example, the
code below shows two equivalent ways to fit a regularized Poisson regression model:

R> data("PoissonExample", package = "glmnet")


R> x <- PoissonExample$x
R> y <- PoissonExample$y
R> fit1 <- glmnet(x, y, family = "poisson")
R> fit2 <- glmnet(x, y, family = poisson())
R> cbind(coef(fit1, s = 0.1), coef(fit2, s = 0.1))

21 x 2 sparse Matrix of class "dgCMatrix"


1 1
(Intercept) 0.097117743 0.096014186
V1 0.600969943 0.601083898
V2 -0.963561440 -0.963849601
...
V18 . .
V19 -0.016139939 -0.016215082
V20 0.011030660 0.010915409

The first call specifies the GLM family as a character string to the family argument, invoking
the first implementation. The second call passes a class ‘family’ object to the family ar-
gument instead of a character string, invoking the second implementation. One would never
run the second call in practice though, as it returns the same result as the first call but takes
longer to fit. (We note that the coefficients returned for the two models differ slightly because
they use different fitting algorithms and have different convergence criteria. We can tighten
the convergence criteria by lowering the thresh argument: when we do this we will have
greater agreement between the two models.)

Figure 1: The top plot compares model fitting times for family = "gaussian" and family
= gaussian() for a range of problem sizes, while the plot below compares that for family =
"binomial" and family = binomial(). Each point is the mean of 5 simulation runs. Note
that both the x and y axes are on the log scale.
The example below fits a regularized quasi-Poisson model that allows for overdispersion, a
family that is only available via the second approach:

R> fit <- glmnet(x, y, family = quasipoisson())
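In the same way, any other GLM family object can be supplied. For example, the sketch below
(theta = 3 is an arbitrary illustrative value) fits a regularized negative binomial model using
the family constructor from MASS listed in Table 3:

library(MASS)
fit_nb <- glmnet(x, y, family = negative.binomial(theta = 3))  # bespoke family object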

3.4. Details on the penalized WLS subroutine


Since the penalized WLS problem in Step 2(b)iii of Algorithm 1 is the computational bottle-
neck, we elected to implement it in Fortran. Concretely, the subroutine solves the problem
$$\underset{(\beta_0,\beta)\in\mathbb{R}^{p+1}}{\operatorname{minimize}} \; \frac{1}{2n} \sum_{i=1}^{n} w_i \left(z_i - \beta_0 - x_i^\top \beta\right)^2 + \lambda_k \sum_{j=1}^{p} \gamma_j \left[\frac{1-\alpha}{2}\beta_j^2 + \alpha|\beta_j|\right] \quad (5)$$

$$\text{subject to } L_j \le \beta_j \le U_j, \quad j = 1, \dots, p.$$

Algorithm 2 Solving penalized WLS (5) with strong rules.

Assume that we are trying to solve for β̂(λk) for some k = 1, . . . , m, and that we have already
computed β̂(λk−1). (If k = 1, set β̂(λk−1) = 0.)

1. Initialize the strong set Sλk = {j : β̂(λk−1)j ̸= 0}.

2. Check the strong rules: for j = 1, . . . , p, include j in Sλk if

   $$\left| x_j^\top \left\{ y - X \hat{\beta}(\lambda_{k-1}) \right\} \right| > \alpha \left[ \lambda_k - (\lambda_{k-1} - \lambda_k) \right] \gamma_j.$$

3. Perform cyclic coordinate descent only for features in Sλk.

4. Check that the KKT conditions hold for each j = 1, . . . , p. If the conditions hold for all
   j, we have the exact solution. If the conditions do not hold for some features, include
   them in the strong set Sλk and go back to Step 3.

This is the same problem as (4) except for two things. First, the penalty placed on each
coefficient βj has its own multiplicative factor γj . (Note that (5) reduces to (4) if γj = 1
for all j, which is the default value for the glmnet function.) This allows the user to place
different penalty weights on the coefficients. An instance where this is especially useful is
when the user always wants to include feature j in the model: in that case the user could
set γj = 0 so that βj is unpenalized. Second, the coefficient βj is constrained to lie in the
interval [Lj , Uj ]. (glmnet’s default is Lj = −∞ and Uj = ∞ for all j, i.e., no constraints on
the coefficients.) One example where these constraints are useful is when we want a certain
βj to always be non-negative or always non-positive.
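As a short sketch (with x and y as in the Poisson example above), the glmnet arguments
corresponding to γj, Lj and Uj in (5) are penalty.factor, lower.limits and upper.limits:

p <- ncol(x)
fit <- glmnet(x, y, family = "poisson",
              penalty.factor = c(0, rep(1, p - 1)),  # gamma_1 = 0: first feature never penalized
              lower.limits = 0)                      # L_j = 0 for all j: non-negative coefficients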
The Fortran subroutine solves (5) by cyclic coordinate descent: see Friedman et al. (2010)
for details. Here we describe one major computational trick that was not covered in that
paper: the application of strong rules (Tibshirani, Bien, Friedman, Hastie, Simon, Taylor,
and Tibshirani 2012).
In each iteration of cyclic coordinate descent, the solver has to loop through all p features to
update the corresponding model coefficients. This can be time-consuming if p is large, and is
potentially wasteful if the solution is sparse: most of the βj would remain at zero. If we know
a priori which predictors will be “active” at the solution (i.e., have βj ̸= 0), we could perform
cyclic coordinate descent on just those coefficients and leave the others untouched. The set of
“active” predictors is known as the active set. Strong rules are a simple yet powerful heuristic
for guessing what the active set is, and can be combined with the Karush-Kuhn-Tucker (KKT)
conditions to ensure that we get the exact solution. (The set of predictors determined by the
strong rules is known as the strong set.) We describe the use of strong rules in solving (5)
fully in Algorithm 2.
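For intuition, a rough sketch of the screening step (Step 2 of Algorithm 2) for the Gaussian
case is given below. The exact scaling of the inner products depends on how the data are
standardized inside glmnet, so this is illustrative only and not the package's internal code.

# Guess the active set at lambda_k from the residual at lambda_{k-1}.
strong_set <- function(X, resid, lambda_k, lambda_km1, alpha, pf = rep(1, ncol(X))) {
  score <- abs(crossprod(X, resid)) / nrow(X)           # |x_j' r| / n for each feature j
  which(score > alpha * (2 * lambda_k - lambda_km1) * pf)
}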
Finally, we note that in some applications, the design matrix X is sparse. In these settings,
computational savings can be reaped by representing X in a sparse matrix format and per-
forming matrix manipulations with this form. To leverage this property of the data, we have
a separate Fortran subroutine that solves (5) when X is in sparse matrix format.
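For example (a sketch, converting a dense matrix purely for illustration), the same glmnet
call accepts a sparse "dgCMatrix":

library(Matrix)
x_sparse <- Matrix(x, sparse = TRUE)                # store the design matrix in sparse format
fit_sparse <- glmnet(x_sparse, y, family = "poisson")  # same interface as with a dense matrix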

3.5. Other useful functionality


In this section, we mention other functionality that the glmnet package provides for fitting
elastic net models.
For fixed α, glmnet solves (2) for a path of λ values. While the user has the option of
specifying this path of values using the lambda option, it is recommended that the user let
glmnet compute the sequence on its own. glmnet uses the arguments passed to it to determine
the value of λmax , defined to be the smallest value of λ such that the estimated coefficients
would be all equal to zero³. The program then computes λmin such that the ratio λmin/λmax
is equal to lambda.min.ratio (default 10⁻² if the number of variables exceeds the number
of observations, 10⁻⁴ otherwise). Model (2) is then fit for nlambda λ values (default 100)
starting at λmax and ending at λmin which are equally spaced on the log scale.
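If needed, these defaults can be overridden; a short sketch (with a generic x and y, and with
arbitrary values for the arguments) is:

fit <- glmnet(x, y, nlambda = 50, lambda.min.ratio = 0.001)  # shorter, coarser lambda path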
In practice, it is common to choose the value of λ via cross-validation (CV). The cv.glmnet
function is a convenience function that runs CV for the λ tuning parameter. The returned
object has class ‘cv.glmnet’, which comes equipped with plot, coef and predict methods.
The plot method produces a plot of CV error against λ (see Figure 2 for an example.)
As mentioned earlier, we prefer to think of α as a higher level hyperparameter whose value
depends on the type of prediction model we want. Nevertheless, the code below shows how
the user can perform CV for α manually using a for loop and extract the value of α giving
the smallest CV error. Care must be taken to ensure that the same CV folds are used across
runs for the CV errors to be comparable.

R> data("QuickStartExample", package = "glmnet")


R> x <- QuickStartExample$x
R> y <- QuickStartExample$y
R> alphas <- c(1, 0.8, 0.5, 0.2, 0)
R> fits <- list()
R> set.seed(1)
R> fits[[1]] <- cv.glmnet(x, y, keep = TRUE)
R> foldid <- fits[[1]]$foldid
R> for (i in 2:length(alphas)) {
+ fits[[i]] <- cv.glmnet(x, y, alpha = alphas[i], foldid = foldid)
+ }
R> best_cvm <- unlist(lapply(fits, function(fit) min(fit$cvm)))
R> alphas[which.min(best_cvm)]

[1] 1

The returned cv.glmnet object contains estimated standard errors for the model CV error
at each λ value. (We note that the method for obtaining these estimates is crude, and the
estimates are generally too small due to correlations across CV folds; see Bates, Hastie, and
Tibshirani (2021).) By default, the predict method returns predictions for the model at the
"lambda.1se" value, i.e., the value of λ that gives the most regularized model such that the
CV error is within one standard error of the minimum. To get predictions at the λ value
which gives the minimum CV error, the s = "lambda.min" argument is passed.
³ We note that when α = 0, λmax is infinite, i.e., all coefficients will always be non-zero for finite λ. To
avoid such extreme values of λmax, if α < 0.001 we return the λmax value for α = 0.001.

Figure 2: Example output for plotting a cv.glmnet object: a plot of CV error against
log(λ). The error bars correspond to ±1 standard error. The left vertical line corresponds to
the minimum error while the right vertical line corresponds to the largest value of λ such that
the CV error is within one standard error of the minimum. The top of the plot is annotated
with the size of the models, i.e., the number of predictors with non-zero coefficient.

R> set.seed(1)
R> cfit <- cv.glmnet(x, y)
R> predict(cfit, x)

1
[1,] -1.33820168
[2,] 2.50786936
[3,] 0.56371947
...
[98,] -2.59959545
[99,] 4.68516614
[100,] -0.75782622

R> predict(cfit, x, s = "lambda.min")

1
[1,] -1.36474902
[2,] 2.56860130
[3,] 0.57058790
...
[98,] -2.74205637
[99,] 5.12461067
[100,] -1.07513903

The glmnet and cv.glmnet functions have an exclude argument which accepts a vector of
indices, indicating which variables should be excluded from the fit, i.e., get zeros for their
coefficients. More intriguingly, the exclude argument can also accept a function. The idea is
that variables can be filtered based on some property before any models are fit. One use case
for this is in genomics, where we often want to exclude features which are too sparse, being
zero for the vast majority of observations. The code below shows the two different ways
of using the exclude argument. They return the same result except for the function call that
produced the fit.

R> set.seed(1)
R> x[sample(seq(length(x)), 1600)] <- 0
R> filter <- function(x, ...) which(colMeans(x == 0) > 0.8)
R> exclude <- filter(x)
R> fit1 <- glmnet(x, y, exclude = exclude)
R> fit2 <- glmnet(x, y, exclude = filter)
R> all.equal(fit1, fit2)

[1] "Component "call": target, current do not match when deparsed"

The real power of assigning a function to the exclude argument is in performing CV. If
a filtering function is given as the exclude argument, cv.glmnet will apply that function
separately to each training fold, hence accounting for any bias that may be incurred by the
variable filtering. This is not possible when assigning a vector of indices to exclude: in that
case, the excluded variables are forced to be the same across folds.
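A minimal sketch of this usage, reusing the filter function defined above, is:

cvfit <- cv.glmnet(x, y, exclude = filter)  # the filter is re-applied within each training fold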
In large data settings, it may take some time to fit the entire sequence of elastic net models.
glmnet and cv.glmnet come equipped with a progress bar which can be displayed with the
argument trace.it = TRUE. This gives the user a sense of how model fitting is progressing.
The glmnet package provides a convenience function bigGlm for fitting a single unpenalized
GLM but allowing all the options of glmnet. In particular, the user can set upper and/or
lower bounds on the coefficients, and can provide the x matrix in sparse matrix format:
options that are not available for the stats::glm function.

R> data("BinomialExample", package = "glmnet")


R> x <- BinomialExample$x
R> y <- BinomialExample$y
R> fit <- bigGlm(x, y, family = "binomial")
R> which(coef(fit) < -1)

[1] 4 5 6 7 9 11 12 13 14 18 19 20 21 22 25 27 28 30

R> fit <- bigGlm(x, y, family = "binomial", lower.limits = -1)


R> which(coef(fit) < -1)

integer(0)

4. Regularized Cox proportional hazards models


We assume the usual survival-analysis framework. Instead of having yi ∈ R as a response, we
have (yi , δi ) ∈ R+ ×{0, 1}. Here yi is the observed time for observation i, and δi = 1 if yi is the
failure time and δi = 0 if it is the right-censoring time. The Cox proportional hazards model
(Cox 1972) is a commonly used model for the relationship between the predictor variables
and survival time. It assumes a semi-parametric form for the hazard function

$$h_i(t) = h(t)\, e^{x_i^\top \beta},$$

where hi (t) is the hazard for observation i at time t, h is the baseline hazard for the entire
population of observations, and β ∈ Rp is the vector of coefficients to be estimated. Let
t1 < · · · < tm denote the unique failure times and let j(i) denote the index of the observation
failing at time ti . (Assume for the moment that the yi ’s are unique.) If yj ≥ ti , we say that
observation j is at risk at time ti. Let Ri denote the risk set at time ti. The coefficient vector
β is estimated by maximizing the partial likelihood

$$L(\beta) = \prod_{i=1}^{m} \frac{e^{x_{j(i)}^\top \beta}}{\sum_{j \in R_i} e^{x_j^\top \beta}}. \quad (10)$$

It is the conditional likelihood that the failure occurs for observation j(i) given all the ob-
servations at risk. Maximizing the partial likelihood is equivalent to minimizing the negative
log partial likelihood

$$-\ell(\beta) = \frac{2}{n} \sum_{i=1}^{m} \left[ -x_{j(i)}^\top \beta + \log\left( \sum_{j \in R_i} e^{x_j^\top \beta} \right) \right]. \quad (11)$$

We put a negative sign in front of ℓ so that ℓ denotes the log partial likelihood, and the scale
factor 2/n is included for convenience. Note also that the model does not have an intercept
term β0, as it cancels out in the partial likelihood. Simon et al. (2011) proposed an elastic-net
regularization path version for the Cox model, as well as Algorithm 3 for solving the
minimization problem.

Algorithm 3 Fitting Cox models with elastic net penalty.

1. Select a value of α ∈ [0, 1] and a sequence of λ values λ1 > . . . > λm. Define β̂(λ0) = 0.

2. For ℓ = 1, . . . , m:

   (a) Initialize β̂(λℓ) = β̂(λℓ−1).

   (b) For t = 0, 1, . . . until convergence (outer loop):

       i. For k = 1, . . . , n, compute η_k^(t) = β̂(λℓ)⊤xk.

       ii. For k = 1, . . . , n, compute

           $$\ell'\left(\eta^{(t)}\right)_k = \delta_k - e^{\eta_k^{(t)}} \sum_{i \in C_k} \frac{1}{\sum_{j \in R_i} e^{\eta_j^{(t)}}}, \quad (6)$$

           $$\ell''\left(\eta^{(t)}\right)_{k,k} = -\sum_{i \in C_k} \frac{e^{\eta_k^{(t)}} \sum_{j \in R_i} e^{\eta_j^{(t)}} - \left(e^{\eta_k^{(t)}}\right)^2}{\left(\sum_{j \in R_i} e^{\eta_j^{(t)}}\right)^2}, \quad (7)$$

           $$w_k^{(t)} = -\ell''\left(\eta^{(t)}\right)_{k,k}, \quad (8)$$

           $$z_k^{(t)} = \eta_k^{(t)} - \frac{\ell'\left(\eta^{(t)}\right)_k}{\ell''\left(\eta^{(t)}\right)_{k,k}}, \quad (9)$$

           where Ck is the set of failure times i such that ti < yk (i.e., times for which
           observation k is still at risk).

       iii. Solve the penalized WLS problem (inner loop):

           $$\hat{\beta}(\lambda_\ell) = \underset{\beta \in \mathbb{R}^{p}}{\operatorname{argmin}} \; \frac{1}{2} \sum_{k=1}^{n} w_k^{(t)} \left(z_k^{(t)} - x_k^\top \beta\right)^2 + \lambda_\ell \left[\frac{1-\alpha}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1\right].$$
Algorithm 3 has the same structure as Algorithm 1 except for different formulas for computing
the working responses and weights. (We note that these formulas implicitly approximate the
Hessian of the log partial likelihood by a diagonal matrix with the Hessian’s diagonal entries.)
This means that we can leverage the fast implementation of the penalized WLS problem in
Section 3.4 for an efficient implementation of Algorithm 3. (As a small benefit, it also means
that we can fit regularized Cox models when the design matrix X is sparse.) Such a model
can be fit with glmnet by specifying family = "cox". The response provided needs to be a
‘Surv’ object from the survival package (Therneau 2023).

R> data("CoxExample", package = "glmnet")


R> glmnet(CoxExample$x, CoxExample$y, family = "cox")

Call: glmnet(x = CoxExample$x, y = CoxExample$y, family = "cox")

Df %Dev Lambda
1 0 0.00 0.236800
2 1 0.18 0.215700
3 3 0.60 0.196600
...
47 24 5.87 0.003279
48 24 5.87 0.002988
49 24 5.87 0.002722

The computation of these wk's and zk's can be a computational bottleneck if not implemented
carefully: since the Ck and Ri have O(n) elements, a naive implementation takes O(n²) time.
Simon et al. (2011) exploit the fact that, once the observations are sorted in order of the
observed times yi , the risk sets are nested (Ri+1 ⊆ Ri for all i) and the wk ’s and zk ’s can be
computed in O(n) time.
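A small sketch of this idea for right-censored data (placeholder data, ignoring ties): once the
observations are sorted by observed time, each risk-set sum is a reverse cumulative sum.

set.seed(1)
n <- 6
eta <- rnorm(n)                          # current linear predictors
ty  <- sort(rexp(n))                     # observed times, sorted in increasing order
risk_sums <- rev(cumsum(rev(exp(eta))))  # risk_sums[i] = sum of exp(eta[j]) over {j : ty[j] >= ty[i]}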
If our data contains tied observed times, glmnet uses the Breslow approximation of the
partial likelihood for ties (Breslow 1972) and maximizes the elastic net-regularized version of
this approximation instead. See Simon et al. (2011) for details.

4.1. Extending regularized Cox models to (start, stop] data


Instead of working with right-censored responses, the Cox model can be extended to work
with responses which are a pair of times (called the “start time” and “stop time”), with the
possibility of the stop time being censored. This is an instantiation of the counting process
framework proposed by Andersen and Gill (1982), and the right-censored data set-up is a
special case with the start times all being equal to zero.
As noted in Therneau and Grambsch (2000), (start, stop] responses greatly increase the
flexibility of the Cox model, allowing for
• Time-dependent covariates,
• Time-dependent strata,
• Left truncation,
• Multiple time scales,
• Multiple events per subject,
• Independent increment, marginal, and conditional models for correlated data, and
• Various forms of case-cohort models.
From a data analysis viewpoint, this extension amounts to requiring just one more variable:
the time variable is replaced by (start, stop] variables, with (start, stop] indicating
the interval where the unit is at risk. The survival package provides the function tmerge to
aid in the creation of such datasets.
For this more general setup, inference for β can proceed as before. The formulas for the
partial likelihood and negative log partial likelihood ((10) and (11)) remain the same; what
changes is the definition of what it means for an observation to be at risk at time ti . If we
let (y1j , y2j ] denote the (start, stop] times for observation j, then observation j is at risk at
time ti if and only if ti ∈ (y1j , y2j ]. Similarly, the elastic net-regularized version of the Cox
model for (start, stop] data can be fitted using Algorithm 3 with this new definition of what
it means for an observation to be at risk at a failure time.
With (start, stop] data, it is no longer true that the risk sets are nested. For example, if
ti < y1j < ti+1 < y2j, then j ∈ Ri+1 but j ∉ Ri. However, as Algorithm 4 shows, it is still
possible to compute the working responses and weights in O(n log n) time. In fact, only the
ordering of observations (Step 1) requires O(n log n) time: the rest of the algorithm requires
just O(n) time. Since the ordering of observations never changes, the results of Step 1 can be
cached, meaning that only the first run of Algorithm 4 requires O(n log n) time, and future
runs just need O(n) time.
The differences between right-censored data and (start, stop] data for Cox models are hidden
from the user, in that the function call for (start, stop] data is exactly the same as that
for right-censored data. The difference is in the type of ‘Surv’ object that is passed for the
response y. glmnet checks for the ‘Surv’ object type before routing to the correct internal
subroutine.
We illustrate this new functionality on the bladder2 dataset from the survival package. We
will fit a regularized Cox model for the time to recurrence based on the treatment type (rx),
the initial number of tumors (number) and the size of the largest initial tumor (size).

Algorithm 4 Computing working responses and weights for Algorithm 3.

Input: ηj = xj⊤β̂ (where β̂ is the current estimate for β), (y1j, y2j], δj for j = 1, . . . , n. For
simplicity, assume that the observations are ordered by ascending stop time, i.e., y21 < · · · <
y2n. As before, let t1 < · · · < tm denote the failure times in increasing order.

1. Get the ordering for the observations according to start times. Let start(j) denote the
   index for the observation with the jth earliest start time.

2. Compute the risk set sums RSSi = Σ_{j∈Ri} e^{ηj}, i = 1, . . . , m using the following steps:

   (a) For j = 1, . . . , n, set RSSj ← Σ_{ℓ=j}^{n} e^{ηℓ}.
   (b) Set curr ← 0, i ← m, start_idx ← n.
   (c) While i > 0 and start_idx > 0:
       i. If y1,start(start_idx) < ti, set RSSj(i) ← RSSj(i) − curr and i ← i − 1.
       ii. If not, set curr ← curr + e^{η_start(start_idx)} and start_idx ← start_idx − 1.
   (d) Take just the elements of RSS corresponding to death times, i.e., set RSSi ← RSSj(i).

3. Compute the partial sums RSKk = Σ_{i∈Ck} 1/RSSi, k = 1, . . . , n using the following
   steps:

   (a) For i = 1, . . . , m, set RDi ← Σ_{ℓ=1}^{i} 1/RSSℓ. Set RD0 ← 0.
   (b) For k = 1, . . . , n, set Dk ← Σ_{ℓ=1}^{k} δℓ.
   (c) For k = 1, . . . , n, set RSKk ← RD_{Dk}.
   (d) Set curr ← 0, i ← 1, start_idx ← 1.
   (e) While i ≤ m and start_idx ≤ n:
       i. If y1,start(start_idx) < ti, set RSKstart(start_idx) ← RSKstart(start_idx) − curr and
          start_idx ← start_idx + 1.
       ii. If not, set curr ← curr + 1/RSSi and i ← i + 1.

4. Compute the partial sums RSKSQk = Σ_{i∈Ck} 1/RSSi², k = 1, . . . , n in a similar manner
   as Step 3.

5. Compute ℓ′(η)k and ℓ′′(η)k,k using the formulas (6) and (7):

   ℓ′(η)k = δk − e^{ηk} · RSKk,    ℓ′′(η)k,k = (e^{ηk})² · RSKSQk − e^{ηk} · RSKk.

6. Compute the working responses and weights using the formulas (8) and (9).

R> library("survival")
R> head(bladder2)

id rx number size start stop event enum


1 1 1 1 3 0 1 0 1
2 2 1 2 1 0 4 0 1

3 3 1 1 1 0 7 0 1
4 4 1 5 1 0 10 0 1
5 5 1 4 1 0 6 1 1
6 5 1 4 1 6 10 0 2

We start by creating the (start, stop] response with survival::Surv:

R> y <- with(bladder2, Surv(start, stop, event))


R> head(y)

[1] (0, 1+] (0, 4+] (0, 7+] (0,10+] (0, 6] (6,10+]

The code for fitting the regularized Cox model is then exactly the same as the code we would
have used if the response was right-censored:

R> x <- as.matrix(bladder2[, 2:4])


R> glmnet(x, y, family = "cox")

Call: glmnet(x = x, y = y, family = "cox")

Df %Dev Lambda
1 0 0.00 0.194800
2 1 0.34 0.177500
3 1 0.61 0.161700
...
41 3 2.67 0.004715
42 3 2.67 0.004296
43 3 2.68 0.003914

4.2. Stratified Cox models


An extension of the Cox model is to allow for strata. These strata divide the units into
disjoint groups, with each group having its own baseline hazard function but having the same
values of β. Specifically, if the units are divided into K strata, then the stratified Cox model
assumes that a unit in stratum k has the hazard function

$$h_i(t) = h_k(t)\, e^{x_i^\top \beta},$$

where hk (t) is the shared baseline hazard for all units in stratum k. In several applications,
allowing different subgroups to have different baseline hazards approximates reality more
closely. For example, it might be reasonable to have different baseline hazards based on
gender in clinical trials, or a separate baseline for each center in multi-center trials.
In this setting, the negative log partial likelihood is

$$\ell(\beta) = \sum_{k=1}^{K} \ell_k(\beta),$$

where ℓk (β) is exactly (11) but considering just the units in stratum k. Since the negative log
partial likelihood decouples across strata (conditional on β), regularized versions of stratified
Cox models can be fit using a slightly modified version of Algorithm 3.
To fit an unpenalized stratified Cox model, the survival package has a special strata function
that allows users to specify the strata variable in formula syntax. Since glmnet does not work
with formulas, we needed a different approach for specifying strata. To fit regularized stratified
Cox models in glmnet, the user needs to add a strata attribute to the response y. glmnet
checks for the presence of this attribute and if it is present, it fits a stratified Cox model.
We note that the user cannot simply add the attribute manually because R drops attributes
when subsetting vectors. Instead, the user should use the stratifySurv function to add
the strata attribute. (stratifySurv creates an object of class ‘stratifySurv’ that inherits
from the class ‘Surv’, ensuring that glmnet can reassign the strata attribute correctly after
any subsetting.) The code below shows an example of how to fit a regularized stratified Cox
model with glmnet; there are a total of 1,000 observations, with the first 500 belonging to
the first stratum and the rest belonging to the second stratum.
R> data("CoxExample", package = "glmnet")
R> x <- CoxExample$x
R> y <- CoxExample$y
R> strata <- c(rep(1, 500), rep(2, 500))
R> y2 <- stratifySurv(y, strata)
R> glmnet(x, y2, family = "cox")

Call: glmnet(x = x, y = y2, family = "cox")

Df %Dev Lambda
1 0 0.00 0.235500
2 2 0.23 0.214600
3 3 0.69 0.195500
...
47 24 6.57 0.003262
48 24 6.57 0.002972
49 25 6.58 0.002708

4.3. Plotting survival curves


The beauty of the Cox partial likelihood is that the baseline hazard, h0 (t), is not required for
inference on the model coefficients β. However, the estimated hazard is often of interest to
users. The survival package already has a well-established survfit method that can produce
estimated survival curves from a fitted Cox model. glmnet implements a survfit method
for regularized Cox models fit by glmnet by creating the ‘coxph’ object corresponding to the
model and calling survival::survfit.
The code below is an example of calling survfit for ‘coxnet’ objects for a particular value
of the λ tuning parameter (in this case, λ = 0.05). Note that we had to pass the original
design matrix x and response y to the survfit call: they are needed for survfit.coxnet to
reconstruct the required ‘coxph’ object. The survival curves are computed for the individuals
represented in newx: we get one curve per individual, as seen in Figure 3.
Figure 3: An illustration of the plotted ‘survfit’ object. One survival curve is plotted for
each individual represented in the newx argument.

R> set.seed(1)
R> nobs <- 100; nvars <- 15
R> x <- matrix(rnorm(nobs * nvars), nrow = nobs)
R> ty <- rep(rexp(nobs / 5), each = 5)
R> tcens <- rbinom(n = nobs, prob = 0.3, size = 1)
R> y <- Surv(ty, tcens)
R> fit <- glmnet(x, y, family = "cox")
R> sf_obj <- survfit(fit, s = 0.05, x = x, y = y, newx = x[1:2, ])
R> plot(sf_obj, col = 1:2, mark.time = TRUE, pch = "12")

The survfit method is available for Cox models fitted by cv.glmnet as well. By default, the
survival curves are computed for the lambda.1se value of the λ hyperparameter. The user
can use the code below to compute the survival curve at the lambda.min value:

R> set.seed(1)
R> cfit <- cv.glmnet(x, y, family = "cox", nfolds = 5)
R> survfit(cfit, s = "lambda.min", x = x, y = y, newx = x[1:2, ])

Call: survfit.cv.glmnet(formula = cfit, s = "lambda.min", x = x, y = y,


newx = x[1:2, ])

n events median
1 100 33 1.6
2 100 33 1.6

5. The relaxed lasso


Due to the regularization penalty, the lasso tends to shrink the coefficient vector β̂ toward
zero. The relaxed lasso (Meinshausen 2007) was introduced as a way to undo the shrinkage
inherent in the lasso estimator. Through extensive simulations, Hastie et al. (2020) conclude
that the relaxed lasso performs well in terms of predictive performance across a range of
scenarios. It was found to perform just as well as the lasso in low signal-to-noise (SNR)
scenarios and nearly as well as best subset selection in high SNR scenarios. It also has a
considerable advantage over best subset and forward stepwise regression when the number of
variables, p, is large. In this section, we describe the simplified version of the relaxed lasso
proposed by Hastie et al. (2020) and give details on how it is implemented in glmnet.
For simplicity, we describe the method for the OLS setting (family = "gaussian") and for
the lasso (α = 1). For a given tuning parameter λ, let β̂^lasso(λ) ∈ Rp denote the lasso
estimator for this value of λ. Let Aλ denote the active set of the lasso estimator, and let
β̂^LS_{Aλ} ∈ R^{|Aλ|} denote the OLS coefficients obtained by regressing y on X_{Aλ} (i.e., the subset of
columns of X which correspond to features in the active set Aλ). Let β̂^LS(λ) ∈ Rp denote
the OLS coefficients β̂^LS_{Aλ} padded with zeros to match the zeros of the lasso solution. The
(simplified version of the) relaxed lasso estimator is given by

$$\hat{\beta}^{\mathrm{relax}}(\lambda, \gamma) = \gamma\, \hat{\beta}^{\mathrm{lasso}}(\lambda) + (1-\gamma)\, \hat{\beta}^{\mathrm{LS}}(\lambda),$$

where γ ∈ [0, 1] is a hyperparameter, similar to α. In other words, the relaxed lasso estimator
is a convex combination of the lasso estimator and the OLS estimator for the lasso’s active
set.
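The definition can be written out by hand in a few lines; the sketch below mirrors the formula
above (lam and gam are arbitrary illustrative values, and the data are reloaded so that the
snippet is self-contained).

data("QuickStartExample", package = "glmnet")
x <- QuickStartExample$x
y <- drop(QuickStartExample$y)
lam <- 0.1; gam <- 0.5
lasso_fit <- glmnet(x, y, alpha = 1)
b_lasso <- as.numeric(coef(lasso_fit, s = lam))[-1]  # lasso coefficients (intercept dropped)
active  <- which(b_lasso != 0)                       # active set A_lambda
b_ls    <- numeric(ncol(x))
b_ls[active] <- coef(lm(y ~ x[, active]))[-1]        # OLS on the active columns, padded with zeros
b_relax <- gam * b_lasso + (1 - gam) * b_ls          # convex combination of the two estimators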
The relaxed lasso can be fit with the glmnet function in glmnet by setting the argument
relax = TRUE:

R> data("QuickStartExample", package = "glmnet")


R> x <- QuickStartExample$x
R> y <- QuickStartExample$y
R> fitr <- glmnet(x, y, relax = TRUE)
R> fitr

Call: glmnet(x = x, y = y, relax = TRUE)


Relaxed

Df %Dev %Dev R Lambda


1 0 0.00 0.00 1.63100
2 2 5.53 58.90 1.48600
3 2 14.59 58.90 1.35400
...
65 20 91.32 91.32 0.00423
66 20 91.32 91.32 0.00386
67 20 91.32 91.32 0.00351

When called with this option, glmnet first runs the lasso (Algorithm 1 with α = 1) to obtain
the lasso estimates β̂ lasso (λk ) and the active sets Aλk for a path of hyperparameter values

λ1 > · · · > λm . It then goes down this sequence of hyperparameter values again, fitting the
unpenalized model of y on each X_{Aλk} to obtain β̂^LS(λk). The refitting is done in an efficient
manner. For example, if Aλℓ = Aλk, glmnet does not fit the OLS model for λℓ but sets
β̂^LS(λℓ) = β̂^LS(λk).
The returned object has a predict method which the user can use to make predictions on
future data. As an example, the code below returns the relaxed lasso predictions for the
training data at λ = 1 and γ = 0.5 (the default value is gamma = 1, i.e., the lasso estimator):

R> predict(fitr, x, s = 1, gamma = 0.5)

1
[1,] 0.07173775
[2,] 2.11579157
[3,] 0.89895772
...
[98,] -0.92082539
[99,] 1.92925201
[100,] 1.11307696

The cv.glmnet function works with the relaxed lasso as well. When cross-validating a relaxed
lasso model, cv.glmnet provides optimal values for both the lambda and gamma parameters.
We note that we can consider as many values of the γ hyperparameter as we like in CV. Most
of the computational time is spent obtaining β̂ lasso (λ) and β̂ LS (λ); once we have computed
them β̂ relax (λ, γ) is simply a linear combination of the two. By default, cv.glmnet performs
CV for gamma = c(0, 0.25, 0.5, 0.75, 1).
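For example (a sketch with an arbitrary grid), the grid of γ values can be changed directly in
the cv.glmnet call:

cvfit_relax <- cv.glmnet(x, y, relax = TRUE, gamma = c(0, 0.5, 1))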
As with non-relaxed fits, the output of cv.glmnet has a plot method which plots the CV error
against λ (see Figure 4). Each line corresponds to one value of the gamma hyperparameter.
In the exposition above we have focused on the family = "gaussian" case. Relaxed fits
are also available for all other model families, i.e., any other family argument.
Instead of fitting the OLS model of the response on the active set to obtain the relaxed fit,
glmnet fits the unpenalized model for that model family on the active set.
We note that while the relaxation can be applied for α values smaller than 1, we do not
recommend doing this. Relaxation is typically applied to obtain sparser models. It achieves
this by undoing shrinkage of coefficients in the active set toward zero, allowing the model to
have more freedom to fit the response. Together with CV on λ and γ, this often gives us a
model that is sparser than the lasso. Selecting α smaller than 1 results in a larger active set
than that for the lasso, working against the goal of obtaining a sparser model.

5.1. Application to forward stepwise regression


One use case for the relaxed fit is to provide a faster version of forward stepwise regression.
When the number of variables p is large, forward stepwise regression can be tedious since it
only adds one variable at a time and at each step, it needs to try all predictor variables that
are not already included in the model to find the best one to be added. On the other hand,
because the lasso solves a convex problem, it can identify good candidate sets of variables
over 100 values of the λ hyperparameter even when p is in the tens of thousands.

Figure 4: Example output for plotting a cv.glmnet object when relax = TRUE: a plot of
CV error against log(λ). Each line corresponds to one value of the gamma hyperparameter.

In a case like this, one can run cv.glmnet and fit the OLS model for a sequence of selected
variable sets.

R> set.seed(1)
R> cfitr <- cv.glmnet(x, y, gamma = 0, relax = TRUE)
R> cfitr

Call: cv.glmnet(x = x, y = y, gamma = 0, relax = TRUE)

Measure: Mean-Squared Error

Gamma Index Lambda Index Measure SE Nonzero


min 0 1 0.3354 18 1.007 0.1269 7
1se 0 1 0.4433 15 1.112 0.1298 7

6. Assessing models
After fitting elastic net models with glmnet, we often want to assess their performance on a
set of evaluation or test data. After deciding on the performance measure, for each model
in the fitted sequence (indexed by the value of λ and possibly γ for relaxed fits) we have to
build a matrix of predictions and compute the performance measure for it.
cv.glmnet does some of this evaluation automatically. In performing CV, cv.glmnet com-
putes the pre-validated fits (Tibshirani and Efron 2002), that is the model’s predictions of
the linear predictor on the held-out fold, and then computes the performance measure with

these pre-validated fits. The performance measures are recorded in the cvm element of the
returned cv.glmnet and are used to make the CV plot when the plot method is called.
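For reference (a small sketch; component names as documented for cv.glmnet), the measure
and its standard error at each λ can be read off the returned object directly:

cvfit <- cv.glmnet(x, y)
head(cbind(lambda = cvfit$lambda, cvm = cvfit$cvm, cvsd = cvfit$cvsd))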
glmnet supports a variety of performance measures depending on the model family: the full
list of measures can be seen via the call glmnet.measures(). We demonstrate some of this
functionality using the spam dataset from the UCI Machine Learning Repository (Dua and
Graff 2019):

R> data_url <- paste0("https://archive.ics.uci.edu/ml/",


+ "machine-learning-databases/spambase/spambase.data")
R> df <- read.csv(data_url, header = FALSE)
R> x <- as.matrix(df[, 1:57])
R> y <- df[, 58]

The user can change the performance measure computed in CV by specifying the type.measure
argument. For example, the code below computes the area under the curve (AUC) of the
pre-validated fits instead of the deviance which is the default for family = "binomial":

R> set.seed(1)
R> cv.glmnet(x, y, family = "binomial", type.measure = "auc")

Call: cv.glmnet(x = x, y = y, type.measure = "auc", family = "binomial")

Measure: AUC

Lambda Index Measure SE Nonzero


min 0.0000475 90 0.9726 0.002275 56
1se 0.0028463 46 0.9704 0.002531 52

More generally, model assessment can be performed using the assess.glmnet function. The
user can pass a class ‘glmnet’ object, a class ‘cv.glmnet’ object, or a matrix of predictions
to assess.glmnet along with the true response values. The code below shows how one
can use assess.glmnet with a class ‘glmnet’ object. We train the model on the design
matrix x[itrain, ] and response y[itrain], and evaluate the model’s performance
on x[-itrain, ] and y[-itrain].

R> set.seed(1)
R> itrain <- sample(nrow(x), size = nrow(x) / 2)
R> fit <- glmnet(x[itrain, ], y[itrain], family = "binomial")
R> err <- assess.glmnet(fit, newx = x[-itrain, ], newy = y[-itrain])
R> names(err)

[1] "deviance" "class" "auc" "mse" "mae"

R> err$deviance

s0 s1 s90 s91
1.3343171 1.3084006 ... 0.4625480 0.4625240
attr(,"measure")
[1] "Binomial Deviance"

By default assess.glmnet will return all possible performance measures for the model family.
If a class ‘glmnet’ object is passed to assess.glmnet, it returns one performance measure
value for each model in the λ sequence. The user can get the performance measure values
at other values of the λ and γ hyperparameters using the s and gamma arguments as in the
predict method:

R> assess.glmnet(fit, newx = x[-itrain, ], newy = y[-itrain], s = 0.1)$auc

[1] 0.884751
attr(,"measure")
[1] "AUC"

If a class ‘cv.glmnet’ object is passed to assess.glmnet, it returns the performance
measure value at the lambda.1se value of the λ hyperparameter. (Again, the user can get the
performance measure values at other values of the hyperparameters using the s and gamma
arguments.)

R> set.seed(1)
R> cfit <- cv.glmnet(x[itrain, ], y[itrain], family = "binomial")
R> cerr <- assess.glmnet(cfit, newx = x[-itrain, ], newy = y[-itrain])
R> cerr$auc

[1] 0.9700061
attr(,"measure")
[1] "AUC"

Finally, the code below shows how to use assess.glmnet with a matrix of predictions. Note
that if a matrix of predictions is passed, the user has to specify the model family via the
family argument since assess.glmnet cannot infer it from the inputs. (If no family
argument is passed, the default value is family = "gaussian".)

R> pred <- predict(fit, newx = x[-itrain, ])
R> err <- assess.glmnet(pred, newy = y[-itrain], family = "binomial")
R> err$auc

 [1] 0.5000000 0.7789992 0.7789992 ...
[91] 0.9719540 0.9719580
attr(,"measure")
[1] "AUC"

One major use of assess.glmnet is to avoid running CV multiple times to get the values for
different performance measures. By default, cv.glmnet will only return a single performance
measure. However, if the user specifies keep = TRUE in the cv.glmnet call, the pre-validated
fits are returned as well. The user can then pass the pre-validated matrix to assess.glmnet.
The code below is an example of how to do this. (The keep argument is FALSE by default
as the pre-validated matrix is large when the number of training observations is large, thus
inflating the size of the returned object.)

Figure 5: An example of a receiver operating characteristic (ROC) curve that can be plotted
with the output of roc.glmnet.

R> set.seed(1)
R> cfit <- cv.glmnet(x, y, family = "binomial", keep = TRUE)
R> err <- assess.glmnet(cfit$fit.preval, newy = y, family = "binomial")
R> err$mae

       s0        s1        s2           s88       s89       s90
0.9548564 0.9431652 0.9274209 ... 0.2415963 0.2415069 0.2414905
attr(,"measure")
[1] "Mean Absolute Error"

We have two additional functions for assessing test performance that are unique to binomial
data. As the function names suggest, roc.glmnet and confusion.glmnet produce the
receiver operating characteristic (ROC) curve and the confusion matrix, respectively, for the
test data. Here is an example of how to use roc.glmnet to produce an ROC curve for test
data (output in Figure 5):

R> set.seed(1)
R> cfit <- cv.glmnet(x[itrain, ], y[itrain], family = "binomial")
R> roc_val <- roc.glmnet(cfit, newx = x[-itrain, ], newy = y[-itrain],
+ s = "lambda.min")
R> plot(roc_val, main = "ROC Curve on Test Data")

Here is an example of the output the user gets from confusion.glmnet:

R> data("MultinomialExample", package = "glmnet")


R> x <- MultinomialExample$x
R> y <- MultinomialExample$y
R> set.seed(101)
R> itrain <- sample(1:500, 400, replace = FALSE)
R> cfit <- cv.glmnet(x[itrain, ], y[itrain], family = "multinomial")
R> cnf <- confusion.glmnet(cfit, newx = x[-itrain, ], newy = y[-itrain])
R> print(cnf)

         True
Predicted  1  2  3 Total
    1     13  6  4    23
    2      7 25  5    37
    3      4  3 33    40
    Total 24 34 42   100

Percent Correct: 0.71

7. Discussion
We have shown how to extend the use of the elastic net penalty to all GLM model families,
Cox models with (start, stop] data and with strata, and to a simplified version of the relaxed
lasso. We have also discussed how users can use the glmnet package to assess the fit of
these elastic net models. These new capabilities are available in version 4.1 and later of the
glmnet package (Friedman et al. 2023) on the Comprehensive R Archive Network (CRAN) at
https://CRAN.R-project.org/package=glmnet.

Acknowledgments
We would like to thank Robert Tibshirani for helpful discussions and comments. Balasubramanian
Narasimhan’s work is funded by Stanford Clinical & Translational Science Award grant
5UL1TR003142-02 from the NIH National Center for Advancing Translational Sciences
(NCATS). Trevor Hastie was partially supported by grants DMS-2013736 and IIS-1837931
from the National Science Foundation, and grant 5R01 EB 001988-21 from the National
Institutes of Health.

References

Andersen PK, Gill RD (1982). “Cox’s Regression Model for Counting Processes: A Large
Sample Study.” The Annals of Statistics, 10(4), 1100–1120. doi:10.1214/aos/1176345976.
Bates S, Hastie T, Tibshirani R (2021). “Cross-Validation: What Does It Estimate and
How Well Does It Do It?” Technical Report 2104.00673, arXiv.org E-Print Archive. doi:
10.48550/arxiv.2104.00673.
Bezanson J, Edelman A, Karpinski S, Shah VB (2017). “Julia: A Fresh Approach to Numerical
Computing.” SIAM Review, 59(1), 65–98. doi:10.1137/141000671.
Breheny P, Huang J (2011). “Coordinate Descent Algorithms for Nonconvex Penalized Regres-
sion, with Applications to Biological Feature Selection.” The Annals of Applied Statistics,
5(1), 232–253. doi:10.1214/10-aoas388.

Breslow NE (1972). “Contribution to ‘Discussion on Professor Cox’s Paper’.” Journal of the
Royal Statistical Society B, 34(2), 216–217. doi:10.1111/j.2517-6161.1972.tb00900.x.

Cox DR (1972). “Regression Models and Life-Tables.” Journal of the Royal Statistical Soci-
ety B, 34(2), 187–202. doi:10.1111/j.2517-6161.1972.tb00899.x.

Dua D, Graff C (2019). “UCI Machine Learning Repository.” URL https://archive.ics.
uci.edu/ml.

Fan J, Li R (2001). “Variable Selection via Nonconcave Penalized Likelihood and Its Oracle
Properties.” Journal of the American Statistical Association, 96(456), 1348–1360. doi:
10.1198/016214501753382273.

Friedman J, Hastie T, Tibshirani R (2010). “Regularization Paths for Generalized Linear
Models via Coordinate Descent.” Journal of Statistical Software, 33(1), 1–24.
doi:10.18637/jss.v033.i01.

Friedman J, Hastie T, Tibshirani R, Narasimhan B, Tay JK, Simon N, Yang J (2023). glmnet:
Lasso and Elastic-Net Regularized Generalized Linear Models. R package version 4.1-7, URL
https://CRAN.R-project.org/package=glmnet.

Goeman JJ, Meijer RJ, Chaturvedi N (2022). penalized: L1 (Lasso and Fused Lasso) and L2
(Ridge) Penalized Estimation in GLMs and in the Cox Model. R package version 0.9-52,
URL https://CRAN.R-project.org/package=penalized.

Gorst-Rasmussen A, Scheike TH (2012). “Coordinate Descent Methods for the Penalized
Semiparametric Additive Hazards Model.” Journal of Statistical Software, 47(9), 1–17.
doi:10.18637/jss.v047.i09.

Hastie T, Efron B (2022). lars: Least Angle Regression, Lasso and Forward Stagewise. R pack-
age version 1.3, URL https://CRAN.R-project.org/package=lars.

Hastie T, Tibshirani R, Tibshirani R (2020). “Best Subset, Forward Stepwise or Lasso?
Analysis and Recommendations Based on Extensive Comparisons.” Statistical Science,
35(4), 579–592. doi:10.1214/19-sts733.

Hoerl AE, Kennard RW (1970). “Ridge Regression: Biased Estimation for Nonorthogonal
Problems.” Technometrics, 12(1), 55–67. doi:10.1080/00401706.1970.10488634.

JuliaAI (2023). MLJLinearModels.jl. Julia package version 0.9.1, URL https://github.
com/JuliaAI/MLJLinearModels.jl.

JuliaStats (2022). Lasso.jl. Julia package version 0.7.0, URL https://github.com/
JuliaStats/Lasso.jl.

Lokhorst J, Venables B, Turlach B, Maechler M (2021). lasso2: L1 Constrained Estimation
aka ‘Lasso’. R package version 1.2-22, URL https://CRAN.R-Project.org/src/contrib/
Archive/lasso2/.

Massias M, Gramfort A, Salmon J (2018). “Celer: A Fast Solver for the Lasso with Dual
Extrapolation.” In Proceedings of the 35th International Conference on Machine Learning,
volume 80, pp. 3321–3330.

McCullagh P, Nelder JA (1983). Generalized Linear Models. Chapman and Hall, London.

Meinshausen N (2007). “Relaxed Lasso.” Computational Statistics & Data Analysis, 52(1),
374–393. doi:10.1016/j.csda.2006.12.019.

Meinshausen N (2012). relaxo: Relaxed Lasso. R package version 0.1-2, URL https://CRAN.
R-Project.org/src/contrib/Archive/relaxo/.

Nelder JA, Wedderburn RWM (1972). “Generalized Linear Models.” Journal of the Royal
Statistical Society A, 135(3), 370–384. doi:10.2307/2344614.

Obozinski G, Taskar B, Jordan MI (2010). “Joint Covariate Selection and Joint Subspace
Selection for Multiple Classification Problems.” Statistics and Computing, 20(2), 231–252.
doi:10.1007/s11222-008-9111-x.

Park MY, Hastie T (2018). glmpath: L1 Regularization Path for Generalized Linear Mod-
els and Cox Proportional Hazards Model. R package version 0.98, URL https://CRAN.
R-project.org/package=glmpath.

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M,
Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M,
Perrot M, Duchesnay E (2011). “scikit-learn: Machine Learning in Python.” Journal of
Machine Learning Research, 12(85), 2825–2830. URL https://jmlr.org/papers/v12/
pedregosa11a.html.

Pölsterl S (2020). “scikit-survival: A Library for Time-to-Event Analysis Built on Top of
scikit-learn.” Journal of Machine Learning Research, 21(212), 1–6. URL https://jmlr.
org/papers/v21/20-729.html.

Prados J (2019). bmrm: Bundle Methods for Regularized Risk Minimization Package. R pack-
age version 4.1, URL https://CRAN.R-project.org/package=bmrm.

R Core Team (2022). R: A Language and Environment for Statistical Computing. R Founda-
tion for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Simon N, Friedman J, Hastie T, Tibshirani R (2011). “Regularization Paths for Cox’s Pro-
portional Hazards Model via Coordinate Descent.” Journal of Statistical Software, 39(5),
1–13. doi:10.18637/jss.v039.i05.

Therneau TM (2023). A Package for Survival Analysis in R. R package version 3.5-3, URL
https://CRAN.R-project.org/package=survival.

Therneau TM, Grambsch PM (2000). Modeling Survival Data: Extending the Cox Model.
Springer-Verlag. doi:10.1007/978-1-4757-3294-8.

Tibshirani R (1996). “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal
Statistical Society B, 58(1), 267–288. doi:10.1111/j.2517-6161.1996.tb02080.x.

Tibshirani R, Bien J, Friedman J, Hastie T, Simon N, Taylor J, Tibshirani RJ (2012). “Strong
Rules for Discarding Predictors in Lasso-Type Problems.” Journal of the Royal Statistical
Society B, 74(2), 245–266. doi:10.1111/j.1467-9868.2011.01004.x.

Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K (2005). “Sparsity and Smoothness
via the Fused Lasso.” Journal of the Royal Statistical Society B, 67(1), 91–108.
doi:10.1111/j.1467-9868.2005.00490.x.

Tibshirani RJ, Efron B (2002). “Pre-Validation and Inference in Microarrays.” Statistical
Applications in Genetics and Molecular Biology, 1(1), 1–19. doi:10.2202/1544-6115.1000.

Van der Kooij AJ (2007). Prediction Accuracy and Stability of Regression with Optimal Scaling
Transformations. Ph.D. thesis, Leiden University.

Van Rossum G, et al. (2011). Python Programming Language. URL https://www.python.
org/.

Vial G, Estermann F (2020). Relaxed Lasso. URL https://github.com/continental/
RelaxedLasso.

Yuan M, Lin Y (2006). “Model Selection and Estimation in Regression with Grouped
Variables.” Journal of the Royal Statistical Society B, 68(1), 49–67. doi:10.1111/j.
1467-9868.2005.00532.x.

Zeng Y, Breheny P (2021). “The biglasso Package: A Memory- and Computation-Efficient
Solver for Lasso Model Fitting with Big Data in R.” The R Journal, 12(2), 6–19.
doi:10.32614/rj-2021-001.

Zhang CH (2010). “Nearly Unbiased Variable Selection under Minimax Concave Penalty.”
The Annals of Statistics, 38(2), 894–942. doi:10.1214/09-aos729.

Zou H, Hastie T (2005). “Regularization and Variable Selection via the Elastic Net.” Journal of
the Royal Statistical Society B, 67(2), 301–320. doi:10.1111/j.1467-9868.2005.00503.x.

Zou H, Hastie T (2020). elasticnet: Elastic-Net for Sparse Estimation and Sparse PCA.
R package version 1.3, URL https://CRAN.R-project.org/package=elasticnet.

Affiliation:
J. Kenneth Tay
Department of Statistics
Stanford University
390 Jane Stanford Way
Stanford, California 94305, United States of America
E-mail: kjytay@stanford.edu
URL: https://kjytay.github.io/

Balasubramanian Narasimhan, Trevor Hastie
Department of Biomedical Data Sciences
and
Department of Statistics
Stanford University
390 Jane Stanford Way
Stanford, California 94305, United States of America
E-mail: naras@stanford.edu, hastie@stanford.edu
URL: https://web.stanford.edu/~naras/
https://web.stanford.edu/~hastie/

Journal of Statistical Software                          https://www.jstatsoft.org/
published by the Foundation for Open Access Statistics   https://www.foastat.org/
March 2023, Volume 106, Issue 1                          Submitted: 2021-03-04
doi:10.18637/jss.v106.i01                                Accepted: 2023-02-13
