
On the Selection Stability of Stability Selection and Its Applications

Mahdi Nouraie¹ and Samuel Muller*¹,²

¹School of Mathematical and Physical Sciences, Macquarie University
²School of Mathematics and Statistics, The University of Sydney

arXiv:2411.09097v1 [stat.ME] 14 Nov 2024

Abstract
Stability selection is a widely adopted resampling-based framework for high-dimensional
structure estimation and variable selection. However, the concept of ‘stability’ is often nar-
rowly addressed, primarily through examining selection frequencies, or ‘stability paths’.
This paper seeks to broaden the use of an established stability estimator to evaluate the
overall stability of the stability selection framework, moving beyond single-variable anal-
ysis. We suggest that the stability estimator offers two advantages: it can serve as a
reference to reflect the robustness of the outcomes obtained and help identify an optimal
regularization value to improve stability. By determining this value, we aim to calibrate
key stability selection parameters, namely, the decision threshold and the expected num-
ber of falsely selected variables, within established theoretical bounds. Furthermore, we
explore a novel selection criterion based on this regularization value. With the asymp-
totic distribution of the stability estimator previously established, convergence to true
stability is ensured, allowing us to observe stability trends over successive sub-samples.
This approach sheds light on the required number of sub-samples, addressing a notable
gap in prior studies. The stabplot package is developed to facilitate the use of the plots
featured in this manuscript, supporting their integration into further statistical analysis
and research workflows.

Keywords: Bioinformatics, Feature Selection, Model Selection, Structure Estimation, Variable


Selection

1 Introduction
The estimation of discrete structures, including graphs, clusters, or selection of variables, rep-
resents a long-standing and fundamental problem in statistics (Meinshausen and Bühlmann,
2010). Variable selection is a critical phase in the modeling pipeline and a fundamental step in
data preprocessing, where raw data is refined to ensure its suitability for analysis or modeling.
The objective is to eliminate variables that are noisy, redundant, or irrelevant (Tibshirani, 1996;
Müller and Welsh, 2010), addressing a key challenge in contemporary data science (Nogueira
et al., 2018).

Address for correspondence: samuel.muller@mq.edu.au

Stability selection, a widely recognized resampling-based variable selection framework, per-
forms variable selection by examining how frequently variables are chosen across multiple
randomly selected resamples (Meinshausen and Bühlmann, 2010). The selection frequency
indicates how often a variable is included in the estimated models over various resamples.
Meinshausen and Bühlmann (2010) analyzed the selection frequencies for individual variables
over a range of regularization values, presenting the results in a plot known as the ‘stability
path’. This plot facilitates the identification of variables that are consistently selected over the
regularization grid.
In this paper, we shift the focus from examining the stability of individual variables across
a grid of regularization values to evaluating the overall stability of the entire stability selection
framework across the regularization grid. Employing an established stability estimator, we
assess the stability of the framework in its entirety and introduce what we call ‘Stable Stability
Selection’, which identifies the smallest regularization value that yields highly stable outcomes.
We illustrate the stable stability selection methodology through variable selection in the
context of linear regression. However, the applicability of this approach extends beyond linear
regression, as stability selection is not confined to this specific problem. Although detailed
exploration is beyond the scope of this paper, stable stability selection is equally suitable for
more complex regression-type models.
The stability of a variable selection method refers to the consistency with which it selects
variables across different training sets drawn from the same underlying distribution (Kalousis
et al., 2007). Evaluation of the stability of a statistical model by random resampling has been
a well-established practice for decades. For example, Altman and Andersen (1989) applied this
approach in Cox’s proportional hazards regression model (Cox, 1972) to identify stable vari-
ables, which were defined as those consistently selected with higher frequencies across multiple
resamples.
In the seminal work of Meinshausen and Bühlmann (2010), the stability selection framework
was introduced, which offers a transformative method to identify stable variables in structure
estimation tasks. The stability selection approach described by Meinshausen and Bühlmann
(2010) uses sub-samples of half the size of the original dataset to
carry out variable selection. Stability selection enables the identification of variables that are
consistently recognized as important across the majority of sub-samples given a regularization
value, considering them as stable variables. Furthermore, Meinshausen and Bühlmann (2010)
established an asymptotic upper-bound for the Per-Family Error Rate, which represents the
expected number of falsely selected variables. This bound relies on the assumptions of ex-
changeability in the distribution of falsely selected variables and that the selection algorithm
performs no worse than random guessing.
Shah and Samworth (2013) introduced the complementary pairs stability selection method,
which computes selection frequencies by counting how often a variable is included in models
fitted to complementary 50% sub-samples. Shah and Samworth (2013) introduced more rigorous
upper-bounds for the expected number of variables selected with low selection probabilities.
It is somewhat surprising that, despite the framework being designated as ‘stability’ se-
lection and the considerable attention stability selection has received since its inception, the
overall stability of the framework itself has been largely overlooked, with no dedicated studies

specifically addressing this critical aspect.
There is a substantial body of research on the stability of variable selection methods, where
numerous stability measures have been proposed. For more comprehensive information, we
refer to Kuncheva (2007), Nogueira et al. (2018), and Sen et al. (2021).
Nogueira et al. (2018) conducted an extensive literature review on the topic, consolidating
the desirable properties of stability measures into five mathematical conditions: fully defined,
strict monotonicity, constant bounds, maximum stability if and only if the selection is
deterministic, and correction for chance. A brief overview of these properties will be provided
in the following paragraph. Nogueira et al. (2018) showed that none of the stability measures
previously introduced satisfied all five conditions. In response, Nogueira et al. (2018) pro-
posed a novel stability measure, which was the first in the literature to meet all these criteria,
generalizing several previous works such as in Kuncheva (2007).
Here, we give a concise review of the five conditions proposed by Nogueira et al. (2018).
The first property stipulates that the stability measure should function effectively regardless of
variations in the number of selected variables, meaning that it should not be limited to scenarios
where the number of selected variables remains constant across resamples. The second property
asserts that the stability measure should be a strictly decreasing function of the sample variance
of the binary selection status of each variable across random resamples; as the variance in the
selection status of each variable increases, the stability of the framework diminishes. The third
property states that the stability measure should be bounded by specific constants, which
facilitates the comparison of the stability of different procedures. The fourth property implies
that maximum stability is achieved when the selection results across all resamples are identical,
and vice versa. Finally, the fifth property posits that if all possible selection results are equally
likely to occur across random resamples, the expected value of the stability measure should
remain constant. This indicates that the stability measure should not be influenced by random
patterns that may arise during the resampling process.
Nogueira et al. (2018) conceptualized the stability of a variable selection procedure as an
underlying population parameter that requires estimation. In addition, Nogueira et al. (2018),
linking their work with Gwet (2008), demonstrated that as the number of resamples approaches
infinity, their stability measure, Φ̂, converges to a Normal distribution with a location parameter
𝜇 = Φ, where Φ denotes population stability. Consequently, Φ̂ is a consistent estimator of the
population stability Φ; that is, as the number of resamples approaches infinity, Φ̂ converges in
probability to the true population stability Φ. This property allows us to construct confidence
intervals and perform hypothesis testing on the stability values.
Nogueira et al. (2018) compared the stability of stability selection with that of the
Least Absolute Shrinkage and Selection Operator (LASSO; Tibshirani, 1996), concluding that
the former exhibits greater stability than the latter in most of the scenarios considered.
In this paper, our aim is to apply the stability estimator proposed by Nogueira et al.
(2018) to evaluate the stability of stability selection. Furthermore, we use the stability
values as a guide to find the optimal regularization value that achieves high stability with the least possible
loss in accuracy, if applicable. The optimal regularization value can be employed to calibrate
two critical parameters of stability selection: the decision-making threshold and the expected
number of falsely selected variables, balancing them with respect to one another. We also

use the optimal regularization value to define a new selection criterion, called stable stability
selection, prioritizing the selection of variables associated with highly stable outcomes, rather
than those mostly selected in their best-case scenarios over the regularization grid, which is the
primary criterion in the traditional stability selection framework.
In addition, the convergence value of the stability estimator serves as a reference to assess
the reliability of the results obtained. If the stability selection process demonstrates instability,
it is unreasonable to expect that its selections and preferences will be reliable. The point at
which convergence occurs provides valuable insight into the required number of sub-samples,
an aspect for which we found no dedicated research addressing its determination.
Nogueira et al. (2018) suggested that stability and accuracy can be interpreted through
the lens of a Pareto front (Pareto, 1896). We refer to this methodology as stability-accuracy
selection. Building on this perspective, we demonstrate that, given two reasonable assumptions,
our proposed regularization value constitutes a Pareto-optimal solution within this context.
Alternative stability estimators, aside from that proposed by Nogueira et al. (2018), may
also be applicable; however, a detailed comparison and evaluation of different stability measures
in the context of stability selection falls outside the scope of this paper.
The structure of this paper is organized as follows. Section 2 outlines the proposed method-
ology. Section 3 describes the real and synthetic datasets used in the paper. The results of
the application of the method to real and synthetic data are presented in Section 4. Finally,
Section 5 summarizes and concludes the paper.

2 Methodology
In this section, we present a methodology to evaluate the selection stability of the stability
selection framework and to determine the optimal regularization value accordingly. Although
we demonstrate the proposed method within the context of stability selection outlined by
Meinshausen and Bühlmann (2010), it is important to note that the methodology is adaptable
to other formulations of stability selection, such as those described by Shah and Samworth
(2013) and Beinrucker et al. (2016). We will introduce our methodology following a brief
overview of the foundational approach established by Meinshausen and Bühlmann (2010).

Stability selection

We consider a dataset D = {(𝒙_i^⊤, y_i)}_{i=1}^n, where each element consists of a univariate response
y_i ∈ ℝ and a p-dimensional vector of covariates 𝒙_i ∈ ℝ^p. In the context of linear regression,
the pairs (𝒙_i^⊤, y_i) are typically assumed to be independent and identically distributed (i.i.d.)
when the covariates are treated as random; for simplicity, however, we assume that the covariate
vectors are fixed.
The linear regression model is formally expressed as 𝒀 = 𝛽0 + 𝑋 𝜷 + 𝜺, where 𝑋 ∈ mat(𝑛 ×
𝑝) represents the design matrix, 𝒀 ∈ R𝑛 is the 𝑛-dimensional vector corresponding to the
univariate response variable, and 𝛽0 ∈ R denotes the intercept term. The vector of regression
coefficients for the 𝑝 non-constant covariates is denoted by 𝜷 ∈ R 𝑝 , while 𝜺 ∈ R𝑛 represents the
𝑛-dimensional vector of random errors. It is assumed that the components of 𝜺 are i.i.d. and

independent of the covariates.
Variable selection, in accordance with the terminology of Meinshausen and Bühlmann
(2010), typically involves categorizing covariates into two distinct groups: the signal group
S = { k | β_k ≠ 0 } and the noise group N = { k | β_k = 0 }, where S ∩ N = ∅. The primary
objective of variable selection is to accurately identify the signal group S.
The stability selection framework requires the implementation of an appropriate selection
method. One commonly used technique for variable selection in the linear regression context
is the LASSO estimator, formally defined as

    (β̂₀(λ), 𝜷̂(λ)) = argmin_{β₀ ∈ ℝ, 𝜷 ∈ ℝ^p} ( ‖𝒀 − β₀ − X𝜷‖₂² + λ Σ_{k=1}^p |β_k| ),

where λ ∈ ℝ⁺ represents the LASSO regularization parameter. The set of non-zero coefficients
can be identified as Ŝ(λ) = { k | β̂_k(λ) ≠ 0 }, derived from solving the LASSO optimization
problem through convex optimization.
Although alternative methods exist for variable selection in linear regression models, a
detailed exploration of techniques for determining the regularization method and penalty term
in stability selection lies beyond the scope of this paper.
Meinshausen and Bühlmann (2010) defined the stable set as

    Ŝ_stable = { j | max_{λ ∈ Λ} Π̂_j^λ ≥ π_thr };  j = 1, …, p,        (1)

where Λ represents the set of regularization values, 0 < π_thr < 1 is the threshold for decision-
making in variable selection, and Π̂_j^λ denotes the selection frequency of the jth variable given
the regularization parameter λ.
Equation (1) identifies variables whose selection frequencies exceed π_thr under the regularization
value that maximizes their selection frequencies; it therefore evaluates each variable at its
best case for decision-making.

Stable stability selection

We now move forward to stable stability selection. To implement this, we first define a grid
of regularization values, Λ, from which the optimal 𝜆 for LASSO is to be identified. For this
purpose, we use as the default choice the 𝜆 values generated by the cv.glmnet function from the
glmnet package in R (Friedman et al., 2010), applying a 10-fold cross-validation to the complete
dataset D. Alternative regularization values can also be employed. However, a comparison of
different methods for generating the regularization grid falls outside the scope of this paper.
The goal is to identify the optimal regularization value 𝜆 stable ∈ Λ that yields highly stable
outcomes with the least possible loss in terms of predictive ability.
The following procedure is applied to each regularization value λ ∈ Λ. In each iteration of
the stability selection process, a sub-sample comprising half of the original dataset D is
randomly selected. The LASSO model is then fitted to this sub-sample using the
fixed value of 𝜆, and the binary selection outcomes are recorded as a row in the binary selection
matrix 𝑀 (𝜆) ∈ mat(𝐵 × 𝑝), where 𝐵 denotes the number of sub-samples. Therefore, in the end,

we obtain |Λ| distinct selection matrices 𝑀 (𝜆).
In this context, M(λ)_{bj} = 1 indicates that the jth variable is identified as part of the signal
set Ŝ(λ) when LASSO is applied to the bth sub-sample given λ. In contrast, M(λ)_{bj} = 0 signifies
that the jth variable is classified as belonging to the noise set N̂(λ) after applying LASSO to
the bth sub-sample given λ.
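The construction of the selection matrices can be sketched in a few lines. The paper works in R with glmnet; the Python sketch below is illustrative only, treating the base selector as a pluggable function so that any method returning a binary inclusion vector (e.g. non-zero LASSO coefficients at a fixed λ) can be used. The function name `selection_matrix` is our own, not part of any package.

```python
import numpy as np

def selection_matrix(X, y, select, B=500, seed=0):
    """Build the binary selection matrix M(lambda) via sub-sampling.

    `select(X_sub, y_sub)` must return a length-p boolean vector marking
    which variables the base selector keeps on that sub-sample.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    M = np.zeros((B, p), dtype=int)
    for b in range(B):
        # random sub-sample of half the original sample size
        idx = rng.choice(n, size=n // 2, replace=False)
        M[b] = np.asarray(select(X[idx], y[idx]), dtype=bool)
    return M
```

In practice `select` would wrap a LASSO fit at the fixed λ; a constant selector is enough to check the mechanics.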
The stability estimator proposed by Nogueira et al. (2018) is defined as

    Φ̂(M(λ)) = 1 − [ (1/p) Σ_{j=1}^p s_j² ] / [ (q(λ)/p)(1 − q(λ)/p) ];  λ ∈ Λ,        (2)

where s_j² denotes the unbiased sample variance of the binary selection statuses of the jth
variable, and q(λ) denotes the average number of variables selected under the regularization
parameter λ. The stability estimator Φ̂ is bounded by [−1/(B − 1), 1] (Nogueira et al., 2018);
the larger Φ̂, the more stable M(λ) is.


To interpret the stability values of Equation (2), Nogueira et al. (2018) adopted the guide-
lines proposed by Fleiss et al. (2004). According to Nogueira et al. (2018), stability values that
exceed 0.75 indicate excellent agreement between selections beyond what would be expected
by random chance, while values below 0.4 signify poor agreement between them. Based on
Nogueira et al. (2018), stability values that reside between these thresholds are categorized as
indicative of intermediate to good stability.
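A minimal implementation of the estimator in Equation (2) is a few lines of NumPy. The authors provide their own code in the repository accompanying Nogueira et al. (2018); the sketch below and its function name are ours.

```python
import numpy as np

def stability(M):
    """Nogueira et al. (2018) stability estimator, Equation (2).

    M is the B x p binary selection matrix for a fixed lambda.
    """
    B, p = M.shape
    freq = M.mean(axis=0)                    # per-variable selection frequency
    s2 = B / (B - 1) * freq * (1 - freq)     # unbiased sample variance s_j^2
    q = M.sum(axis=1).mean()                 # q(lambda): average model size
    return 1 - s2.mean() / ((q / p) * (1 - q / p))
```

A fully deterministic selection (identical rows with a non-trivial model size) attains the maximum value 1, while alternating selections drive the estimator toward its lower bound −1/(B − 1).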
By obtaining all |Λ| selection matrices, we can estimate the stability of each using the
formulation given in Equation (2). We propose that the optimal regularization value, denoted
as 𝜆 stable , is the smallest regularization value at which the stability measure surpasses 0.75,
that is,

    λ_stable = min{ λ ∈ Λ | Φ̂(M(λ)) ≥ 0.75 }.        (3)

The threshold value of 0.75 is somewhat arbitrary and may not be attainable in certain situa-
tions. We address this limitation in the following paragraph.
As we will see in Section 4, 𝜆 stable may not exist in certain practical applications due to
procedural instability, which prevents the stability values from exceeding 0.75. In such cases,
we propose
  
𝜆 stable-1sd = min 𝜆 ∈ Λ | Φ̂(𝑀 (𝜆)) ≥ max Φ̂(𝑀 (𝜆)) − SD𝜆∈Λ Φ̂(𝑀 (𝜆)) , (4)
𝜆∈Λ

where SD represents the standard deviation of the stability values.
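The two rules in Equations (3) and (4) reduce to a short selection step once the stability values over the grid are available. The Python sketch below is illustrative; the function name is ours, and the default threshold of 0.75 follows the text.

```python
import numpy as np

def lambda_stable(lams, stab, thr=0.75):
    """Smallest lambda with stability >= thr (Equation (3)); if no value
    attains thr, fall back to the 1-sd rule of Equation (4)."""
    lams = np.asarray(lams, dtype=float)
    stab = np.asarray(stab, dtype=float)
    ok = stab >= thr
    if not ok.any():
        # Equation (4): within one SD of the maximum stability over the grid
        ok = stab >= stab.max() - stab.std(ddof=1)
    return lams[ok].min()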


Both upper-bounds of the Per-Family Error Rate, mentioned in Section 1, provided by
Meinshausen and Bühlmann (2010) and Shah and Samworth (2013), depend on the values of
𝜆 and 𝜋thr . The calibration of these two values requires the arbitrary selection of one of these
two parameters, which can be challenging to justify. By adopting the optimal regularization
value, that is 𝜆 = 𝜆 stable or 𝜆 = 𝜆 stable-1sd , the upper-bound can be explicitly tailored to the
data owner’s preferences. If a pre-determined 𝜋thr is chosen, the Per-Family Error Rate is
deterministically obtained; alternatively, if a fixed pre-determined Per-Family Error Rate value
is desired, the corresponding 𝜋thr is enforced.

The asymptotic upper-bound provided by Meinshausen and Bühlmann (2010) for the Per-
Family Error Rate is given by

    PFER(Λ, π_thr) = (1 / (2π_thr − 1)) · (q_Λ² / p),        (5)

where 𝑞 Λ represents the average number of selected variables over the regularization grid Λ.
In a manner consistent with the approach of Bodinier et al. (2023), we can derive a point-wise
control version of Equation 5 by restricting Λ to a single value, 𝜆, without affecting the validity
of the original equation. Consequently, we can replace 𝑞 Λ with 𝑞(𝜆), as introduced in Equation
2. By employing 𝜆 stable within this formula and utilizing 𝑞(𝜆 stable ), we establish a two-way
control: fixing 𝜋thr allows us to determine the upper-bound, and vice versa. A similar approach
can be applied to the upper-bounds proposed by Shah and Samworth (2013).
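The two-way control can be made concrete. Assuming the point-wise form of Equation (5) with q(λ) in place of q_Λ, the bound and its inversion are one-liners for 0.5 < π_thr < 1; the helper names below are hypothetical.

```python
def pfer_bound(q_lam, p, pi_thr):
    """Point-wise Meinshausen-Buhlmann bound, Equation (5) with q(lambda).

    Valid for 0.5 < pi_thr < 1.
    """
    return q_lam ** 2 / ((2 * pi_thr - 1) * p)

def pi_thr_for_pfer(q_lam, p, target_pfer):
    """Invert the bound: the threshold that enforces a desired PFER."""
    return 0.5 * (1 + q_lam ** 2 / (p * target_pfer))
```

Fixing π_thr yields the error bound deterministically, and fixing the desired PFER yields the corresponding threshold, mirroring the calibration described above.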
Having 𝜆 stable or 𝜆 stable-1sd allows a fresh perspective on variable selection through stabil-
ity selection. Meinshausen and Bühlmann (2010) introduces Equation (1) to identify stable
variables, that is, those with selection frequencies that exceed 𝜋thr in the best-case scenario.
Instead, we propose

    Ŝ_stable = { j | Π̂_j^{λ_stable} ≥ π_thr };  j = 1, …, p.        (6)

Equation (6) represents the variables with selection frequencies higher than 𝜋thr under the
𝜆stable , which corresponds to the highly stable outcomes. We term this selection criterion stable
stability selection because it leverages highly stable outcomes to facilitate variable selection. If
𝜆stable does not exist, 𝑆ˆstable-1sd can be defined similarly.
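Given the selection matrix at λ_stable, the stable set of Equation (6) is a simple thresholding of selection frequencies. A Python sketch with an illustrative function name:

```python
import numpy as np

def stable_set(M_stable, pi_thr=0.6):
    """Equation (6): indices of variables whose selection frequency at
    lambda_stable reaches pi_thr. M_stable is the B x p selection matrix
    obtained at lambda_stable (or lambda_stable-1sd)."""
    freq = np.asarray(M_stable).mean(axis=0)   # Pi-hat per variable
    return np.flatnonzero(freq >= pi_thr)
```

The same function applied to the matrix at λ_stable-1sd gives Ŝ_stable-1sd.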
As outlined in Section 1, as the number of sub-samples increases, the convergence of Φ̂ to
the true stability Φ is ensured. To illustrate this, we define 𝑀 (𝑡) (𝜆) as the matrix containing
the selection outcomes from the first 𝑡 sub-samples for a given 𝜆 ∈ Λ, that is, the first 𝑡 rows of
𝑀 (𝜆). The objective is to evaluate the stability of 𝑀 (𝑡) (𝜆) across successive sub-samples in order
to determine the appropriate cut-off point for the process, that is, the number of sub-samples
required to achieve convergence in stability. Beyond this threshold, additional sub-samples
do not significantly change the stability of 𝑀 (𝜆). We propose plotting the stability values
of the selection matrix over the sequential sub-sampling, that is Φ̂(𝑀 (𝑡) (𝜆)); 𝑡 = 2, 3, . . . , 𝐵,
against 𝑡 to monitor the convergence status of the stability values. Given that the asymptotic
distribution of Φ̂ is established, a confidence bound can be drawn along the curve to reflect the
inherent uncertainty of the stability estimator.
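Such a convergence plot only requires the stability of each leading sub-matrix M^(t)(λ). A self-contained Python sketch (confidence bands omitted for brevity; the inner function re-implements Equation (2)):

```python
import numpy as np

def stability_trace(M):
    """Stability of the first t sub-samples, t = 2, ..., B, for monitoring
    convergence of the estimator over sequential sub-sampling."""
    def phi(Mt):
        B, p = Mt.shape
        freq = Mt.mean(axis=0)
        s2 = B / (B - 1) * freq * (1 - freq)   # unbiased variance per variable
        q = Mt.sum(axis=1).mean()              # average model size
        return 1 - s2.mean() / ((q / p) * (1 - q / p))
    return [phi(M[:t]) for t in range(2, M.shape[0] + 1)]
```

Plotting the returned values against t shows where the trace flattens, suggesting a cut-off for the number of sub-samples.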
In Section 4, we present the results obtained from the application of the proposed approach
to synthetic and real datasets, which are introduced in Section 3.

3 Datasets
We evaluate the proposed methodology by conducting evaluations on synthetic and two real
bioinformatics datasets, as detailed below.

Synthetic Data

As synthetic data, we consider a dataset with a sample size of n = 50 that includes p = 500
predictor variables. The predictor vectors 𝒙_i are independently drawn from the Normal
distribution N(0, Σ), where Σ ∈ mat(p, p) has a diagonal of ones and σ_jk = ρ^{|j−k|}
for j ≠ k, with ρ ∈ {0.2, 0.5, 0.8}. The response variable depends linearly on only the
first two predictor variables, with coefficient vector 𝜷 = (1.5, 1.1, 0, …, 0)^⊤, and
the error term 𝜺 is an i.i.d. sample from the standard Normal distribution N(0, 1).
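For reproducibility, this design can be simulated directly. The Python sketch below mirrors the description above (AR(1)-type correlation, two active coefficients); the seed and function name are arbitrary choices, not from the paper.

```python
import numpy as np

def make_synthetic(n=50, p=500, rho=0.5, seed=0):
    """Draw X ~ N(0, Sigma) with sigma_jk = rho**|j-k|, and a response
    depending only on the first two predictors (beta = 1.5, 1.1)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1) covariance
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    beta[:2] = [1.5, 1.1]
    y = X @ beta + rng.standard_normal(n)                # standard Normal errors
    return X, y
```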

Riboflavin Data

For the first real example, we use the well-established ‘riboflavin’ dataset, which focuses on the
production of riboflavin (vitamin B2) from various Bacillus subtilis. This dataset, provided by
the Dutch State Mines Nutritional Products, is accessible via the hdi R package (Dezeure et al.,
2015). It comprises a single continuous response variable that represents the logarithm of the
riboflavin production rate, alongside 𝑝 = 4, 088 covariates, which correspond to the logarithm
of expression levels for 4,088 bacterial genes. The primary objective of analyzing this dataset
is to identify genes that influence riboflavin production, with the ultimate goal of genetically
engineering bacteria to enhance riboflavin yield. Data were collected from 𝑛 = 71 relatively
homogeneous samples, which were repeatedly hybridized during a feed-batch fermentation pro-
cess involving different engineered strains and varying fermentation conditions. Bühlmann et al.
(2014) employed stability selection with LASSO and identified three genes—LYSC_at, YOAB_at,
and YXLD_at—as having a significant impact on the response variable.

Affymetrix Rat Genome 230 2.0 Array

As an additional real-world example, we investigate ‘Affymetrix Rat Genome 230 2.0 Array’
microarray data introduced by Scheetz et al. (2006). This dataset comprises 𝑛 = 120 twelve-
week-old male rats, with expression levels recorded for nearly 32,000 gene probes for each rat.
The primary objective of this analysis is to identify the probes most strongly associated with the
expression level of the TRIM32 probe (1389163_at), which has been linked to the development
of Bardet-Biedl syndrome (Chiang et al., 2006). This genetically heterogeneous disorder affects
multiple organ systems, including the retina.
In accordance with the preprocessing steps outlined by Huang et al. (2008), we excluded
gene probes with a maximum expression level below the 25th percentile and those exhibiting
an expression range smaller than 2. This filtering process yielded a refined set of 𝑝 = 3, 083
gene probes that demonstrated sufficient expression and variability for further analysis.

4 Results
In this section, we illustrate the efficacy of the proposed methodology through its application
to the synthetic and real datasets introduced in Section 3.

Synthetic Data

Here, we present the results obtained from the application of stable stability selection to
synthetic data. We used the GitHub repository associated with Nogueira et al. (2018) to
implement their stability estimator and to obtain confidence intervals.¹
Figure 1 illustrates the selection stability of the stability selection on a grid of regularization
values, applied to the synthetic datasets introduced in Section 3, using three different values of
𝜌 and choosing the number of sub-samples 𝐵 = 500. The horizontal dashed red lines indicate
the stability thresholds of 0.4 and 0.75 as discussed in Section 2.
As illustrated in Figure 1, the regularization value that minimizes the cross-validation error,
𝜆min , falls within the poorly stable region. In addition, the regularization value that maintains
the cross-validation error within one standard error of 𝜆 min , referred to as 𝜆 1se (Hastie et al.,
2009), lies within the poor to intermediate stability regions. In contrast, 𝜆 stable , introduced
in Equation (3), is specifically designed to fall within the region of excellent stability, where
applicable.
Notably, in all three correlation scenarios, the three regularization selection methods suc-
cessfully identify relevant variables with high selection frequencies, with a minimum selection
frequency of 0.994. However, since 𝜆 stable is larger than the two other regularization values, it
applies a stronger shrinkage to variables. To further examine this in terms of model accuracy,
for each correlation scenario, we generate 𝑛′ = 25 additional test samples from the correspond-
ing distribution described in Section 3. At each iteration of stability selection, we predict the
response variable for the test samples using the estimated model and, ultimately, we aggregate
all the mean squared error (MSE) values obtained over the corresponding 𝜆 by averaging.
As demonstrated in Figure 2, across all three correlation scenarios, the improvement in
stability when transitioning from 𝜆 1se to 𝜆 stable far outweighs the reduction in accuracy. This
outcome aligns closely with the findings of Nogueira et al. (2018), who noted that “All these
observations show that stability can potentially be increased without loss of predictive power”
and “We also observe that pursuing stability may help identifying the relevant set of features”.
To demonstrate the convergence of the stability estimator Φ̂, for each correlation scenario,
we used 𝜆 stable and calculated the stability of the selection matrix 𝑀 (𝜆 stable ) through sequential
sub-sampling, using Equation (2). The stability values across iterations are shown in Figure
3. The blue shading around the line represents the 95% confidence interval for Φ̂. As can be
observed, the interval narrows with increasing iterations, indicating a reduction in the estima-
tor’s uncertainty. Figure 3 reveals that, across all three scenarios, after approximately 𝐵∗ ≈ 200
iterations, the stability estimator converges, with no significant change thereafter in its value.
By monitoring the stability trajectory throughout the process, valuable insight can be gained
in determining the optimal cut-off point, that is, the optimal number of sub-samples in terms
of stability of the process.

Riboflavin Data

Next, we apply stable stability selection to the riboflavin dataset. As described in Section 3,
the dataset consists of 𝑝 = 4, 088 genes, the main objective being to identify the key genes that
¹ https://github.com/nogueirs/JMLR2018

Figure 1: Selection stability of stability selection over the regularization grid on the synthetic
data [panels (a) ρ = 0.2, (b) ρ = 0.5, (c) ρ = 0.8; plots not reproduced]
Figure 2: Selection stability and MSE of stability selection over the regularization grid on the
synthetic data [panels (a) ρ = 0.2, (b) ρ = 0.5, (c) ρ = 0.8; plots not reproduced]
Figure 3: Selection stability of stability selection over sequential sub-sampling on the synthetic
data [panels (a) ρ = 0.2, (b) ρ = 0.5, (c) ρ = 0.8; plots not reproduced]
influence riboflavin production. As before, we choose the number of sub-samples 𝐵 = 500. The
predictor variables are standardized using the scale function prior to being input into LASSO.
In this problem, λ_stable does not exist, so we use λ_stable-1sd instead. After B iterations, four
genes have selection frequencies Π̂^{λ_stable-1sd} greater than 0.5: YXLD_at (0.606), YOAB_at
(0.558), LYSC_at (0.540), and YCKE_at (0.532).
Figure 4 illustrates the stability of stability selection when applied to the riboflavin dataset.
As shown in Figure 4a, 𝜆 stable does not exist in this example, as the stability values do not sur-
pass the 0.75 threshold. Figure 4b, generated with 𝜆 stable-1sd , indicates the convergence slightly
above 0.2 after about 200 iterations. These results imply that stability selection with LASSO
exhibits poor selection stability within this dataset. Consequently, the findings presented by
Bühlmann et al. (2014) may require careful reconsideration, particularly since the selection
frequencies of the three identified genes are relatively low.

Affymetrix Rat Genome 230 2.0 Array

Finally, we apply the proposed methodology to the rat microarray data. As mentioned in
Section 3, the data consists of 𝑝 = 3, 083 gene probes. The main aim for this data is to
identify probes that influence the TRIM32 probe. Again, we choose the number of sub-samples
𝐵 = 500. The predictor variables are standardized using the scale function prior to being
input into LASSO.
Due to the instability of the outcomes, λ_stable does not exist in this problem. After B iterations,
three probes have selection frequencies Π̂^{λ_stable-1sd} greater than 0.5: 1390539_at (0.640),
1389457_at (0.570), and 1376747_at (0.564).
As shown in Figure 5, similar to the results of the riboflavin dataset, the procedure cannot
be considered stable. Specifically, Figure 5a reveals that 𝜆 stable does not exist, as the stability
values do not exceed the threshold of 0.75. For Figure 5b, we used 𝜆 stable-1sd . Figure 5b shows
that the stability values converged to approximately 0.15.
The results illustrated in Figures 4 and 5 indicate that while stability selection is a valuable
approach, more can be revealed by focusing on the stability of stability selection. As the com-
plexity of the data (in terms of interdependencies between variables, number of variables subject
to selection, etc.) increases, the stability values decline markedly. It is therefore essential to evaluate the stability of the entire procedure before drawing conclusions about the importance of variables from the outcomes of stability selection. Neglecting this evaluation may result in overconfidence in results that in fact lack stability. Given that such analyses are often underrepresented in the literature on stability selection, domain experts may benefit from carefully revisiting previously published findings.
The convergence plots for both synthetic and real examples indicate that, regardless of
whether the convergence value is high or low, stability values tend to stabilize after 𝐵∗ ≈ 200
iterations. This observation could provide a useful rule of thumb for determining the required
number of sub-samples in the stability selection framework.
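The convergence monitoring described above can be sketched with the stability estimator of Nogueira et al. (2018), computed on growing prefixes of the selection matrix. The simulated selection matrix below is an assumption for illustration: row 𝑏 records which of the 𝑑 variables were selected in sub-sample 𝑏.

```python
# Monitoring selection stability as sub-samples accumulate, using the
# stability estimator of Nogueira et al. (2018).
import numpy as np

def nogueira_stability(Z):
    """Stability of a binary selection matrix Z (M runs x d variables)."""
    M, d = Z.shape
    p_hat = Z.mean(axis=0)                    # per-variable selection frequency
    s2 = (M / (M - 1)) * p_hat * (1 - p_hat)  # unbiased per-variable variance
    k_bar = Z.sum(axis=1).mean()              # average number selected per run
    denom = (k_bar / d) * (1 - k_bar / d)     # variance under the null model
    return 1 - s2.mean() / denom

rng = np.random.default_rng(1)
d, B = 100, 500
# Simulated selections: 5 variables chosen almost always, the rest rarely.
probs = np.full(d, 0.05)
probs[:5] = 0.95
Z = rng.random((B, d)) < probs

# Running stability after every 50 additional sub-samples; it should level
# off well before B, echoing the B* of about 200 discussed above.
running = [nogueira_stability(Z[:m]) for m in range(50, B + 1, 50)]
print(np.round(running, 3))
```

A perfectly reproducible selection (identical rows of Z) yields a stability of exactly 1, while random selections drive the estimator toward 0.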

Figure 4: Selection stability of stability selection on riboflavin data

Figure 5: Selection stability of stability selection on rat microarray data

Stability and Accuracy from a Pareto Front Perspective

Nogueira et al. (2018) suggested that stability and accuracy can be analyzed using the concept of
the Pareto front (Pareto, 1896), which identifies regularization values that are not dominated
by any other in terms of both criteria. A regularization value is considered Pareto optimal
if no other value on the regularization grid offers both higher stability and higher accuracy.
We call this approach stability-accuracy selection, which seeks to balance these two metrics.
In this paper, we introduced stable stability selection, which focuses on identifying the
regularization value that achieves a high stability of the procedure with the least possible loss
in accuracy.
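Extracting a Pareto front from paired (stability, accuracy) measurements over a regularization grid can be sketched as follows. The grid and all numbers below are hypothetical, invented for demonstration; in practice each pair would come from stability selection and a cross-validated prediction error at that 𝜆.

```python
# Pareto front of stability versus accuracy over a hypothetical
# regularization grid; accuracy is measured as -MSE, so larger is better.
import numpy as np

lambdas   = np.array([0.01, 0.05, 0.10, 0.20, 0.40, 0.80])
stability = np.array([0.30, 0.55, 0.70, 0.80, 0.78, 0.60])
accuracy  = np.array([-1.0, -0.8, -0.7, -0.9, -1.2, -1.8])

def pareto_mask(stab, acc):
    """True for points not dominated in both stability and accuracy."""
    n = len(stab)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if j != i and stab[j] >= stab[i] and acc[j] >= acc[i] \
               and (stab[j] > stab[i] or acc[j] > acc[i]):
                keep[i] = False
    return keep

front = pareto_mask(stability, accuracy)
# Among Pareto points, pick the one maximizing stability + accuracy,
# mirroring the tie-breaking rule used in this paper.
best = np.argmax(np.where(front, stability + accuracy, -np.inf))
print(lambdas[front], lambdas[best])
```

With these illustrative values, two grid points survive as Pareto optimal and the sum rule breaks the tie between them.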
As an experiment, in our synthetic data analysis we use the negative mean squared error (−MSE) as the measure of accuracy. Given that a Pareto optimal solution is not necessarily unique, we select the Pareto solution that maximizes the sum of accuracy
and stability. In Figure 6a, 𝜆 Pareto is located close to 𝜆 stable , while in Figures 6b and 6c, they
coincide exactly. Interestingly, 𝜆 stable is also a Pareto solution in Figure 6a. It is now pertinent
to discuss the relationship between 𝜆 stable and Pareto optimality.

Corollary 1. Let 𝜆 stable be defined as in Equation (3) and assume that it exists. If the stability
curve is non-decreasing up to 𝜆 stable , and the loss function is non-decreasing after 𝜆 stable , then
𝜆stable is a Pareto optimal solution.

Proof. Since the stability curve is non-decreasing prior to 𝜆stable, any regularization value 𝜆 ∈ Λ with 𝜆 < 𝜆stable is at most as stable as 𝜆stable. Therefore, these values do not dominate 𝜆stable in terms of stability.
Similarly, since the loss function is non-decreasing after 𝜆stable, any regularization value 𝜆 ∈ Λ with 𝜆 > 𝜆stable is at most as accurate as 𝜆stable. Hence, these values also do not dominate
𝜆stable in terms of accuracy.
Since 𝜆 stable is not dominated by any values less than it in terms of stability nor by any
values greater than it in terms of accuracy, we conclude that 𝜆 stable constitutes a Pareto optimal
solution.

The two assumptions in Corollary 1 are both natural and justifiable. Increasing the reg-
ularization value is expected to allow the model to achieve an optimal balance of sparsity by
shrinking irrelevant variables, thereby enhancing stability. The regularization value 𝜆 stable is
intended to signify the point where the model attains high stability with minimal loss of accu-
racy. Consequently, we anticipate that the stability curve exhibits a non-decreasing trend prior
to reaching 𝜆 stable . However, since excessive regularization can lead to underfitting, we expect
that the stability curve will flatten or decrease after reaching one or more peaks beyond 𝜆 stable .
Regarding the second assumption, it is expected that increasing regularization helps the
model eliminate irrelevant variables, improving its ability to predict the response variable. The
regularization values 𝜆 min and 𝜆 1se are designed to capture this behavior, marking the points
where regularization is most effective in terms of predictive ability. Beyond these points, we
anticipate that the loss curve will flatten or increase. Since 𝜆stable is intended to trade a small amount of accuracy for greater stability, we expect 𝜆stable to occur after 𝜆1se, that is, 𝜆stable > 𝜆1se > 𝜆min, which implies that the loss curve is expected to be non-decreasing after 𝜆stable.

Figure 6: 𝜆min, 𝜆1se, 𝜆stable, and 𝜆Pareto over the regularization grid for (a) 𝜌 = 0.2, (b) 𝜌 = 0.5, and (c) 𝜌 = 0.8; the latter two are identical when 𝜌 is 0.5 or 0.8.
Thus, according to Corollary 1, 𝜆 stable represents a stability-accuracy solution for the prob-
lem of regularization tuning, ensuring high stability while minimizing loss in accuracy.

5 Conclusion
In this paper, we examined the stability of the stability selection framework, which underpins
the reproducibility of the results and, consequently, our confidence in them. We further applied
this concept to determine the optimal regularization value in terms of stability, facilitating
the calibration of the decision-making threshold and the expected number of falsely selected
variables by leveraging upper bounds established in the literature. In addition, we introduced a novel selection criterion based on the optimal regularization value, called stable stability selection, which ensures the selection of variables associated with highly stable results
with the least possible loss in terms of predictive ability. Under two justifiable assumptions, we
demonstrated that the proposed optimal regularization value is Pareto optimal in the Pareto
front of stability and accuracy. Lastly, we discussed the convergence of stability values across
sequential sub-sampling to identify the optimal number of sub-samples based on the stability
of the process.

Competing interests
The authors declare that they have no conflict of interest.

Author contributions statement


Mahdi Nouraie was responsible for drafting the manuscript, the development of the research
methodology and for writing the computer code used throughout. Samuel Muller provided crit-
ical feedback on the content of the manuscript, refining the clarity and scope of the manuscript
and the computer code.

Data Availability
The riboflavin dataset is accessible via the hdi package in R (Dezeure et al., 2015). The
rat microarray data can be obtained from the National Center for Biotechnology Information
(NCBI) website at www.ncbi.nlm.nih.gov, under accession number GSE5680.
The source code used for the paper is accessible through the following GitHub reposi-
tory: https://github.com/MahdiNouraie/Stable-Stability-Selection. Furthermore, the
stabplot package, which facilitates the use of the two plots introduced in this paper, is available
through https://github.com/MahdiNouraie/stabplot.

Acknowledgments
Mahdi Nouraie was supported by the Macquarie University Research Excellence Scholarship
(20213605). Samuel Muller was supported by the Australian Research Council Discovery
Project Grant (DP230101908).
We gratefully acknowledge Dr. Connor Smith for his guidance and support as co-supervisor
of Mahdi Nouraie’s PhD research.

References
Douglas G Altman and Per Kragh Andersen. Bootstrap investigation of the stability of a Cox
regression model. Statistics in Medicine, 8(7):771–783, 1989.

Andre Beinrucker, Ürün Dogan, and Gilles Blanchard. Extensions of stability selection using
subsamples of observations and covariates. Statistics and Computing, 26:1059–1077, 2016.

Barbara Bodinier, Sarah Filippi, Therese Haugdahl Nøst, Julien Chiquet, and Marc Chadeau-
Hyam. Automated calibration for stability selection in penalised regression and graphical
models. Journal of the Royal Statistical Society Series C: Applied Statistics, 72(5):1375–
1393, 2023.

Peter Bühlmann, Markus Kalisch, and Lukas Meier. High-dimensional statistics with a view
toward applications in biology. Annual Review of Statistics and Its Application, 1(1):255–278,
2014.

Annie P Chiang, John S Beck, Hsan-Jan Yen, Marwan K Tayeh, Todd E Scheetz, Ruth E
Swiderski, Darryl Y Nishimura, Terry A Braun, Kwang-Youn A Kim, Jian Huang, et al.
Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a
Bardet–Biedl syndrome gene (BBS11). Proceedings of the National Academy of Sciences, 103
(16):6287–6292, 2006.

David R Cox. Regression models and life-tables. Journal of the Royal Statistical Society Series
B: Statistical Methodology, 34(2):187–202, 1972.

Ruben Dezeure, Peter Bühlmann, Lukas Meier, and Nicolai Meinshausen. High-dimensional
inference: Confidence intervals, p-values and R-software hdi. Statistical Science, 30(4):533–
558, 2015.

Joseph L Fleiss, Bruce Levin, and Myunghee Cho Paik. The Measurement of Interrater Agree-
ment. John Wiley & Sons, Inc., 2004.

Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized
linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.

Kilem Li Gwet. Variance estimation of nominal-scale inter-rater reliability with random selec-
tion of raters. Psychometrika, 73(3):407–430, 2008.

Trevor Hastie, Robert Tibshirani, and Jerome H Friedman. The Elements of Statistical Learn-
ing: Data Mining, Inference, and Prediction, volume 1. Springer, 2009.

Jian Huang, Shuangge Ma, and Cun-Hui Zhang. Adaptive Lasso for sparse high-dimensional
regression models. Statistica Sinica, 18(4):1603–1618, 2008.

Alexandros Kalousis, Julien Prados, and Melanie Hilario. Stability of feature selection algo-
rithms: a study on high-dimensional spaces. Knowledge and Information Systems, 12:95–116,
2007.

Ludmila I Kuncheva. A stability index for feature selection. In Artificial Intelligence and
Applications, pages 390–395, 2007.

Nicolai Meinshausen and Peter Bühlmann. Stability selection. Journal of the Royal Statistical
Society Series B: Statistical Methodology, 72(4):417–473, 2010.

Samuel Müller and Alan H Welsh. On model selection curves. International Statistical Review,
78(2):240–256, 2010.

Sarah Nogueira, Konstantinos Sechidis, and Gavin Brown. On the stability of feature selection
algorithms. Journal of Machine Learning Research, 18(174):1–54, 2018.

Vilfredo Pareto. Cours D’Économie Politique. Rouge, Lausanne, 1896.

Todd E Scheetz, Kwang-Youn A Kim, Ruth E Swiderski, Alisdair R Philp, Terry A Braun,
Kevin L Knudtson, Anne M Dorrance, Gerald F DiBona, Jian Huang, Thomas L Casavant,
et al. Regulation of gene expression in the mammalian eye and its relevance to eye disease.
Proceedings of the National Academy of Sciences, 103(39):14429–14434, 2006.

Rikta Sen, Ashis Kumar Mandal, and Basabi Chakraborty. A critical study on stability mea-
sures of feature selection with a novel extension of lustgarten index. Machine Learning and
Knowledge Extraction, 3(4):771–787, 2021.

Rajen D Shah and Richard J Samworth. Variable selection with error control: another look at
stability selection. Journal of the Royal Statistical Society Series B: Statistical Methodology,
75(1):55–80, 2013.

Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal
Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996.

