Igann Sparse: Bridging Sparsity and Interpretability With Non-Linear Insight
Research in Progress
Abstract
Feature selection is a critical component in predictive analytics that significantly affects the prediction
accuracy and interpretability of models. Intrinsic methods for feature selection are built directly into
model learning, providing a fast and attractive option for large amounts of data. Machine learning
algorithms such as penalized regression models (e.g., the lasso) are the most common choice for
in-built feature selection. However, they fail to capture non-linear relationships, which ultimately limits
their ability to predict outcomes in intricate datasets. In this paper, we propose IGANN Sparse, a novel
machine learning model from the family of generalized additive models, which promotes sparsity through a
non-linear feature selection process during training. This ensures interpretability through improved model
sparsity without sacrificing predictive performance. Moreover, IGANN Sparse serves as an exploratory
tool for information systems researchers to unveil important non-linear relationships in domains that
are characterized by complex patterns. Our ongoing research is directed at a thorough evaluation of
the IGANN Sparse model, including user studies that assess how well users of the model can
benefit from the reduced number of features. This will allow for a deeper understanding of the interactions
between linear vs. non-linear modeling, number of selected features, and predictive performance.
1 Introduction
Predictive analytics is a crucial methodological stream of research in the field of information systems (IS)
that deals with the creation of empirical prediction models (Shmueli and Koppius, 2011). By leveraging
advanced machine learning (ML) techniques, researchers can uncover patterns and relationships within
large datasets, enabling them to anticipate future events, user behaviors, and system performance (Kühl
et al., 2021). In addition to its practical utility, predictive analytics also plays an important role in theory
development and testing, as well as relevance assessment (Shmueli and Koppius, 2011).
In recent years, the IS community has increasingly recognized the importance of ensuring that prediction
models not only provide high predictive performance, but are also comprehensible for explanation
purposes (Bauer, Zahn, and Hinz, 2023; Kim et al., 2023). There are two primary approaches
for ensuring model explainability: inherently interpretable models and post-hoc explainability methods.1
Inherently interpretable models, such as shallow decision trees, linear models, and generalized additive
1 It should be noted that the terms interpretability and explainability can have different meanings from a psychological perspective
(cf. Broniatowski, 2021). In this paper, however, we will concentrate on a technical differentiation (cf. Kraus et al., 2023).
models (GAMs) (Hastie and Tibshirani, 1986), are designed to be human-readable without requiring
additional explanation. These models often employ techniques such as linearity and sparsity to enhance
their interpretability (Rudin, 2019). On the other hand, post-hoc explainability tools like Shapley values
(Lundberg and Lee, 2017) or LIME (Ribeiro, Singh, and Guestrin, 2016) aim to approximate the complex
behavior of non-interpretable models (Esteva et al., 2019). However, these post-hoc methods should be
applied cautiously, as they may not fully capture the intricate workings of the original models. For tabular
data, the choice between the two approaches is often guided by the marginal performance gains achievable
with complex models over interpretable ones (Rudin, 2019; Zschech et al., 2022).
GAMs and their various extensions base their final prediction on a number of independent functions that
map the input features to the output space of the model, where they are summed up to generate the final
prediction (Hastie and Tibshirani, 1986). Each of these functions usually only processes an individual
feature or a single interaction between two features, which makes it possible to visualize the function after
training and to gain insights into the effect that a feature has on the model output.
Although GAMs have proven to be very powerful, they lack an important property, which limits their
applicability in high-dimensional feature spaces (i.e., datasets characterized by a large number of input
features): they do not easily allow the creation of sparse models, i.e., models that base their predictions
on only a few input features. This is the case because GAMs typically are trained in an iterative fashion
(through backfitting or gradient boosting), where a feature is selected in each iteration to minimize the
remaining loss of the model (Hastie and Tibshirani, 1986; Nori et al., 2019). Although very powerful, this
iterative approach makes the implementation of additional sparsity constraints difficult. Consequently,
linear feature selection methods are commonly used; however, data can contain non-linear relations,
which simple models such as linear regression are unable to detect.
Only recently, novel neural network based GAMs have been proposed which do not rely on the iteration
over features (e.g., Kraus et al., 2023; Yang, Zhang, and Sudjianto, 2021). This paper extends the
Interpretable Generalized Additive Neural Network (IGANN) framework with a focus on sparsity and
interpretability. It also positions IGANN Sparse as a powerful exploratory tool for uncovering
non-linear dependencies in data that traditional GAMs may not readily reveal.
Our contributions are fourfold: First, we propose a novel approach to fast training of sparse neural networks
using extreme learning machines. Second, we incorporate this method into the IGANN model, resulting
in the introduction of IGANN Sparse.2 Third, we validate that IGANN Sparse maintains comparable
predictive accuracy to its non-sparse counterparts while significantly reducing the number of features,
thereby improving interpretability. Finally, we demonstrate the utility of IGANN Sparse in non-linear
feature selection, establishing its role in exploratory data analysis and interpretative modeling.
Our work has implications for both predictive analytics research and practice. We address the issue of
feature selection, which is a common step during pre-processing of machine learning pipelines. To achieve
this, we expand IGANN, a model capable of learning non-linear relations in data, by presenting a sparse
IGANN version. Furthermore, our approach has implications for IS researchers, as training a sparse model
for feature selection is a promising way to keep the logic of prediction models comprehensible. Therefore,
our model offers a promising tool for empirical IS studies that are concerned with popular research
questions related to predictive model building, such as predicting purchase behavior, price dynamics, user
satisfaction or technology acceptance (Kühl et al., 2021; Shmueli and Koppius, 2011).
The remainder of this paper is structured as follows: In Section 2, we review related work. We propose
the new sparse model in Section 3. We evaluate this model in the experiments described in Section 4,
present the results in Section 5, and discuss them in Section 6.
2 Related Work
When dealing with high-dimensional data, it is often beneficial to use methods that decrease the number
of features impacting the final prediction. Sparsity improves model interpretability because the human
mind is not capable of processing the number of information units (i.e., features) that an ML model can
process at a time. This limitation lies around 7 ± 2 information units (Miller, 1956; Rudin, 2019). There
are a number of techniques to achieve this, such as compression, which decreases the model size;
principal component analysis (PCA), which projects the data onto mutually orthogonal components that
capture the most information in the fewest dimensions; and feature selection (Gui et al., 2017). While compression and
PCA lead to model inputs which humans are unable to comprehend intuitively, feature selection methods
simply reduce the number of features that an ML model bases its prediction on. This allows researchers
and decision-makers to fully comprehend the model behavior (Gui et al., 2017).
One way of determining the relevance of a feature is to use a linear model to estimate its predictive
power with regard to the target variable. The lasso, used as a feature selector, does this by fitting a linear
regression with L1 regularization (Hastie, Tibshirani, et al., 2009). Thereby, coefficients are pushed to
zero, resulting in a sparse model in which most of the input features can be discarded. These methods,
however, fall short when features naturally have a non-linear effect. For instance, in the context
of health analytics, neither a body temperature that is too low nor one that is too high is healthy; linear
feature selectors can thus easily ignore such an oftentimes powerful feature.
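As a toy illustration of this failure mode (our own example, not from the cited work): a symmetric, U-shaped effect such as body temperature yields a near-zero linear correlation with the target, so a lasso selector discards the feature.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 1000
temp = rng.normal(37.0, 1.0, n)        # body temperature, U-shaped effect
other = rng.normal(0.0, 1.0, n)        # second, irrelevant feature
y = (temp - 37.0) ** 2 + 0.1 * rng.normal(size=n)  # risk rises in both directions

X = np.column_stack([temp, other])
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize as usual before the lasso

selector = Lasso(alpha=0.1).fit(X, y)
print(selector.coef_)  # the temperature coefficient is (near) zero
```

Despite temperature being the only truly predictive feature, its L1-penalized coefficient collapses to (near) zero, so it would be dropped.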
Increases in the complexity of neural networks and of models such as gradient-boosted decision trees have
improved predictive power at the expense of interpretability, leading to so-called black-box models. To
tackle this challenge, recently proposed methods have kept some of the core innovations from black-box
models, but altered core parts to obtain fully interpretable models, such as GAMs (e.g., Kraus et al., 2023;
Nori et al., 2019).
IGANN, a novel ML model belonging to the GAM family, fits shape functions using a boosted ensemble
of neural networks, where each network represents an extreme learning machine (ELM), as illustrated in
Figure 1. ELMs are simple feed-forward neural networks that use a faster learning method than gradient-
based algorithms (Huang, Q.-Y. Zhu, and Siew, 2006). In detail, the training only includes updating the
weights of the output layer and, thus, is equal to fitting a linear model. Overall, IGANN has been shown to
produce smooth shape functions that can be easily comprehended by users. Furthermore, the way in which
IGANN trains the networks makes it an interesting choice to introduce model sparsity in a non-linear
fashion, which we describe in the following.
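A minimal sketch of the ELM idea (for brevity, a generic ELM; IGANN's ELMs additionally process each feature in a separate sub-network, as shown in Figure 1): the hidden weights are drawn at random and frozen, so training reduces to fitting a linear model on the hidden activations.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))               # 200 samples, 3 features
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]         # non-linear target

k = 50
W = rng.normal(size=(X.shape[1], k))        # random input weights, never trained
b = rng.normal(size=k)                      # random biases, never trained
H = np.tanh(X @ W + b)                      # fixed non-linear hidden activations

# "Training" the ELM = fitting only the output layer, a linear problem on H
out = Ridge(alpha=1.0).fit(H, y)
print(round(out.score(H, y), 3))            # high R^2 despite the non-linearity
```

Because only the last layer is fitted, training is as fast as a (regularized) linear regression while the random non-linear features still capture the non-linear relation.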
3 IGANN Sparse
As described above, IGANN uses a sequence of ELMs to compute the GAM. In the following, we
make use of this model choice to incorporate a sparsity-layer into the first ELM, which allows the model
to select the most important (potentially non-linear) features. Figure 1 illustrates this basic idea.
For a fixed number of inputs $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$, the ELM maps each input onto $k$ non-linear hidden
activations, which we call $h_1^{(1)}, \ldots, h_k^{(1)}, \ldots, h_1^{(m)}, \ldots, h_k^{(m)}$. We denote the vectors that store the respective
hidden activations by $h^{(i)}$, i.e., $h^{(i)} = [h_1^{(i)}, \ldots, h_k^{(i)}]$, and let $h$ represent all hidden activations, i.e., $h =
[h^{(1)}, \ldots, h^{(m)}]$.
In the traditional IGANN model, the ELM is now trained by solving the linear problem
$$\min_{\beta} \; L\big(h^{\top} \beta, \, y\big), \qquad (1)$$
where $\beta$ denotes the coefficients in the last layer, $y$ the true target variable, and $L$ describes the loss
function, such as mean squared error or cross-entropy. As can be seen in Equation 1, the ELM thus merely
solves a linear problem, yet the non-linear activations allow it to capture highly non-linear effects (Kraus
et al., 2023).
Figure 1. First ELM from IGANN Sparse with three features as input which includes the sparsity-layer.
Each feature is processed by a sub-network of the whole ELM. For input x(1) , the green part of the model
highlights the corresponding sub-network.
This work makes use of this characteristic by introducing a sparsity-layer. Given the non-linear activations
$h^{(1)}, \ldots, h^{(m)}$, where each $h^{(i)}$ represents a block of $k$ values, it is critical to ensure that the model remains
interpretable and avoids overfitting by using only the most important blocks of activations. To achieve
this, we introduce a sparsity-inducing step using the best-subset selection approach (J. Zhu et al., 2022).
Considering each block h(i) as an individual subset, the best-subset selection aims to find the subset
of blocks which, when used in the ELM model, results in the optimal balance between model fit and
complexity. Mathematically, the problem can be extended from Equation 1 as
$$\min_{S, \, \beta_S} \; L\big(h_S^{\top} \beta_S, \, y\big), \qquad (2)$$
where $S$ denotes the selected subset of blocks from $h^{(1)}, \ldots, h^{(m)}$, and $h_S$ represents the hidden activations
corresponding to this subset. The objective is to minimize the Bayesian Information Criterion (BIC), which
is defined in its standard form as
$$\mathrm{BIC} = |S| \, k \, \ln(n) - 2 \ln(\hat{L}),$$
where $n$ denotes the number of samples, $|S| \, k$ the number of active coefficients, and $\hat{L}$ the maximized
likelihood of the model restricted to $h_S$.
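To make the selection step concrete, the following is a hedged sketch that uses a simple greedy forward search as a stand-in for the best-subset solver of J. Zhu et al. (2022); the block structure and the Gaussian form of the BIC follow the description above, while the function names and the greedy strategy are our simplification, not the actual IGANN Sparse implementation.

```python
import numpy as np

def gaussian_bic(H_S, y, n_params):
    """Gaussian BIC: n * ln(RSS / n) + n_params * ln(n)."""
    n = len(y)
    if H_S.shape[1] == 0:
        rss = np.sum((y - y.mean()) ** 2)
    else:
        beta, *_ = np.linalg.lstsq(H_S, y, rcond=None)
        rss = np.sum((y - H_S @ beta) ** 2)
    return n * np.log(rss / n) + n_params * np.log(n)

def greedy_block_selection(blocks, y):
    """Forward-select whole blocks of hidden activations by BIC improvement."""
    k = blocks[0].shape[1]
    selected, remaining = [], list(range(len(blocks)))

    def bic_of(idx):
        H_S = np.hstack([blocks[i] for i in idx]) if idx else np.empty((len(y), 0))
        return gaussian_bic(H_S, y, len(idx) * k)

    best = bic_of(selected)
    while remaining:
        cand = min(remaining, key=lambda i: bic_of(selected + [i]))
        if bic_of(selected + [cand]) >= best:
            break  # no block improves the BIC any further
        selected.append(cand)
        remaining.remove(cand)
        best = bic_of(selected)
    return sorted(selected)

# Toy data: feature 1 acts quadratically, feature 2 linearly, feature 3 is noise.
rng = np.random.default_rng(0)
n = 300
x1, x2, x3 = rng.normal(size=(3, n))
y = x1 ** 2 + x2 + 0.3 * rng.normal(size=n)
# One block of k=5 random tanh activations per feature, as in an ELM.
blocks = [np.tanh(np.outer(x, rng.normal(size=5)) + rng.normal(size=5))
          for x in (x1, x2, x3)]
sel = greedy_block_selection(blocks, y)
print(sel)  # the blocks for x1 and x2 should appear
```

Because selection operates on whole blocks of activations rather than individual coefficients, a feature is either fully in or fully out of the model, which is what yields non-linear feature selection.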
4 Experiment Design
4.1 Datasets and Pre-Processing
Our experiments are based on common, publicly available benchmark datasets presented in Table 1. The
number of categorical features ("cat") in this table is measured after one-hot encoding. For pre-processing,
we removed columns such as IDs as well as categorical features with more than 25 distinct values, as is done
in similar experiments (e.g., Zschech et al., 2022). Furthermore, the standard scaler from scikit-learn
is used for all numerical features. The data is split using 5-fold cross-validation to evaluate the model's
performance.
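The described pre-processing can be sketched as follows (the toy dataset and its column names are placeholders, not one of the benchmark datasets from Table 1):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def drop_unusable(df, max_card=25):
    """Drop ID-like columns and categoricals with more than max_card values."""
    drop = [c for c in df.columns if c.lower() == "id"]
    for c in df.select_dtypes(include="object").columns:
        if df[c].nunique() > max_card:
            drop.append(c)
    return df.drop(columns=drop)

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "id": np.arange(n),                       # ID column -> removed
    "age": rng.normal(40.0, 10.0, n),         # numerical feature -> scaled
    "plan": rng.choice(["a", "b", "c"], n),   # low-cardinality categorical
    "note": [f"note-{i}" for i in range(n)],  # >25 distinct values -> removed
})
y = (df["age"] > 40.0).astype(int)

X = drop_unusable(df)
pre = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])
model = Pipeline([("pre", pre), ("clf", LogisticRegression())])
scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
print(scores.mean())
```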
Classification
Dataset                                  Samples   num   cat
college (Mukti, 2022)                      1,000     4    10
churn (IBM, 2019)                          7,043     3    37
credit (Fair Isaac Corporation, 2018)     10,459    21    16
income (Kohavi, 1996)                     32,561     6    59
bank (Moro, Cortez, and Rita, 2014)       45,211     6    41
airline (Klein, 2020)                    103,904    18     6
recidivism (Angwin et al., 2016)           7,214     7     4

Regression
Dataset                                  Samples   num   cat
bike (Fanaee-T and Gama, 2014)            17,379     7     5
wine (Cortez et al., 2009)                 4,898    11     0
productivity (Imran et al., 2019)          1,197     9    26
insurance (Lantz, 2015)                    1,338     3     6
crimes (Redmond and Baveja, 2002)          1,994   100     0
farming (Sidhu, 2021)                      3,893     7     3
house (Pace and Barry, 1997)              20,640     8     0

Table 1. Overview of selected datasets covering classification (y ∈ {0, 1}) and regression (y ∈ R) tasks.
Samples describes the number of observations (rows) recorded in the dataset. Numerical (num) and
categorical (cat) features give the number of input columns representing numerical or categorical values,
respectively. Cat features are counted after one-hot encoding.
4.2 Experiments
Our first experiment compares the prediction quality of our sparse model versus an unconstrained IGANN
model. This experiment assesses the trade-off between the quality of prediction and the level of sparsity.
Our second experiment tests the performance of the IGANN Sparse model for feature selection by
comparing it to a lasso model. The main evaluation metrics are the number of selected features, as well
as the area under the receiver operating characteristic curve (AUROC) for classification and the root
mean squared error (RMSE) for regression.
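These two metrics can be computed with scikit-learn as follows (the numbers are toy values for illustration, not results from our experiments):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score

# Classification: AUROC from predicted probabilities (closer to 1 is better).
y_true_clf = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])
auroc = roc_auc_score(y_true_clf, y_prob)
print(auroc)  # 0.75: three of the four positive/negative pairs are ranked correctly

# Regression: RMSE (lower is better).
y_true_reg = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred))
print(rmse)
```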
Both experiments are repeated 20 times with different random states for both the data split and model
training, in order to obtain a statistical distribution of the results while maintaining reproducibility. The
statistical analysis is done using a Wilcoxon signed-rank test (Neuhäuser, 2011). The test assumes the
null hypothesis of similar model performance. For the comparison in experiment one, we consider model
performance similar within a tolerance of one standard deviation, as a sparser model with comparable
performance is preferable in fields where comprehensibility is required (Rudin, 2019). The statistical
analysis for the second experiment, comparing IGANN Sparse and the lasso as feature selectors, is
conducted without tolerance.
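A minimal sketch of this paired test with SciPy, using synthetic scores rather than our actual results:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# 20 repetitions x 5 folds = 100 paired scores per model (synthetic here)
scores_full = rng.normal(0.85, 0.01, size=100)
scores_sparse = scores_full + rng.normal(0.005, 0.003, size=100)

# H0: both models perform similarly; a small p-value rejects H0
stat, p_value = wilcoxon(scores_sparse, scores_full)
print(p_value)
```

The test is paired, i.e., it compares the two models run by run on the same splits, which matches the repeated-runs design described above.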
5 Results
Table 2 shows the performance of IGANN Full and IGANN Sparse across 20 runs with 5-fold cross-
validation. For both experiments, the results of the Wilcoxon tests are highlighted in the tables. In only
three cases did the sparse model select 75 % or more of the datasets' total features; in one case, it selected
as few as 4 %.
The sparse model is considered superior if it performs within one standard deviation of its non-sparse
counterpart. Under this criterion, the comparison between the predictive performance of IGANN Full and
IGANN Sparse shows that IGANN Sparse is significantly better for 10 out of 14 datasets at p ≤ 0.01, and
for 11 out of 14 datasets at p ≤ 0.05. Each of the statistical tests is based on 100 observations.
Figure 2 exemplarily shows the performance for two datasets (college and credit) for varying numbers of
selected features. Already with as few as four features, we find very promising predictive performance,
which does not further improve when using more features.
As a feature selector, IGANN Sparse performed better than the lasso in 9 out of 14 cases. In three of our
seven classification datasets and in two of our seven regression datasets, the lasso performed better than
IGANN Sparse. This makes IGANN Sparse the better feature selector in the majority of datasets for both
classification and regression tasks.
Classification
Dataset      IGANN Full        IGANN Sparse         # Features
             AUROC ± SD        AUROC ± SD
college      0.863 ± 0.022     0.852 ± 0.025∗∗      72.0 %
churn        0.722 ± 0.012     0.711 ± 0.013∗∗      51.5 %
credit       0.731 ± 0.009     0.725 ± 0.016∗∗      44.8 %
income       0.775 ± 0.006∗∗   0.706 ± 0.024        51.1 %
bank         0.587 ± 0.005     0.584 ± 0.006∗∗      90.1 %
airline      0.933 ± 0.002∗∗   0.929 ± 0.002        64.6 %
recidivism   0.685 ± 0.010     0.680 ± 0.014∗∗      51.4 %

Regression
Dataset        IGANN Full        IGANN Sparse         # Features
               RMSE ± SD         RMSE ± SD
bike           0.766 ± 0.006     0.768 ± 0.007∗∗      80.4 %
wine           0.901 ± 0.015     0.914 ± 0.016∗       34.8 %
productivity   0.896 ± 0.032∗∗   0.960 ± 0.038        62.3 %
insurance      0.706 ± 0.020     0.707 ± 0.020∗∗      68.0 %
crimes         0.771 ± 0.026     0.778 ± 0.026∗∗       4.0 %
farming        0.815 ± 0.020     0.822 ± 0.020∗∗      56.0 %
house          0.733 ± 0.006     0.735 ± 0.005∗∗      78.6 %
Table 2. Performance comparison on the classification and regression datasets, showing AUROC and
RMSE results with standard deviations (SD) of the standard IGANN model compared to the sparse model,
together with the average percentage of features selected out of the input features described in Table 1,
including both categorical and numerical features. For classification, values closer to 1 are better; for
regression, lower values are better. Significant differences according to the Wilcoxon signed-rank test are
marked with ∗∗ (p ≤ 0.01) and ∗ (p ≤ 0.05) next to the respectively better model. We consider the sparser
model to be better if it performs within one standard deviation of the full model, due to its easier
comprehensibility.