[Figure 1: four scatter plots of class1 and class2 samples plotted over feature f1 paired with f2, f4, and f6.]

Feature Selection, Fig. 1 A toy example to illustrate the concepts of relevant, irrelevant, redundant, and noisy features. f1 is a relevant feature and can discriminate class1 and class2. f2 is an irrelevant feature; removal of f2 will not affect the learning performance. f4 is a noisy feature; the presence of noisy features may degrade the learning performance. f6 is a redundant feature when f1 is present; if f1 is selected, removal of f6 will not affect the learning performance. (a) Relevant feature. (b) Irrelevant feature. (c) Redundant feature. (d) Noisy feature
[Figure 2: two plots of (a) number of attributes (log scale) and (b) sample size (log scale) versus year, 1985 to 2010.]

Feature Selection, Fig. 2 Growth of the number of features and the number of samples in the UCI ML repository. (a) Growth of the number of attributes in the UCI ML repository. (b) Growth of the number of samples in the UCI ML repository
First, data with high dimensionality not only degrades the performance of many algorithms, owing to the curse of dimensionality and the existence of irrelevant, redundant, and noisy dimensions, but also significantly increases the time and memory requirements of those algorithms. Second, storing and processing such amounts of high-dimensional data become a challenge.

Dimensionality reduction is one of the most popular techniques for dealing with high-dimensional data and can be categorized into feature extraction and feature selection. Both feature extraction and feature selection are capable of improving learning performance, lowering computational complexity, building better generalization models, and decreasing the required storage. Feature extraction maps the original feature space to a new, lower-dimensional feature space by combining the original features. Further analysis of the new features is therefore problematic, since the transformed features produced by feature extraction have no physical meaning. In contrast, feature selection selects a subset of features from the original feature set, so it keeps the actual meaning of each selected feature. This makes feature selection superior in terms of feature readability and interpretability.
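To make this contrast concrete, the following minimal sketch (assuming NumPy and scikit-learn, which the entry itself does not prescribe) reduces a toy dataset to two dimensions twice: once by feature extraction with PCA, whose outputs are combinations of all original features, and once by feature selection, which simply keeps two of the original, interpretable columns.

```python
# Minimal sketch (assumes scikit-learn/NumPy): feature extraction vs. feature selection.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data: 6 features, only some of which are informative.
X, y = make_classification(n_samples=200, n_features=6, n_informative=2,
                           n_redundant=1, random_state=0)

# Feature extraction: PCA builds 2 new features as linear combinations of all
# original features; the new axes have no direct physical meaning.
X_extracted = PCA(n_components=2).fit_transform(X)

# Feature selection: keep 2 of the original columns, so the result is still
# expressed in terms of the original, interpretable features.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
kept = selector.get_support(indices=True)
X_selected = X[:, kept]

print("extracted shape:", X_extracted.shape)           # (200, 2), synthetic axes
print("selected original feature indices:", kept)      # indices of real columns
```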
Structure of the Learning System

From the perspective of label availability, feature selection methods can be broadly classified into supervised, unsupervised, and semi-supervised methods. In terms of selection strategy, feature selection can be categorized into filter, wrapper, and embedded models. Figure 3 shows this classification of feature selection methods.

Supervised feature selection is usually used for classification tasks. The availability of class labels allows supervised feature selection algorithms to effectively select discriminative features that distinguish samples from different classes. A general framework of supervised feature selection is shown in Fig. 4a. Features are first generated from the training data. Instead of using all the data to train the supervised learning model, supervised feature selection first selects a subset of features and then passes the data with the selected features to the learning model. The feature selection phase uses the label information and characteristics of the data, such as information gain or the Gini index, to select relevant features. The final selected features, together with the label information, are used to train a classifier, which can then be used for prediction.
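A hedged sketch of this supervised pipeline is shown below; it is not the entry's prescribed implementation. It ranks features by a simple information-gain criterion, binarizing each feature at its median (the discretization and the use of scikit-learn's logistic regression as the downstream classifier are assumptions made for illustration), and then trains the classifier on the selected subset only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def entropy(labels):
    """Shannon entropy of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """Information gain of one feature, binarized at its median (an assumption)."""
    split = feature > np.median(feature)
    h_after = sum(mask.mean() * entropy(labels[mask])
                  for mask in (split, ~split) if mask.any())
    return entropy(labels) - h_after

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)

# Feature selection phase: rank features by information gain, keep the top 3.
gains = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
selected = np.argsort(gains)[::-1][:3]

# Model learning phase: train the classifier on the selected features only.
clf = LogisticRegression(max_iter=1000).fit(X[:, selected], y)
print("selected features:", selected,
      "training accuracy:", clf.score(X[:, selected], y))
```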
Unsupervised feature selection is usually used for clustering tasks. A general framework of unsupervised feature selection is described in Fig. 4b; it is very similar to supervised feature selection, except that no label information is involved in the feature selection phase or the model learning phase. Without label information to define feature relevance, unsupervised feature selection relies on alternative criteria during the feature selection phase. One commonly used criterion chooses features that can best preserve the manifold structure of the original data. Another frequently used approach is to seek cluster indicators through clustering algorithms and then transform the unsupervised feature selection problem into a supervised framework. There are two ways to use this approach. One is to seek cluster indicators and simultaneously perform supervised feature selection within one unified framework. The other is to first seek cluster indicators, then perform feature selection to remove or select certain features, and repeat these two steps iteratively until a stopping criterion is met. In addition, certain supervised feature selection criteria can still be used with some modification.
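The second strategy, seeking cluster indicators and then reusing a supervised criterion, might be sketched as follows. The use of k-means pseudo-labels and an ANOVA F-statistic ranking is an illustrative assumption, not a specific published algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.feature_selection import f_classif

# Unlabeled toy data: 3 latent clusters in 4 features, plus 4 pure-noise features.
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)
X = np.hstack([X, np.random.RandomState(0).normal(size=(300, 4))])

# Step 1: seek cluster indicators with a clustering algorithm.
pseudo_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Step 2: treat the cluster indicators as labels and apply a supervised
# criterion (here an ANOVA F-statistic) to rank the original features.
scores, _ = f_classif(X, pseudo_labels)
ranking = np.argsort(scores)[::-1]
print("features ranked by relevance to the cluster structure:", ranking)
```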
Semi-supervised feature selection is usually used when a small portion of the data is labeled. When such data is given, neither supervised nor unsupervised feature selection might be the best choice. Supervised feature selection might not be able to select relevant features because the labeled data are insufficient to represent the distribution of the features. Unsupervised feature selection does not use the label information, yet label information can provide discriminative information for selecting relevant features. Semi-supervised feature selection, which takes advantage of both labeled and unlabeled data, is a better choice for handling partially labeled data. The general framework of semi-supervised feature selection is the same as that of supervised feature selection, except that the data is only partially labeled. Most existing semi-supervised feature selection algorithms rely on the construction of a similarity matrix and select the features that best fit this similarity matrix. Both the label information and the similarity measure of the labeled and unlabeled data are used to construct the similarity matrix, so that the label information provides discriminative information for selecting relevant features, while the unlabeled data provide complementary information.
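A deliberately simplified sketch of this idea is given below; it is a stand-in, not a particular published semi-supervised method. It combines a supervised relevance score computed on the small labeled portion with an unsupervised smoothness score computed from a kNN similarity matrix over all samples; the kNN graph, the scoring functions, and the equal weighting are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.neighbors import kneighbors_graph

# Partially labeled toy data: only the first 30 of 300 samples keep their labels.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
labeled = np.arange(30)

# Similarity matrix over ALL samples (labeled and unlabeled): a kNN graph.
W = kneighbors_graph(X, n_neighbors=10, mode="connectivity")
W = 0.5 * (W + W.T)                      # symmetrize
deg = np.asarray(W.sum(axis=1)).ravel()  # node degrees

def smoothness(f):
    """Graph smoothness of feature f: small if f varies little across similar samples."""
    f = (f - f.mean()) / (f.std() + 1e-12)
    return deg @ (f ** 2) - f @ (W @ f)  # equals f^T (D - W) f

# Supervised relevance from the labeled portion, unsupervised fit to the graph
# from all samples; combined with equal weights (an arbitrary illustrative choice).
sup, _ = f_classif(X[labeled], y[labeled])
unsup = np.array([smoothness(X[:, j]) for j in range(X.shape[1])])
score = sup / sup.max() - unsup / unsup.max()
print("features ranked by the combined score:", np.argsort(score)[::-1])
```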
Filter Models For filter models, features are selected based on characteristics of the data, without utilizing any learning algorithm. This approach is very efficient. However, it does not consider the biases and heuristics of learning algorithms, so it may miss features that are relevant for the target learning algorithm. A filter algorithm usually consists of two steps. In the first step, features are ranked according to a certain criterion. In the second step, the features with the highest rankings are chosen. Many ranking criteria, which measure different characteristics of the features, have been proposed: the ability to effectively separate samples from different classes by considering between-class and within-class variance, the dependence between a feature and the class label, feature-class and feature-feature correlations, the ability to preserve the manifold structure, the mutual information between features, and so on.
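As one concrete instance of the first criterion above, the sketch below computes a Fisher-score-style ratio of between-class to within-class variance for each feature and keeps the top-ranked ones; the exact formula variant and the number of kept features are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification

def fisher_scores(X, y):
    """Between-class vs. within-class variance ratio for each feature."""
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / (within + 1e-12)

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)

# Step 1: rank features by the criterion; Step 2: keep the highest ranked.
scores = fisher_scores(X, y)
top_k = np.argsort(scores)[::-1][:3]
print("filter-selected features:", top_k)
```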
Wrapper Models The major disadvantage of the filter approach is that it totally ignores the effect of the selected feature subset on the performance of the clustering or classification algorithm. The optimal feature subset should depend on the specific biases and heuristics of the learning algorithm. Based on this assumption, wrapper models use a specific learning algorithm to evaluate the quality of the selected features. Given a predefined learning algorithm, a general framework of the wrapper model is shown in Fig. 5. The feature search component produces a set of features based on a certain search strategy. The feature evaluation component then uses the predefined learning algorithm to evaluate the performance, which is returned to the feature search component for the next iteration of feature subset selection. The feature set with the best performance is chosen as the final set. The search space for m features is O(2^m), so to avoid exhaustive search, a wide range of search strategies can be used, including hill-climbing, best-first, branch-and-bound, and genetic algorithms.
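A greedy forward search, one of the hill-climbing style strategies mentioned above, wrapped around a fixed learner might look like the following sketch; the k-nearest-neighbor classifier, 5-fold cross-validation, and the budget of three features are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)
learner = KNeighborsClassifier()          # the predefined learning algorithm

selected, remaining = [], list(range(X.shape[1]))
for _ in range(3):                        # greedy forward search, budget of 3 features
    best_f, best_acc = None, -np.inf
    for f in remaining:
        candidate = selected + [f]
        # Feature evaluation: run the learner on the candidate subset.
        acc = cross_val_score(learner, X[:, candidate], y, cv=5).mean()
        if acc > best_acc:
            best_f, best_acc = f, acc
    selected.append(best_f)
    remaining.remove(best_f)
    print(f"added feature {best_f}, cross-validated accuracy {best_acc:.3f}")

print("wrapper-selected subset:", selected)
```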
Sequence Analysis In bioinformatics, sequence analysis is a very important process for understanding a sequence's features, functions, structure, or evolution. In addition to the basic features that represent the nucleotide or amino acid at each position in a sequence, many other features, such as k-mer patterns, can be derived. By varying the pattern length k, the number of features grows exponentially. However, many of these features are irrelevant or redundant; thus, feature selection techniques are applied to select a relevant feature subset and are essential for sequence analysis.
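As an illustration, k-mer count features can be generated as in the sketch below (plain Python; the toy DNA sequences are invented for the example). Over the four-letter nucleotide alphabet there are 4^k possible k-mers, so the feature count grows exponentially with the pattern length k, and most of the resulting counts end up irrelevant or redundant.

```python
from collections import Counter
from itertools import product

def kmer_features(sequence, k):
    """Count occurrences of every length-k pattern over the DNA alphabet."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]  # 4**k features
    return [counts.get(kmer, 0) for kmer in vocabulary]

sequences = ["ACGTACGTGACT", "TTACGGACGTAC"]   # toy sequences
for k in (2, 3, 4):
    dim = len(kmer_features(sequences[0], k))
    print(f"k={k}: {dim} k-mer features per sequence")  # 16, 64, 256
```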
Open Problems

Scalability With the rapid growth of dataset sizes, the scalability of current feature selection algorithms may be a big issue, especially for online classifiers. Large data often cannot be loaded into memory and may permit only a single scan. However, some feature selection methods have to scan the data in its full dimensionality, and they usually require a sufficient number of samples to obtain statistically significant results. It is also very difficult to estimate a feature relevance score without considering the density around each sample. Therefore, scalability is a big issue.
Stability Feature selection algorithms are often evaluated through classification accuracy or clustering accuracy. However, the stability of an algorithm is also an important consideration when developing feature selection methods. For example, when feature selection is applied to gene data, domain experts would like to see the same, or at least similar, sets of genes selected each time they obtain new samples with a small amount of perturbation. Otherwise, they will not trust the algorithm. However, well-known feature selection methods, especially unsupervised feature selection algorithms, can select features with low stability after perturbation is introduced to the training data. Developing feature selection algorithms with both high accuracy and high stability is still an open problem.
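One common way to quantify this, sketched below with a simple filter selector and bootstrap resampling standing in for the perturbation (both are assumptions for the example), is to compare the feature subsets selected on perturbed copies of the training data using their average pairwise Jaccard similarity, where 1.0 means the same subset is selected every time.

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           random_state=0)
rng = np.random.RandomState(0)

# Select k features on several bootstrap perturbations of the training data.
subsets = []
for _ in range(10):
    idx = rng.choice(len(X), size=len(X), replace=True)
    selector = SelectKBest(f_classif, k=4).fit(X[idx], y[idx])
    subsets.append(set(selector.get_support(indices=True)))

# Stability = average pairwise Jaccard similarity between the selected subsets.
jaccard = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2)]
print("selection stability (mean Jaccard):", round(float(np.mean(jaccard)), 3))
```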
Parameter Selection In feature selection, we usually need to specify the number of features to select. However, the optimal number of features for a dataset is unknown. If the number of selected features is too small, performance will degrade because some relevant features are eliminated. If the number of selected features is too large, performance may also suffer because some noisy, irrelevant, or redundant features are selected and confuse the learning model. In practice, we would grid search the number of features over a range and pick the value that gives relatively better performance for the learning model, which is computationally expensive. In particular, for supervised feature selection, cross-validation can be used to search for the number of features to select. How to automatically determine the best number of selected features remains an open problem.

For many unsupervised feature selection methods, in addition to choosing the optimal number of features, we also need to specify the number of clusters. Since there is no label information and we have limited knowledge about each domain, the actual number of clusters in the data is usually unknown and not well defined. Different numbers of clusters specified by the user will lead the unsupervised feature selection algorithm to select different feature subsets. How to choose the number of clusters for unsupervised feature selection is an open problem.
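The grid-search-with-cross-validation strategy described above might be sketched as follows; the scikit-learn pipeline, the candidate values of k, and the logistic regression model are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           random_state=0)

# Pipeline: select k features, then fit the classifier on the selected subset.
pipe = Pipeline([("select", SelectKBest(f_classif)),
                 ("clf", LogisticRegression(max_iter=1000))])

# Grid search the number of selected features with cross-validation.
grid = GridSearchCV(pipe, {"select__k": [2, 5, 10, 20, 30]}, cv=5)
grid.fit(X, y)
print("best number of features:", grid.best_params_["select__k"])
print("cross-validated accuracy:", round(grid.best_score_, 3))
```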
Cross-References

Classification
Clustering
Dimensionality Reduction
Feature Extraction

Recommended Reading

Alelyani S, Tang J, Liu H (2013) Feature selection for clustering: a review. In: Aggarwal CC (ed) Data clustering: algorithms and applications, vol 29. CRC Press, Hoboken
Dy JG, Brodley CE (2004) Feature selection for unsupervised learning. J Mach Learn Res 5:845–889
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
Jain A, Zongker D (1997) Feature selection: evaluation, application, and small sample performance. IEEE Trans Pattern Anal Mach Intell 19(2):153–158
Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1):273–324
Koller D, Sahami M (1996) Toward optimal feature selection. Technical report, Stanford InfoLab
Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, Liu H (2016) Feature selection: a data perspective. arXiv preprint arXiv:1601.07996
Liu H, Motoda H (2007) Computational methods of feature selection. CRC Press, New York
Liu H, Yu L (2005) Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowl Data Eng 17(4):491–502
Liu H, Motoda H, Setiono R, Zhao Z (2010) Feature selection: an ever evolving frontier in data mining. In: FSDM, Hyderabad, pp 4–13
Saeys Y, Inza I, Larrañaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23(19):2507–2517
Tang J, Liu H (2012) Feature selection with linked data in social media. In: SDM, Anaheim. SIAM, pp 118–128
Tang J, Alelyani S, Liu H (2014) Feature selection for classification: a review. In: Aggarwal CC (ed) Data classification: algorithms and applications. Chapman & Hall/CRC, Boca Raton, p 37
Wu X, Yu K, Ding W, Wang H, Zhu X (2013) Online feature selection with streaming features. IEEE Trans Pattern Anal Mach Intell 35(5):1178–1192
Zhao ZA, Liu H (2011) Spectral feature selection for data mining. Chapman & Hall/CRC, Boca Raton
Zhao Z, Morstatter F, Sharma S, Alelyani S, Anand A, Liu H (2010) Advancing feature selection research. ASU feature selection repository, 1–28