

Feature Selection

Suhang Wang¹, Jiliang Tang², and Huan Liu¹
¹ Arizona State University, Tempe, AZ, USA
² Michigan State University, East Lansing, MI, USA

In: C. Sammut, G.I. Webb (eds.), Encyclopedia of Machine Learning and Data Mining, Springer Science+Business Media New York 2016. DOI 10.1007/978-1-4899-7502-7_101-1

Abstract

Data dimensionality is growing rapidly, which poses challenges to the vast majority of existing mining and learning algorithms, such as the curse of dimensionality, large storage requirements, and high computational cost. Feature selection has been proven to be an effective and efficient way to prepare high-dimensional data for data mining and machine learning. The recent emergence of novel techniques and of new types of data and features not only advances existing feature selection research but also keeps feature selection evolving, making it applicable to a broader range of applications. In this entry, we aim to provide a basic introduction to feature selection, including basic concepts, classifications of existing systems, recent developments, and applications.

Synonyms

Attribute selection; Feature subset selection; Feature weighting

Definition (or Synopsis)

Feature selection, as a dimensionality reduction technique, aims to choose a small subset of the relevant features from the original ones by removing irrelevant, redundant, or noisy features. Feature selection usually leads to better learning performance, i.e., higher learning accuracy, lower computational cost, and better model interpretability.

Generally speaking, irrelevant features are features that cannot help discriminate samples from different classes (supervised) or clusters (unsupervised). Removing irrelevant features will not affect learning performance. In fact, the removal of irrelevant features may help learn a better model, as irrelevant features may confuse the learning system and cause memory and computation inefficiency. For example, in Fig. 1a, f1 is a relevant feature because f1 can discriminate class1 and class2. In Fig. 1b, f2 is an irrelevant feature because f2 cannot distinguish points from class1 and class2; removing f2 does not affect the ability of f1 to distinguish samples from class1 and class2.

A redundant feature is a feature that implies the co-presence of another feature. Individually, each redundant feature is relevant, but removal of one of them will not affect the learning performance. For example, in Fig. 1c, f1 and f6 are strongly correlated. f6 is a relevant feature by itself; however, when f1 is selected first, the later inclusion of f6 provides no additional information. Instead, it only adds memory and computational requirements to learning the classification model.
Feature Selection, Fig. 1 A toy example to illustrate the concept of irrelevant, redundant, and noisy features. f1 is a relevant feature and can discriminate class1 and class2. f2 is an irrelevant feature; removal of f2 will not affect the learning performance. f4 is a noisy feature; the presence of noisy features may degenerate the learning performance. f6 is a redundant feature when f1 is present; if f1 is selected, removal of f6 will not affect the learning performance. (a) Relevant feature. (b) Irrelevant feature. (c) Redundant feature. (d) Noisy feature

A noisy feature is a type of relevant feature. However, due to noise introduced during the data collection process, or because of the nature of the feature, a noisy feature may not be very relevant to the learning or mining task. As shown in Fig. 1d, f4 is a noisy feature: it can discriminate a part of the points from the two classes but may confuse the learning model for the overlapping points. (Noisy features are very subtle. One feature may be a noisy feature by itself; however, in some cases, two or more noisy features can complement each other to distinguish samples from different classes, and they may then be selected together to benefit the learning model.)
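To make these notions concrete, the following minimal sketch builds a toy dataset in the spirit of Fig. 1 and scores each feature with simple statistics: a Fisher-style separation score as a rough relevance measure and a feature-feature correlation as a redundancy check. The feature names mirror the figure; the data and the scoring choices are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)                    # two classes

f1 = 1.5 * y + rng.normal(0, 0.3, n)         # relevant: class means differ clearly
f2 = rng.uniform(0, 1, n)                    # irrelevant: independent of the label
f4 = 1.5 * y + rng.normal(0, 1.5, n)         # noisy: separates the classes only partially
f6 = 2.0 * f1 + rng.normal(0, 0.05, n)       # redundant: almost a copy of f1

def separation(x, y):
    """Fisher-style score: between-class distance over within-class variance."""
    m0, m1 = x[y == 0].mean(), x[y == 1].mean()
    v0, v1 = x[y == 0].var(), x[y == 1].var()
    return (m0 - m1) ** 2 / (v0 + v1 + 1e-12)

for name, f in [("f1", f1), ("f2", f2), ("f4", f4), ("f6", f6)]:
    print(name, "separation score:", round(separation(f, y), 2))

# Redundancy check: |corr(f1, f6)| is close to 1, so keeping both adds little.
print("corr(f1, f6):", round(np.corrcoef(f1, f6)[0, 1], 2))
```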
Motivation and Background

In many real-world applications, such as data mining, machine learning, computer vision, and bioinformatics, we need to deal with high-dimensional data. In the past 30 years, the dimensionality of the data involved in these areas has increased explosively. The growth of the number of attributes in the UCI machine learning repository is shown in Fig. 2a. In addition, the number of samples has also increased explosively; the growth of the number of samples in the UCI machine learning repository is shown in Fig. 2b. Such huge amounts of high-dimensional data present serious challenges to existing learning methods. First, due to the large number of features and the relatively small number of training samples, a learning model tends to overfit, and its learning performance degenerates.

Feature Selection, Fig. 2 Growth of the number of features and the number of samples in the UCI ML repository (both on a log scale). (a) UCI ML repository attribute growth. (b) UCI ML repository sample size growth

Data with high dimensionality not only degenerates the performance of many algorithms, due to the curse of dimensionality and the existence of irrelevant, redundant, and noisy dimensions; it also significantly increases the time and memory requirements of the algorithms. Second, storing and processing such amounts of high-dimensional data becomes a challenge.

Dimensionality reduction is one of the most popular techniques to reduce dimensionality and can be categorized into feature extraction and feature selection. Both feature extraction and feature selection are capable of improving learning performance, lowering computational complexity, building better generalization models, and decreasing required storage. Feature extraction maps the original feature space to a new feature space of lower dimensionality by combining the original features. Further analysis of the new features is therefore problematic, since the transformed features obtained from feature extraction have no physical meaning.

In contrast, feature selection selects a subset of features from the original feature set. Therefore, feature selection keeps the actual meaning of each selected feature, which makes it superior in terms of feature readability and interpretability.

Structure of the Learning System

From the perspective of label availability, feature selection methods can be broadly classified into supervised, unsupervised, and semi-supervised methods. In terms of selection strategy, feature selection can be categorized into filter, wrapper, and embedded models. Figure 3 shows the classification of feature selection methods.

Supervised feature selection is usually used for classification tasks. The availability of class labels allows supervised feature selection algorithms to effectively select discriminative features that distinguish samples from different classes. A general framework of supervised feature selection is shown in Fig. 4a. Features are first generated from training data. Instead of using all the data to train the supervised learning model, supervised feature selection first selects a subset of features and then passes the data with the selected features to the learning model. The feature selection phase uses the label information and the characteristics of the data, such as information gain or the Gini index, to select relevant features. The selected features, together with the label information, are then used to train a classifier, which can be used for prediction.
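As one concrete instance of this pipeline, the sketch below selects the features with the highest mutual information with the label (an information-gain-style criterion) and trains a classifier on them. It assumes scikit-learn is available; the dataset and the choice of k = 10 are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Rank features by mutual information with the class label, keep the top 10,
# then train the classifier on the reduced representation.
model = make_pipeline(
    SelectKBest(score_func=mutual_info_classif, k=10),
    LogisticRegression(max_iter=5000),
)
model.fit(X_tr, y_tr)
print("accuracy with 10 selected features:", round(model.score(X_te, y_te), 3))
```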

Feature Selection, Fig. 3 Feature selection categories

Feature Selection, Fig. 4 General frameworks of supervised and unsupervised feature selection. (a) A general framework of supervised feature selection. (b) A general framework of unsupervised feature selection

Unsupervised feature selection is usually used for clustering tasks. A general framework of unsupervised feature selection is described in Fig. 4b; it is very similar to supervised feature selection, except that no label information is involved in the feature selection phase or the model learning phase. Without label information to define feature relevance, unsupervised feature selection relies on alternative criteria during the feature selection phase. One commonly used criterion is to choose features that can best preserve the manifold structure of the original data. Another frequently used approach is to seek cluster indicators through clustering algorithms and then transform unsupervised feature selection into a supervised framework. There are two different ways to use this approach. One way is to seek cluster indicators and simultaneously perform the supervised feature selection within one unified framework. The other way is to first seek cluster indicators, then perform feature selection to remove or select certain features, and repeat these two steps iteratively until a certain criterion is met. In addition, certain supervised feature selection criteria can still be used with some modification.
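The manifold-preservation criterion can be illustrated with a simplified Laplacian-Score-style sketch: features that vary smoothly over a nearest-neighbor graph of the data are preferred. This is only one possible instantiation of the idea and assumes numpy and scikit-learn; the neighborhood size is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import kneighbors_graph

X = load_iris().data                         # labels are not used below

# Symmetric k-NN affinity graph approximating the data manifold.
W = kneighbors_graph(X, n_neighbors=5, mode="connectivity").toarray()
W = np.maximum(W, W.T)
D = np.diag(W.sum(axis=1))
L = D - W                                    # graph Laplacian

ones = np.ones(X.shape[0])
scores = []
for j in range(X.shape[1]):
    f = X[:, j]
    f = f - (f @ D @ ones) / (ones @ D @ ones) * ones   # degree-weighted centering
    scores.append((f @ L @ f) / (f @ D @ f))            # smaller = smoother on the graph

print("feature ranking (best first):", np.argsort(scores))
```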
Semi-supervised feature selection is usually used when a small portion of the data is labeled. When such data is given, neither supervised nor unsupervised feature selection may be the best choice. Supervised feature selection might not be able to select relevant features because the labeled data is insufficient to represent the distribution of the features. Unsupervised feature selection does not use the label information, even though label information can provide discriminative information for selecting relevant features. Semi-supervised feature selection, which takes advantage of both labeled and unlabeled data, is a better choice for handling partially labeled data. The general framework of semi-supervised feature selection is the same as that of supervised feature selection, except that the data is only partially labeled. Most existing semi-supervised feature selection algorithms rely on the construction of a similarity matrix and select the features that best fit this similarity matrix. Both the label information and the similarity measure of the labeled and unlabeled data are used to construct the similarity matrix, so that the label information provides discriminative information for selecting relevant features while the unlabeled data provides complementary information.
Filter Models For filter models, features are selected based on the characteristics of the data, without utilizing any learning algorithm. This approach is very efficient. However, it does not consider the biases and heuristics of learning algorithms, so it may miss features that are relevant for the target learning algorithm. A filter algorithm usually consists of two steps. In the first step, features are ranked based on a certain criterion. In the second step, the features with the highest rankings are chosen. Many ranking criteria, which measure different characteristics of the features, have been proposed: the ability to effectively separate samples from different classes by considering between-class and within-class variance, the dependence between the feature and the class label, the correlation between feature and class and between feature and feature, the ability to preserve the manifold structure, the mutual information between the features, and so on.
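The two-step structure of a filter algorithm can be written down generically, as in the sketch below; the absolute feature-label correlation used as the criterion here is only one of the possibilities listed above, and the synthetic data is purely illustrative.

```python
import numpy as np

def filter_select(X, y, k, criterion):
    """Generic two-step filter: score every feature, then keep the top k."""
    scores = np.array([criterion(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]      # indices of the k best-ranked features

# One possible criterion: absolute Pearson correlation between feature and label.
abs_corr = lambda f, y: abs(np.corrcoef(f, y)[0, 1])

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 300)
# Five features with increasing noise levels; the first ones are most relevant.
X = np.column_stack([y + rng.normal(0, s, 300) for s in (0.2, 0.5, 1.0, 3.0, 5.0)])
print("selected feature indices:", filter_select(X, y, k=2, criterion=abs_corr))
```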
Wrapper Models The major disadvantage of the filter approach is that it totally ignores the effects of the selected feature subset on the performance of the clustering or classification algorithm. The optimal feature subset should depend on the specific biases and heuristics of the learning algorithm. Based on this assumption, wrapper models use a specific learning algorithm to evaluate the quality of the selected features. Given a predefined learning algorithm, a general framework of the wrapper model is shown in Fig. 5. The feature search component produces a set of features based on a certain search strategy. The feature evaluation component then uses the predefined learning algorithm to evaluate the performance, which is returned to the feature search component for the next iteration of feature subset selection. The feature set with the best performance is chosen as the final set. The search space for m features is O(2^m). To avoid exhaustive search, a wide range of search strategies can be used, including hill-climbing, best-first, branch-and-bound, and genetic algorithms.
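A minimal wrapper sketch using greedy (hill-climbing) forward search is shown below: the feature search component adds one feature at a time, and the evaluation component is the cross-validated accuracy of a fixed learner. It assumes scikit-learn; the dataset and the choice of learner are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
learner = KNeighborsClassifier(n_neighbors=5)     # the predefined learning algorithm

selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
while remaining:
    # Feature search: greedily try adding each remaining feature.
    trials = [(cross_val_score(learner, X[:, selected + [j]], y, cv=5).mean(), j)
              for j in remaining]
    score, j = max(trials)
    if score <= best_score:                       # stop when no candidate improves
        break
    best_score = score
    selected.append(j)
    remaining.remove(j)

print("selected features:", selected, "cv accuracy:", round(best_score, 3))
```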

Feature Selection, Fig. 5 A general framework of wrapper models

Feature Selection, Fig. 6 Classification of recent developments of feature selection from the feature perspective and the data perspective

Embedded Models Filter models are computationally efficient but totally ignore the biases of the learning algorithm. Compared with filter models, wrapper models obtain better predictive accuracy estimates, since they take the biases of the learning algorithm into account; however, wrapper models are very computationally expensive. Embedded models are a trade-off between the two: they embed feature selection into the model construction. Thus, embedded models take advantage of both filter models and wrapper models: (1) they are far less computationally intensive than wrapper methods, since they do not need to run the learning model many times to evaluate the features, and (2) they include the interaction with the learning model. The biggest difference between wrapper models and embedded models is that wrapper models first train learning models using the candidate features and then perform feature selection by evaluating the features with the learned model, whereas embedded models select features during the process of model construction, without further evaluation of the features.
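A typical embedded example is sparse regularization: an L1 penalty drives many coefficients exactly to zero while the model is being fit, so selection happens as a by-product of model construction. A minimal sketch, assuming scikit-learn (the dataset and regularization strength are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# L1-regularized logistic regression: features whose coefficients are driven
# to zero during training are discarded, so selection is embedded in the fit.
sparse_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1, max_iter=5000)
selector = SelectFromModel(sparse_model).fit(X, y)

mask = selector.get_support()
print("kept", int(mask.sum()), "of", X.shape[1], "features")
```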
Recent Developments

The recent emergence of new machine learning algorithms, such as sparse learning, and of new types of data, such as social media data, has accelerated the evolution of feature selection. In this section, we discuss recent developments of feature selection from both the feature perspective and the data perspective.

From the feature perspective, features can be categorized as static and streaming features, as shown in Fig. 6a. Static features can be further categorized as flat features and structured features. The recent development of feature selection from the feature perspective mainly focuses on streaming and structured features.

Usually we assume that all features are known in advance; such features are designated as static features. In some scenarios, however, new features are sequentially presented to the learning algorithm. For example, Twitter produces more than 250 million tweets per day, and many new words (features), such as abbreviations, are generated.

In these scenarios, the candidate features are generated dynamically, and the size of the feature set is unknown. Such features are usually called streaming features, and feature selection for streaming features is called streaming feature selection.

For flat features, we assume that the features are independent. However, in many real-world applications, features may exhibit certain intrinsic structures, such as overlapping groups, trees, and graph structures. For example, in speech and signal processing, different frequency bands can be represented by groups. Figure 6a shows the classification of structured features. Incorporating knowledge about feature structures may significantly improve the performance of learning models and help select important features. Feature selection algorithms for structured features usually use recently developed sparse learning techniques such as the group lasso and the tree-guided lasso.
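The group lasso penalty encourages whole groups of coefficients to become zero together, which is how structured sparsity selects or discards entire feature groups. The sketch below shows its proximal operator (block soft-thresholding), the core update used inside proximal solvers for group-sparse models; the weight vector and the groups are hypothetical, and a full solver would wrap this in an iterative optimization loop.

```python
import numpy as np

def group_soft_threshold(w, groups, lam):
    """Proximal operator of the group lasso penalty lam * sum_g ||w_g||_2."""
    w = w.copy()
    for g in groups:                               # g: indices of one feature group
        norm = np.linalg.norm(w[g])
        # Shrink the whole group, or zero it out if its norm is below lam.
        w[g] = 0.0 if norm <= lam else (1.0 - lam / norm) * w[g]
    return w

w = np.array([0.9, 1.1, -0.05, 0.04, 0.6, -0.7])
groups = [[0, 1], [2, 3], [4, 5]]                  # hypothetical feature groups
print(group_soft_threshold(w, groups, lam=0.5))    # the middle group is removed entirely
```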
From the data perspective, data can be categorized as streaming data and static data, as shown in Fig. 6b. Static data can be further categorized as independent and identically distributed (i.i.d.) data and heterogeneous data. The recent development of feature selection from the data perspective is mainly concentrated on streaming and heterogeneous data.

Similar to streaming features, streaming data arrives sequentially, and online streaming feature selection has been proposed to deal with it. When new data instances arrive, an online feature selection algorithm needs to determine (1) whether to add the newly generated features from the incoming data to the currently selected features and (2) whether to remove features from the set of currently selected features.
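A naive sketch of this decision process is given below: each arriving feature is kept only if it improves the cross-validated accuracy of a fixed learner. This is merely an illustration, assuming scikit-learn and synthetic features; practical streaming feature selection algorithms rely on more principled relevance and redundancy tests.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 2, n)
learner = LogisticRegression(max_iter=1000)

selected = np.empty((n, 0))                   # no features selected yet
best = 0.5                                    # baseline accuracy of random guessing
for t in range(20):                           # features arrive one at a time
    noise = rng.normal(0, 1, n)
    new_feature = (y + noise if t % 4 == 0 else noise).reshape(-1, 1)
    candidate = np.hstack([selected, new_feature])
    score = cross_val_score(learner, candidate, y, cv=5).mean()
    if score > best + 1e-3:                   # keep the feature only if it helps
        selected, best = candidate, score

print("kept", selected.shape[1], "of 20 streamed features; cv accuracy:", round(best, 3))
```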
Traditional data is usually assumed to be i.i.d., such as text and gene data. However, heterogeneous data, such as linked data, apparently contradicts this assumption. For example, linked data is inherently not i.i.d., since instances are linked and correlated. New types of data cultivate new types of feature selection algorithms correspondingly, such as feature selection for linked data and multi-view and multisource feature selection.

Applications

High-dimensional data is ubiquitous in the real world, which makes feature selection a very popular and practical preprocessing technique for various real-world applications, such as text categorization, remote sensing, image retrieval, microarray analysis, mass spectrum analysis, sequence analysis, and so on.
Text Clustering The task of text clustering is to group similar documents together. In text clustering, a text or document is typically represented as a bag of words, which results in a high-dimensional feature space and a sparse representation: a single document has a sparse vector over the set of all terms. The performance of clustering algorithms degrades dramatically due to the high dimensionality and data sparseness. Therefore, in practice, feature selection is a very important step to reduce the feature space in text clustering.
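A minimal sketch of this preprocessing step is shown below; the tiny corpus and the variance-based term filter are illustrative assumptions, and it relies on scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the rocket launch was delayed by bad weather",
    "nasa scheduled another rocket launch next month",
    "the car engine needs new spark plugs",
    "she repaired the old car engine yesterday",
]

# Bag-of-words/TF-IDF gives a sparse, high-dimensional representation.
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs).toarray()

# A simple unsupervised filter: keep only the k terms with the largest variance
# across documents, then cluster in the reduced space.
k = 6
top_terms = np.argsort(X.var(axis=0))[::-1][:k]
print("kept terms:", [vec.get_feature_names_out()[j] for j in top_terms])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, top_terms])
print("cluster labels:", labels)
```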

Genomic Microarray Data Microarray data is usually "short and fat," i.e., of high dimensionality with a small sample size, which poses a great challenge for computational techniques. Its dimensionality can be up to tens of thousands of genes, while the sample size may only be several hundred. Furthermore, additional experimental complications like noise and variability render the analysis of microarray data an exciting domain. Because of these issues, various feature selection algorithms are adopted to reduce the dimensionality and remove noise in microarray data analysis.

Hyperspectral Image Classification Hyperspectral sensors record the reflectance from the Earth's surface over the full range of solar wavelengths with high spectral resolution, which results in high-dimensional data that contains rich information for a wide range of applications. However, this high-dimensional data also contains many irrelevant, noisy, and redundant features that are not important, useful, or desirable for specific tasks. Feature selection is a critical preprocessing step that reduces the computational cost of hyperspectral data classification by selecting relevant features.

Sequence Analysis In bioinformatics, sequence analysis is a very important process for understanding a sequence's features, functions, structure, or evolution. In addition to basic features that represent the nucleotide or amino acid at each position in a sequence, many other features, such as k-mer patterns, can be derived. By varying the pattern length k, the number of features grows exponentially. However, many of these features are irrelevant or redundant; thus, feature selection techniques are applied to select a relevant feature subset, which is essential for sequence analysis.

Open Problems

Scalability With the rapid growth of dataset sizes, the scalability of current feature selection algorithms may be a big issue, especially for online classifiers. Large data cannot be loaded into memory with a single scan, yet some feature selection methods must scan the full-dimensional data. They usually require a sufficient number of samples to obtain statistically significant results, and it is very difficult to estimate a feature relevance score without considering the density around each sample. Therefore, scalability is a big issue.
Stability Feature selection algorithms are often evaluated through classification accuracy or clustering accuracy. However, the stability of the algorithms is also an important consideration when developing feature selection methods. For example, when feature selection is applied to gene data, domain experts would like to see the same, or at least similar, sets of genes selected each time they obtain new samples with a small amount of perturbation; otherwise, they will not trust the algorithm. However, well-known feature selection methods, especially unsupervised feature selection algorithms, can select features with low stability after perturbation is introduced into the training data. Developing feature selection algorithms with both high accuracy and high stability is still an open problem.
Parameter Selection In feature selection, we usually need to specify the number of features to select, but the optimal number of features for a given dataset is unknown. If the number of selected features is too small, performance degenerates, since some relevant features are eliminated. If the number of selected features is too large, performance may also suffer, since noisy, irrelevant, or redundant features are selected and confuse the learning model. In practice, we grid search the number of features over a range and pick the value that gives relatively better performance with the learning model, which is computationally expensive. In particular, for supervised feature selection, cross-validation can be used to search for the number of features to select. How to automatically determine the best number of selected features remains an open problem.
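The cross-validated grid search described above might look like the following sketch, assuming scikit-learn; the candidate values of k and the dataset are arbitrary illustrations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=5000)),
])

# Cross-validated grid search over the number of selected features k.
search = GridSearchCV(pipe, {"select__k": [5, 10, 15, 20, 25, 30]}, cv=5)
search.fit(X, y)
print("best k:", search.best_params_["select__k"],
      "cv accuracy:", round(search.best_score_, 3))
```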
For many unsupervised feature selection methods, in addition to choosing the optimal number of features, we also need to specify the number of clusters. Since there is no label information and we have limited knowledge about each domain, the actual number of clusters in the data is usually unknown and not well defined. The number of clusters specified by the user will result in different feature subsets being selected by the unsupervised feature selection algorithm. How to choose the number of clusters for unsupervised feature selection is an open problem.

Cross-References

Classification
Clustering
Dimensionality Reduction
Feature Extraction

Recommended Reading

Alelyani S, Tang J, Liu H (2013) Feature selection for clustering: a review. In: Aggarwal CC (ed) Data clustering: algorithms and applications, vol 29. CRC Press, Hoboken
Dy JG, Brodley CE (2004) Feature selection for unsupervised learning. J Mach Learn Res 5:845–889
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
Jain A, Zongker D (1997) Feature selection: evaluation, application, and small sample performance. IEEE Trans Pattern Anal Mach Intell 19(2):153–158
Kohavi R, John GH (1997) Wrappers for feature subset selection. Artif Intell 97(1):273–324
Koller D, Sahami M (1996) Toward optimal feature selection. Technical report, Stanford InfoLab
Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, Liu H (2016) Feature selection: a data perspective. arXiv preprint arXiv:1601.07996
Liu H, Motoda H (2007) Computational methods of feature selection. CRC Press, New York
Liu H, Yu L (2005) Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowl Data Eng 17(4):491–502
Liu H, Motoda H, Setiono R, Zhao Z (2010) Feature selection: an ever evolving frontier in data mining. In: FSDM, Hyderabad, pp 4–13
Saeys Y, Inza I, Larrañaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23(19):2507–2517
Tang J, Liu H (2012) Feature selection with linked data in social media. In: SDM, Anaheim. SIAM, pp 118–128
Tang J, Alelyani S, Liu H (2014) Feature selection for classification: a review. In: Aggarwal CC (ed) Data classification: algorithms and applications. Chapman & Hall/CRC, Boca Raton, p 37
Wu X, Yu K, Ding W, Wang H, Zhu X (2013) Online feature selection with streaming features. IEEE Trans Pattern Anal Mach Intell 35(5):1178–1192
Zhao ZA, Liu H (2011) Spectral feature selection for data mining. Chapman & Hall/CRC, Boca Raton
Zhao Z, Morstatter F, Sharma S, Alelyani S, Anand A, Liu H (2010) Advancing feature selection research. ASU feature selection repository, 1–28

