Pert 3: Advanced Feature Selection Technique

The document provides an overview of feature selection methods in data mining, emphasizing the importance of selecting relevant input variables to improve model performance and understanding. It differentiates between feature selection, which involves choosing a subset of existing features, and feature extraction, which creates new features from the original data. Various methods for feature selection, including filtering and wrapper methods, are discussed along with their advantages and disadvantages.


Feature Selection Methods

An overview
Thanks to Qiang Yang
Modified by Charles Ling

What is Feature Selection?

• Feature selection: the problem of selecting some subset of a learning
  algorithm's input variables upon which it should focus attention, while
  ignoring the rest (dimensionality reduction).

• Humans/animals do that constantly!

Motivational example from [1]: Biology

Monkeys performing a classification task.

[1] N. Sigala, N. Logothetis: Visual categorization shapes feature selectivity
    in the primate temporal cortex. Nature 415 (2002).
Motivational example from Biology

Monkeys performing a classification task.

Diagnostic features:
- Eye separation
- Eye height

Non-diagnostic features:
- Mouth height
- Nose length
Motivational example from Biology

Monkeys performing a classification task. Results:

• The activity of a population of 150 neurons in the anterior inferior
  temporal cortex was measured.
• 44 neurons responded significantly differently to at least one feature.
• After training, 72% (32/44) were selective to one or both of the diagnostic
  features (and not to the non-diagnostic features).
Motivational example from Biology

Monkeys performing a classification task. Results (single neurons):

"The data from the present study indicate that neuronal selectivity was shaped
by the most relevant subset of features during the categorization training."
Feature selection

• Reducing the feature space by throwing out some of the features
  (covariates); also called variable selection.
• Motivating idea: try to find a simple, "parsimonious" model.
• Occam's razor: the simplest explanation that accounts for the data is best.
Feature extraction

• Feature extraction is a process that extracts a new set of features from the
  original data through a numerical (functional) mapping.
• Idea:
  - Given data points in d-dimensional space,
  - project into a lower-dimensional space while preserving as much
    information as possible.
  - E.g., find the best planar approximation to 3D data.
  - E.g., find the best planar approximation to 10^4-dimensional data.
Feature Selection vs Feature Extraction

• They differ in two ways:
  - Feature selection chooses a subset of the original features:

    $\{f_1, \dots, f_i, \dots, f_n\} \xrightarrow{\text{f. selection}} \{f_{i_1}, \dots, f_{i_j}, \dots, f_{i_m}\}$

  - Feature extraction creates new features (dimensions) defined as functions
    over all features:

    $\{f_1, \dots, f_i, \dots, f_n\} \xrightarrow{\text{f. extraction}} \{g_1(f_1, \dots, f_n), \dots, g_j(f_1, \dots, f_n), \dots, g_m(f_1, \dots, f_n)\}$
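To make the contrast concrete, here is a minimal Python sketch. scikit-learn
and its iris toy dataset are assumed purely for illustration (they are not
part of these slides): selection keeps some of the original columns unchanged,
while extraction builds new columns as functions of all of them.

# Minimal sketch contrasting selection and extraction (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)            # n samples, d = 4 original features

# Feature selection: keep a subset {f_i1, ..., f_im} of the original columns.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
X_sel = selector.transform(X)                # 2 of the 4 original columns, unchanged
print("selected column indices:", selector.get_support(indices=True))

# Feature extraction: build new features g_j(f_1, ..., f_n) from ALL columns.
extractor = PCA(n_components=2).fit(X)
X_ext = extractor.transform(X)               # 2 new columns, each a mix of all 4
print("each new feature mixes all originals:\n", extractor.components_)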
Outline

• What is Feature Reduction?
  - Feature Selection
  - Feature Extraction
• Why need Feature Reduction?
• Feature Selection Methods
  - Filter
  - Wrapper
• Feature Extraction Methods
  - Linear
  - Nonlinear
Motivation

• The objective of feature reduction is three-fold:
  - Improving the prediction performance of the predictors (accuracy)
  - Providing faster and more cost-effective predictors (CPU time)
  - Providing a better understanding of the underlying process that generated
    the data (understanding)
Feature reduction -- examples

Task 1: classify whether a document is about cats.
Data: word counts in the document.

    X                      Reduced X
    cat        2           cat     2
    and       35           kitten  8
    it        20           feline  2
    kitten     8
    electric   2
    trouble    4
    then       5
    several    9
    feline     2
    while      4
    lemon      2
    ...

Task 2: predict chances of lung disease.
Data: medical history survey.

    X                               Reduced X
    Vegetarian          No          Family history  No
    Plays video games   Yes         Smoker          Yes
    Family history      No
    Athletic            No
    Smoker              Yes
    Sex                 Male
    Lung capacity       5.8 L
    Hair color          Red
    Car                 Audi
    Weight              185 lbs
    ...
Feature reduction in task 1

• Task 1: we're interested in prediction; the features are not interesting in
  themselves, we just want to build a good classifier (or other kind of
  predictor).
• Text classification:
  - Features for all 10^5 English words, and maybe all word pairs.
  - Common practice: throw in every feature you can think of, and let feature
    selection get rid of the useless ones.
  - Training is too expensive with all features.
  - The presence of irrelevant features hurts generalization.
Feature reduction in task 2

• Task 2: we're interested in the features themselves; we want to know which
  are relevant. If we fit a model, it should be interpretable.
• What causes lung cancer?
  - Features are aspects of a patient's medical history.
  - Binary response variable: did the patient develop lung cancer?
  - Which features best predict whether lung cancer will develop? We might
    want to legislate against these features.
Get at Case 2 through Case 1

• Even if we just want to identify features, it can be useful to pretend we
  want to do prediction.
• Relevant features are (typically) exactly those that most aid prediction.
• But not always: highly correlated features may be redundant, yet both
  interesting as "causes".
  - e.g., smoking in the morning, smoking at night
Outline

• What is Feature Reduction?
  - Feature Selection
  - Feature Extraction
• Why need Feature Reduction?
• Feature Selection Methods
  - Filtering
  - Wrapper
• Feature Extraction Methods
  - Linear
  - Nonlinear
Filtering methods

Basic idea: assign a score to each feature f indicating how "related" x_f and
y are.

• Intuition: if x_{i,f} = y_i for all i, then f is good no matter what our
  model is; it contains all the information about y.
• Many popular scores [see Yang and Pederson '97]:
  - Classification with categorical data: chi-squared, information gain.
    (Binning can be used to make continuous data categorical.)
  - Regression: correlation, mutual information.
  - Markov blanket [Koller and Sahami '96].
• Then somehow pick how many of the highest-scoring features to keep (nested
  models).
Filtering methods

• Advantages:
  - Very fast
  - Simple to apply
• Disadvantages:
  - Doesn't take into account which learning algorithm will be used.
  - Doesn't take into account correlations between features.
    - This can be an advantage if we're only interested in ranking the
      relevance of features, rather than performing prediction.
    - It is also a significant disadvantage (see homework).
• Suggestion: use light filtering as an efficient initial step if there are
  many obviously irrelevant features.
  - Caveat here too: apparently useless features can be useful when grouped
    with others.
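A minimal filter-method sketch in Python (scikit-learn and its breast-cancer
toy dataset are assumed for illustration; the deck itself demonstrates
filtering in Weka). Each feature is scored against y on its own, here with
mutual information as one of the scores mentioned above, and the top-k are
kept.

# Filter-method sketch: score each feature against y independently, keep top-k.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)
scores = mutual_info_classif(X, y, random_state=0)   # one relevance score per feature

k = 5                                                # size of the nested model to keep
top_k = np.argsort(scores)[::-1][:k]
print("top-k feature indices:", top_k)
print("their scores:", scores[top_k].round(3))
# Each feature is scored in isolation, so correlations between features are
# ignored, which is exactly the disadvantage listed above.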
Wrapper Methods

• The learner is considered a black box.
• The interface of the black box is used to score subsets of variables
  according to the predictive power of the learner when using those subsets.
• Results vary for different learners.
• One needs to define:
  - how to search the space of all possible variable subsets, and
  - how to assess the prediction performance of a learner.
Wrapper Methods

• The problem of finding the optimal subset is NP-hard!
• A wide range of heuristic search strategies can be used. Two different
  classes:
  - Forward selection (start with an empty feature set and add features at
    each step)
  - Backward elimination (start with the full feature set and discard
    features at each step)
• Predictive power is usually measured on a validation set or by
  cross-validation.
• By using the learner as a black box, wrappers are universal and simple!
• Criticism: a large amount of computation is required.
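A minimal wrapper sketch (scikit-learn is assumed; the learner, the CV setting
and the target subset size are illustrative choices, not prescribed by the
slides). The learner is treated as a black box and candidate subsets are
scored by cross-validated accuracy.

# Wrapper-method sketch: forward selection with the learner as a black box.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
learner = LogisticRegression(max_iter=5000)      # any estimator could be plugged in

sfs = SequentialFeatureSelector(
    learner,
    n_features_to_select=5,
    direction="forward",       # "backward" would give backward elimination
    scoring="accuracy",
    cv=5,                      # predictive power measured by cross-validation
)
sfs.fit(X, y)
print("chosen feature indices:", sfs.get_support(indices=True))
# Each round refits the learner (with CV) for every remaining candidate
# feature, which is why wrappers are criticized as computationally expensive.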
Feature selection – search strategy

Method: Exhaustive search
  Property: Evaluate all $\binom{d}{m}$ possible subsets of size m.
  Comments: Guaranteed to find the optimal subset; not feasible for even
  moderately large values of m and d.

Method: Sequential Forward Selection (SFS)
  Property: Select the best single feature and then add one feature at a time
  which, in combination with the selected features, maximizes the criterion
  function.
  Comments: Once a feature is retained, it cannot be discarded; computationally
  attractive since, to select a subset of size 2, it examines only (d-1)
  possible subsets.

Method: Sequential Backward Selection (SBS)
  Property: Start with all d features and successively delete one feature at
  a time.
  Comments: Once a feature is deleted, it cannot be brought back into the
  optimal subset; requires more computation than Sequential Forward Selection.
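To make the SFS row of the table concrete, here is a hand-rolled forward
selection sketch (Python with scikit-learn assumed for the learner and
cross-validation; dataset and subset size are illustrative). Note that a
retained feature is never discarded, and each step only evaluates the
features that are still unselected.

# Hand-rolled Sequential Forward Selection (SFS) sketch.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
d = X.shape[1]
selected, remaining = [], list(range(d))

def subset_score(cols):
    """Criterion function: cross-validated accuracy of the learner on 'cols'."""
    model = LogisticRegression(max_iter=5000)
    return cross_val_score(model, X[:, cols], y, cv=5).mean()

for step in range(3):                                  # grow the subset to size 3
    best = max(remaining, key=lambda f: subset_score(selected + [f]))
    selected.append(best)                              # once retained, never discarded
    remaining.remove(best)
    print(f"step {step + 1}: selected = {selected}")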
Comparison of filter and wrapper

• The wrapper method is tied to a particular classification algorithm, hence
  the criterion can be optimized for it;
  - but it is potentially very time-consuming, since it typically needs to
    evaluate a cross-validation scheme at every iteration.
• The filtering method is much faster, but it does not incorporate learning.
Multivariate FS is complex

Kohavi-John, 1997

• n features: 2^n possible feature subsets!
In practice...

• Univariate feature selection often yields better accuracy results than
  multivariate feature selection.
• NO feature selection at all sometimes gives the best accuracy results, even
  in the presence of known distracters.
• Multivariate methods usually claim only better "parsimony".
• How can we make multivariate FS work better?
  - NIPS 2003 and WCCI 2006 challenges: http://clopinet.com/challenges
Feature Extraction -- Definition

• Given a set of features $F = \{f_1, \dots, f_i, \dots, f_n\}$, the feature
  extraction ("construction") problem is to map F to some feature set F''
  that maximizes the learner's ability to classify patterns:

    $F'' = \arg\max_{G \in \Gamma} \Phi(G)$ *

• This general definition subsumes feature selection (i.e., a feature
  selection algorithm also performs a mapping, but can only map to subsets of
  the input variables).

* $\Gamma$ here is the set of all possible feature sets, and $\Phi(G)$ denotes
  the learner's ability to classify patterns using feature set G.
Linear, Unsupervised Feature Selection

• Question: Are attributes A1 and A2 independent?
• If they are very dependent, we can remove either A1 or A2.
• If A1 is independent of a class attribute A2, we can remove A1 from our
  training data.

    Outlook    Temperature  Humidity  Windy  Class
    sunny      hot          high      false  N
    sunny      hot          high      true   N
    overcast   hot          high      false  P
    rain       mild         high      false  P
    rain       cool         normal    false  P
    rain       cool         normal    true   N
    overcast   cool         normal    true   P
    sunny      mild         high      false  N
    sunny      cool         normal    false  P
    rain       mild         normal    false  P
    sunny      mild         normal    true   P
    overcast   mild         high      true   P
    overcast   hot          normal    false  P
    rain       mild         high      true   N
Chi-Squared Test (cont.)

• Question: Are attributes A1 and A2 independent?
• These features are nominal-valued (discrete).
• Null hypothesis: we expect independence.

    Outlook   Temperature
    Sunny     High
    Cloudy    Low
    Sunny     High
The Weather example: Observed Count

Observed counts (from the records Sunny/High, Cloudy/Low, Sunny/High):

                          Temperature: High   Temperature: Low   Outlook subtotal
    Outlook: Sunny               2                   0                  2
    Outlook: Cloudy              0                   1                  1
    Temperature subtotal:        2                   1           Total count in table = 3
The Weather example: Expected Count

If the attributes were independent, then the cell counts would look like this
(this table is also known as the expected count table):

                      Temperature: High          Temperature: Low           Subtotal
    Outlook: Sunny    3*2/3*2/3 = 4/3 ≈ 1.33     3*2/3*1/3 = 2/3 ≈ 0.67     2 (prob = 2/3)
    Outlook: Cloudy   3*1/3*2/3 = 2/3 ≈ 0.67     3*1/3*1/3 = 1/3 ≈ 0.33     1 (prob = 1/3)
    Subtotal:         2 (prob = 2/3)             1 (prob = 1/3)             Total count in table = 3
Question: How different are the observed and expected counts?

• If the chi-squared value is very large, then A1 and A2 are not independent;
  that is, they are dependent!
• Degrees of freedom: if the table has n*m cells, then freedom = (n-1)*(m-1).
• In our example:
  - Degrees of freedom = 1
  - Chi-squared = ? (see the sketch below)
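A short sketch answering the question above (scipy is assumed; the deck
itself computes this by hand and in Weka). For the observed counts
[[2, 0], [0, 1]] the statistic works out to 3.0 with 1 degree of freedom.

# Chi-squared for the tiny Outlook/Temperature example above.
from scipy.stats import chi2_contingency

observed = [[2, 0],    # Sunny:  High, Low
            [0, 1]]    # Cloudy: High, Low

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print("expected counts:\n", expected)   # [[1.33, 0.67], [0.67, 0.33]]
print("chi-squared =", chi2)            # 3.0
print("degrees of freedom =", dof)      # (2-1)*(2-1) = 1
print("p-value =", round(p, 3))         # ~0.083
# 3.0 < 3.84 (the 0.05 critical value for 1 df), so with only three records
# independence cannot be rejected at the 95% level.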
Chi-Squared Table: what does it mean?

• If the calculated value is much greater than the value in the table, then
  you have reason to reject the independence assumption.
  - When your calculated chi-squared value is greater than the χ² value shown
    in the 0.05 column (3.84) of the table, you are 95% certain that the
    attributes are actually dependent!
  - i.e., there is only a 5% probability that your calculated X² value would
    occur by chance.
Example Revisited (http://helios.bto.ed.ac.uk/bto/statistics/tress9.html)

• We don't have to have a two-dimensional count table (also known as a
  contingency table).
• Suppose that the ratio of male to female students in the Science Faculty is
  exactly 1:1,
• but in the Honours class over the past ten years there have been 80 females
  and 40 males.
• Question: Is this a significant departure from the (1:1) expectation?

    Observed   Male   Female   Total
    Honours     40      80      120
Expected (http://helios.bto.ed.ac.uk/bto/statistics/tress9.html)

• Suppose that the ratio of male to female students in the Science Faculty is
  exactly 1:1,
• but in the Honours class over the past ten years there have been 80 females
  and 40 males.
• Question: Is this a significant departure from the (1:1) expectation?
• Note: the expected counts are filled in from the 1:1 expectation, instead of
  being calculated from the observed table.

    Expected   Male   Female   Total
    Honours     60      60      120
Chi-Squared Calculation

                            Female    Male    Total
    Observed numbers (O)      80       40      120
    Expected numbers (E)      60       60      120
    O - E                     20      -20        0
    (O - E)^2                400      400
    (O - E)^2 / E            6.67     6.67    Sum = 13.34 = X²
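The same calculation as a goodness-of-fit test in code (scipy assumed, not
part of the slides):

# Goodness-of-fit version of the Honours example: observed 80 females and
# 40 males against a 1:1 expectation (60 / 60).
from scipy.stats import chisquare

stat, p = chisquare(f_obs=[80, 40], f_exp=[60, 60])
print("chi-squared =", round(stat, 2))    # 13.33 (the slide's 13.34, rounded)
print("p-value =", round(p, 5))           # ~0.00026, far below 0.05
# 13.33 > 3.84 (critical value for 1 degree of freedom), so the 1:1
# expectation is rejected.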
Chi-Squared Test (cont.)

• Then, check the chi-squared table for significance:
  http://helios.bto.ed.ac.uk/bto/statistics/table2.html#Chi%20squared%20test
• Compare our X² value with a χ² (chi-squared) value in a table of χ² with
  n-1 degrees of freedom
  - (n is the number of categories, i.e., 2 in our case: males and females).
• We have only one degree of freedom (n-1). From the χ² table, we find a
  critical value of 3.84 for p = 0.05.
• 13.34 > 3.84, so the expectation (that Male:Female in the Honours major is
  1:1) is wrong!
• https://www.geeksforgeeks.org/chi-square-test-for-feature-selection-mathematical-explanation/
Chi-Squared Test in Weka:
weather.nominal.arff
@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}


@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no

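The same kind of ranking can be sketched outside Weka (pandas and scipy are
assumed; this is not Weka's implementation, only the same chi-squared idea
applied to each attribute against the class):

# Ranking the nominal weather attributes against 'play' by chi-squared.
import pandas as pd
from scipy.stats import chi2_contingency

arff_rows = """sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no"""

df = pd.DataFrame([r.split(",") for r in arff_rows.splitlines()],
                  columns=["outlook", "temperature", "humidity", "windy", "play"])

for attr in ["outlook", "temperature", "humidity", "windy"]:
    table = pd.crosstab(df[attr], df["play"])              # contingency table
    chi2, p, dof, _ = chi2_contingency(table, correction=False)
    print(f"{attr:12s} chi2 = {chi2:5.2f}  df = {dof}  p = {p:.3f}")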
Chi-Squared Test in Weka

[Weka screenshots: running the chi-squared attribute evaluation on
weather.nominal.arff]
Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

    A4?
    ├── A1?
    │   ├── Class 1
    │   └── Class 2
    └── A6?
        ├── Class 1
        └── Class 2

=> Reduced attribute set: {A1, A4, A6}
Unsupervised Feature Extraction: PCA

• Given N data vectors (samples) from k dimensions (features), find c <= k
  orthogonal dimensions that can be best used to represent the data.
  - The feature set is reduced from k to c.
  - Example: data = a collection of emails; k = 100 word counts; c = 10 new
    features.
• The original data set is reduced by projecting the N data vectors onto the
  c principal components (reduced dimensions).
• Each (old) data vector X_j is a linear combination of the c principal
  component vectors Y_1, Y_2, ..., Y_c through weights W_i:
  - $X_j = m + W_1 Y_1 + W_2 Y_2 + \dots + W_c Y_c, \quad j = 1, 2, \dots, N$
  - m is the mean of the data set
  - W_1, W_2, ... are the component weights
  - Y_1, Y_2, ... are the eigenvectors
• Works for numeric data only.
• Used when the number of dimensions is large.
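A small numpy sketch of the email example above (numpy is assumed; the random
word counts and the sizes N = 500, k = 100, c = 10 are purely illustrative).
It follows the formula X_j ≈ m + W_1 Y_1 + ... + W_c Y_c.

# PCA sketch for the email example: N documents, k = 100 word-count features,
# reduced to c = 10 principal components.
import numpy as np

rng = np.random.default_rng(0)
N, k, c = 500, 100, 10
X = rng.poisson(lam=2.0, size=(N, k)).astype(float)   # fake word counts

m = X.mean(axis=0)                                    # mean of the data set
C = np.cov(X, rowvar=False)                           # k x k covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)                  # eigenvalues in ascending order
Y = eigvecs[:, ::-1][:, :c]                           # top-c eigenvectors Y1 ... Yc

W = (X - m) @ Y                                       # per-document weights W1 ... Wc
X_approx = m + W @ Y.T                                # X_j ~= m + W1*Y1 + ... + Wc*Yc
print("reduced data shape:", W.shape)                 # (500, 10) instead of (500, 100)
print("relative reconstruction error:",
      round(np.linalg.norm(X - X_approx) / np.linalg.norm(X), 3))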
Principal Component Analysis

• See online tutorials such as
  http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

[Scatter plot of the data in the (X1, X2) plane with the two principal
directions drawn in: Y1 is the first eigenvector, Y2 the second. Key
observation: the variance along Y1 is largest, so Y2 is ignorable.]
Principal Component Analysis (PCA)

• Principal Component Analysis: project onto the subspace with the most
  variance (unsupervised; doesn't take y into account).
Principal Component Analysis: one attribute first

• Question: how much spread is in the data along the axis (distance to the
  mean)?
• Variance = (standard deviation)^2:

    $s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$

    Temperature: 42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30
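The same spread computed in code (numpy is assumed; note that the slide's
formula divides by n - 1):

# Variance of the single Temperature attribute above.
import numpy as np

temperature = np.array([42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30])
s2 = ((temperature - temperature.mean()) ** 2).sum() / (len(temperature) - 1)
print("variance:", round(s2, 2))
print("same via numpy:", round(temperature.var(ddof=1), 2))   # ddof=1 -> divide by n-1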
Now consider two dimensions

Covariance measures the correlation between X and Y:
• cov(X, Y) = 0: independent
• cov(X, Y) > 0: move in the same direction
• cov(X, Y) < 0: move in opposite directions

    $\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$

    X = Temperature: 40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30, 40, 30
    Y = Humidity:    90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90, 70, 90
More than two attributes: covariance matrix

• Contains the covariance values between all possible pairs of dimensions
  (= attributes):

    $C^{n \times n} = (c_{ij}), \quad c_{ij} = \mathrm{cov}(\mathrm{Dim}_i, \mathrm{Dim}_j)$

• Example for three attributes (x, y, z):

    $C = \begin{pmatrix} \mathrm{cov}(x,x) & \mathrm{cov}(x,y) & \mathrm{cov}(x,z) \\ \mathrm{cov}(y,x) & \mathrm{cov}(y,y) & \mathrm{cov}(y,z) \\ \mathrm{cov}(z,x) & \mathrm{cov}(z,y) & \mathrm{cov}(z,z) \end{pmatrix}$
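Computing cov(X, Y) and the full 2 x 2 covariance matrix for the
Temperature/Humidity columns above (numpy is assumed):

# Covariance between the Temperature and Humidity columns listed above.
import numpy as np

temp = np.array([40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30, 40, 30])
humi = np.array([90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90, 70, 90])

cov_xy = ((temp - temp.mean()) * (humi - humi.mean())).sum() / (len(temp) - 1)
print("cov(X, Y) =", round(cov_xy, 2))     # positive: the two move together

C = np.cov(np.vstack([temp, humi]))        # 2 x 2 matrix: variances on the diagonal,
print(C)                                   # cov(X, Y) off the diagonal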
Background: eigenvalues and eigenvectors

• Eigenvectors e: $C e = \lambda e$
• How to calculate e and λ:
  - Calculate $\det(C - \lambda I)$; this yields a polynomial of degree n.
  - Determine the roots of $\det(C - \lambda I) = 0$; the roots are the
    eigenvalues λ.
• Check out any math book, such as Elementary Linear Algebra by Howard Anton
  (publisher: John Wiley & Sons), or any math package such as MATLAB.
Steps of PCA

• Calculate the eigenvalues λ and eigenvectors e of the covariance matrix C:
  - Eigenvalue $\lambda_j$ corresponds to the variance on component j.
  - Thus, sort by $\lambda_j$.
• Take the first n eigenvectors $e_i$, where n is the number of top
  eigenvalues.
  - These are the directions with the largest variances.

    $\begin{pmatrix} y_{11} \\ y_{12} \\ \vdots \\ y_{1n} \end{pmatrix} = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix} \begin{pmatrix} x_{11} - \bar{x}_1 \\ x_{12} - \bar{x}_2 \\ \vdots \\ x_{1n} - \bar{x}_n \end{pmatrix}$

    (each $e_i$ is an eigenvector, used as a row of the transformation)
An Example

Mean1 = 24.1, Mean2 = 53.8

    X1    X2    X1'    X2'
    19    63    -5.1    9.25
    39    74    14.9   20.25
    30    87     5.9   33.25
    30    23     5.9  -30.75
    15    35    -9.1  -18.75
    15    43    -9.1  -10.75
    15    32    -9.1  -21.75
    30    73     5.9   19.25

[Scatter plots of the original data (X1, X2) and of the mean-centered data
(X1', X2')]
Covariance Matrix

    $C = \begin{pmatrix} 75 & 106 \\ 106 & 482 \end{pmatrix}$

• Using MATLAB, we find out:
  - Eigenvectors:
    e1 = (-0.98, -0.21), with eigenvalue λ1 = 51.8
    e2 = (0.21, -0.98), with eigenvalue λ2 = 560.2
• Thus the second eigenvector is more important!
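The same example reproduced with numpy (assumed; the slide uses MATLAB). The
exact eigenvalues and the signs of the eigenvectors depend on whether the
covariance is normalized by n or n - 1 and on solver conventions, so the
printed numbers may not match the slide exactly.

# Reproducing the worked example with numpy.
import numpy as np

X = np.array([[19, 63], [39, 74], [30, 87], [30, 23],
              [15, 35], [15, 43], [15, 32], [30, 73]], dtype=float)

m = X.mean(axis=0)                        # roughly (24.1, 53.8)
Xc = X - m                                # the centered columns X1', X2'
C = (Xc.T @ Xc) / len(X)                  # 2 x 2 covariance (normalized by n here)
print("covariance matrix:\n", C.round(1))

eigvals, eigvecs = np.linalg.eigh(C)      # ascending eigenvalues
print("eigenvalues:", eigvals.round(1))
e_top = eigvecs[:, -1]                    # direction with the largest variance
print("dominant eigenvector:", e_top.round(2))

y = Xc @ e_top                            # keep only one dimension
print("projected data:", y.round(2))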
If we only keep one dimension: e2

• We keep the dimension of e2 = (0.21, -0.98).
• We can obtain the final (one-dimensional) data as

    $y_i = \begin{pmatrix} x_{i1} & x_{i2} \end{pmatrix} \begin{pmatrix} 0.21 \\ -0.98 \end{pmatrix} = 0.21\, x_{i1} - 0.98\, x_{i2}$

  (applied here to the mean-centered values X1', X2' from the example):

    y_i: -10.14, -16.72, -31.35, 31.374, 16.464, 8.624, 19.404, -17.63
Using MATLAB to figure it out

[MATLAB screenshot: computing the covariance matrix and its eigenvectors]

PCA in Weka

[Weka screenshot: running PCA on the weather data]
Weather Data from UCI Dataset (comes with the Weka package)
@relation weather

@attribute outlook {sunny, overcast, rainy}


@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
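
A rough code parallel to the Weka PCA screenshots, applied only to the two
numeric attributes of this file (scikit-learn is assumed; Weka's own PCA
handles the nominal attributes differently):

# PCA on the two numeric attributes of this file (temperature, humidity).
import numpy as np
from sklearn.decomposition import PCA

data = np.array([[85, 85], [80, 90], [83, 86], [70, 96], [68, 80],
                 [65, 70], [64, 65], [72, 95], [69, 70], [75, 80],
                 [75, 70], [72, 90], [81, 75], [71, 91]], dtype=float)

pca = PCA(n_components=2).fit(data)
print("explained variance ratio:", pca.explained_variance_ratio_.round(2))
print("principal directions:\n", pca.components_.round(2))
print("transformed data (first 3 rows):\n", pca.transform(data)[:3].round(2))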

Summary of PCA

• PCA is used for reducing the number of numerical attributes.
• The key is in the data transformation:
  - Adjust the data by the mean.
  - Find the eigenvectors of the covariance matrix.
  - Transform the data.
• Note: PCA produces only linear combinations of the data (weighted sums of
  the original attributes).
Summary

• Data preparation is a big issue for data mining.
• Data preparation includes transformations such as:
  - Data sampling and feature selection
  - Discretization
  - Missing value handling
  - Incorrect value handling
  - Feature selection and feature extraction
