Information Gain - Towards Data Science

This document discusses several machine learning concepts including decision trees, entropy, feature selection, and different feature selection methods. It provides an in-depth overview of decision tree fundamentals and how they work to partition data based on features. It also explains key feature selection concepts like entropy, information gain, and the need for feature selection to improve model performance. Finally, it describes different feature selection methods including filter, wrapper and embedded methods and provides examples of how each works.



Towards Data Science

A Medium publication sharing concepts, ideas and codes.


Information Gain

Huy Bui · Mar 31, 2020

Decision Tree Fundamentals


Learning about Gini Impurity, Entropy, and how to construct a decision tree


When talking about decision trees, I always imagine a list of questions I would ask
my girlfriend when she does not know what she wants for dinner: Do you want
something with noodles? How much do you want to spend? Asian or Western?
Healthy or junk food?

Making a list of questions to narrow down the options is essentially the idea behind
decision trees. More formally, a decision tree is an algorithm that partitions
observations into groups of similar data points based on their features.
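
To make this concrete, here is a minimal sketch, not from the original article, that builds such a tree with scikit-learn on the bundled Iris data set; the entropy criterion and the depth limit are illustrative choices.

# Minimal sketch: fit a decision tree that partitions observations by their features.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# criterion="entropy" grows the tree by maximizing information gain at each split;
# criterion="gini" would use Gini impurity instead.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(export_text(tree))                       # the learned list of "questions"
print("test accuracy:", tree.score(X_test, y_test))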

A decision tree is a supervised learning model that has a tree-like…
Read more · 7 min read

Renu Khandelwal · Oct 24, 2019

Feature Selection: Identifying the best input features


In this article, we will understand what feature selection is, how it differs from
dimensionality reduction, how feature importance helps, and different techniques,
such as the filter, wrapper, and embedded methods, for identifying the best features,
with code in Python.

What is Feature Selection?


Feature selection is also referred to as attribute selection or variable selection
and is part of feature engineering. It is the process of selecting a subset of the most
relevant attributes or features in the data set for predictive modeling.


The selected features help predictive models to identify hidden business insights.

If we need to predict the salary of people in IT, then based on our common
understanding we would need the number of years of experience, skill set, work
location, and current designation. These are a few of the key features helpful for
salary prediction. If the data set contains the height of the person, we know that
feature is irrelevant to salary prediction and hence should not be included in the
selected features.

Feature selection is the process of deciding which relevant original features
to include and which irrelevant features to exclude for predictive modeling.

Difference between Feature Selection and Dimensionality Reduction


Both feature selection and dimensionality reduction aim to reduce the number of
attributes or features in the data set.

The key difference is that in feature selection we do not change the original
features, whereas in dimensionality reduction we create new features from the
original ones. This transformation of features using dimensionality reduction is
often irreversible.

Feature selection is based on certain statistical methods like filter, wrapper and
embedded methods that we will discuss in this article.

For dimensionality reduction, we use techniques like Principal Component Analysis (PCA).
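
A minimal sketch of this distinction, assuming the breast-cancer data set bundled with scikit-learn as a stand-in for your own data: feature selection keeps a subset of the original columns unchanged, while PCA builds new components from all of them.

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Feature selection: keep 5 of the original 30 columns, values unchanged.
X_selected = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)

# Dimensionality reduction: build 5 brand-new components from all 30 columns.
X_reduced = PCA(n_components=5).fit_transform(X)

print(X_selected.shape, X_reduced.shape)       # both (569, 5), but with different meanings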

Need for Feature Selection


Helps train the model faster: with a reduced number of relevant features,
training is much faster.

Increases model interpretability and simplifies the model: it reduces the
complexity of the model by including only the most relevant features, which makes
it easier to interpret. This is very helpful in explaining the predictive model.

Improves accuracy of the model: we include only features that are relevant
for our prediction, which increases the accuracy of the model. Irrelevant features
introduce noise and reduce accuracy.

Reduces overfitting: overfitting occurs when the predictive model does not
generalize well to test or unseen data. To reduce overfitting, we need to remove
noise from the data set and include only the features that most influence the
prediction. Noise comes from irrelevant features; when a predictive model learns
this noise during training, it will not generalize well to unseen data.

Different methods for Feature Selection


Filter

Wrapper

Embedded methods

Filter method for feature selection


The filter method ranks each feature based on some univariate metric and then
selects the highest-ranking features. Some of the univariate metrics are:

Variance: removes constant and quasi-constant features.

Chi-square: used for classification; it is a statistical test of independence that
determines whether two variables are dependent.

Correlation coefficients: remove duplicate (highly correlated) features.

Information gain or mutual information: assesses the dependency of an
independent variable on the target variable. In other words, it determines the
ability of the independent features to predict the target variable.
Advantages of Filter methods
Filter methods are model agnostic

Rely entirely on features in the data set

Computationally very fast

Based on different statistical methods


The disadvantage of Filter methods
The filter method looks at individual features to identify their relative
importance. A feature may not be useful on its own but may be an important
influencer when combined with other features. Filter methods may miss such
features.
Filter criteria for selecting the best feature
Select independent features with

High correlation with the target variable

Low correlation with other independent variables

High information gain or mutual information with the target variable
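
The criteria above can be applied in a few lines with scikit-learn; this is a minimal sketch, again using the bundled breast-cancer data set as a stand-in for your own feature matrix, and the thresholds are illustrative.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif

data = load_breast_cancer()
X, y = data.data, data.target

# Step 1: drop constant and quasi-constant features (variance filter).
var_filter = VarianceThreshold(threshold=0.01).fit(X)
X_var = var_filter.transform(X)

# Step 2: rank the remaining features by mutual information with the target
# and keep the top 10 -- a model-agnostic, univariate ranking.
mi_filter = SelectKBest(mutual_info_classif, k=10).fit(X_var, y)

kept = np.array(data.feature_names)[var_filter.get_support()][mi_filter.get_support()]
print("selected features:", list(kept))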

Wrapper method for feature selection


The wrapper method searches for the best subset of input features to predict the
target variable. It selects the features that provide the best accuracy of the model.
Wrapper methods use inferences based on the previous model to decide if a new
feature needs to be added or removed.

Wrapper methods include:

Exhaustive search: evaluates all possible combinations of input features to find
the subset that gives the best accuracy for the selected model. It becomes
computationally very expensive as the number of input features grows.

Forward selection: start with an empty feature set, keep adding one input
feature at a time, and evaluate the accuracy of the model. This process continues
until we reach a certain accuracy with a predefined number of features.

Backward selection: start with all the features and keep removing one feature
at a time, evaluating the accuracy of the model. The feature set that yields the
best accuracy is retained.

Always evaluate the accuracy of the model on the test data set.
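
Forward selection as described above can be sketched with scikit-learn's SequentialFeatureSelector; the estimator, scoring metric, and number of features to keep below are illustrative assumptions, not prescriptions.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=5000)

# Forward selection: start from an empty set and greedily add the feature that
# most improves cross-validated accuracy, until 8 features have been chosen.
sfs = SequentialFeatureSelector(model, n_features_to_select=8,
                                direction="forward", scoring="accuracy", cv=5)
sfs.fit(X_train, y_train)

# Always evaluate the resulting feature subset on the held-out test data.
model.fit(sfs.transform(X_train), y_train)
print("test accuracy:", model.score(sfs.transform(X_test), y_test))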
Advantages
Models dependencies among the input features

Dependent on the model selected

Selects the feature subset that gives the highest model accuracy
Disadvantages:
Computationally very expensive, as training happens on every evaluated
combination of input features

Not model agnostic

Embedded method for feature selection


Embedded methods use the qualities of both filter and wrapper feature selection
methods. Feature selection is embedded in the machine learning algorithm.

Filter methods do not incorporate learning and are only about feature selection.
Wrapper methods use a machine learning algorithm to evaluate subsets of
features without incorporating knowledge about the specific structure of the
classification or regression function and can, therefore, be combined with any
learning machine.

Embedded feature selection algorithms include

Decision Tree

Regularization: L1 (Lasso) and L2 (Ridge) regularization

By fitting the model using these machine learning techniques, we obtain feature
importances that can be used to select the most relevant features.
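
As a minimal sketch of the embedded approach, the L1 (Lasso) penalty can be combined with scikit-learn's SelectFromModel, which keeps only the features whose fitted coefficients are non-zero; the alpha value here is an assumption for illustration.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)   # Lasso is sensitive to feature scale
y = data.target

# Embedded selection: the L1 penalty drives irrelevant coefficients to exactly zero
# while the model is being fitted, so selection happens inside training itself.
selector = SelectFromModel(Lasso(alpha=0.05)).fit(X, y)

kept = np.array(data.feature_names)[selector.get_support()]
print("features kept by Lasso:", list(kept))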

Read more on L1 and L2 regularization here

In the next article, we will implement some of the feature selection methods in
Python using the filter method.

References:
http://ijcsit.com/docs/Volume%202/vol2issue3/ijcsit2011020322.pdf

https://arxiv.org/pdf/1907.07384.pdf

http://people.cs.pitt.edu/~iyad/DR.pdf

https://link.springer.com/chapter/10.1007%2F978-3-540-35488-8_6


ayşe bilge gündüz · Nov 11, 2019

Machine Learning 101: ID3 Decision Tree and Entropy Calculation (1)
This series contains the machine learning course notes I collected during the
coursework phase of my Ph.D.
Training Approaches
Machine learning training approaches divide into three:

1. Supervised Learning

2. Unsupervised Learning

3. Reinforcement Learning

ID3 Decision Tree


This approach is a supervised, non-parametric type of decision tree.

It is mostly used for classification and regression.

A tree consists of internal decision nodes and terminal leaves, and the terminal
leaves hold the outputs. The outputs are class values in classification and numeric
values in regression.

The aim of splitting data into subsets in a decision tree is to make each subset as
homogeneous as possible. The disadvantage of decision tree algorithms is that
they are greedy approaches. A greedy algorithm is any algorithm that follows
the problem-solving heuristic…
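
For reference, the entropy that ID3 uses to measure how homogeneous a subset is can be computed in a few lines; this is a generic sketch, not code from the article.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy: H(S) = -sum(p_i * log2(p_i)) over the class proportions p_i.
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

# A perfectly homogeneous subset has entropy 0; a 50/50 split has entropy 1.
print(entropy(["yes", "yes", "yes", "yes"]))   # 0.0
print(entropy(["yes", "yes", "no", "no"]))     # 1.0
print(entropy(["yes"] * 9 + ["no"] * 5))       # about 0.940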
Read more · 5 min read


Lukas Molzberger · Sep 23, 2019


Using Information Gain for the Unsupervised Training of Excitatory Neurons
Looking for a biologically more plausible way to train a neural network.

Traditionally, artificial neural networks have been trained using the delta rule and
backpropagation. But this contradicts findings from neuroscience about how the
brain functions: there simply is no gradient error signal propagated backwards
through biological neurons (see here and here). Besides, the human brain can find
patterns in its audiovisual training data by itself, without the need for training
labels. When a parent shows a cat to a child, the child doesn't use this information
to learn every detail of what…
Read more · 11 min read


Azika Amelia · Sep 6, 2019

Decision tree: Part 2/2


Calculating Entropy and Information gain by hand
This post is the second in the "Decision tree" series. The first post develops an
intuition about decision trees and gives you an idea of where to draw a decision
boundary; in this post, we'll see how a decision tree does it.

Spoiler: It involves some mathematics.

We'll be using a really tiny dataset for easy visualization and follow-through. However,
in practice, such datasets would definitely overfit. This dataset decides whether you
should buy a car given 3 features: Age, Mileage, and whether or not the car is road tested.
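
As a preview of that calculation, here is a small sketch of information gain computed by hand; the toy labels below are illustrative only, not the article's actual data set.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    # IG = H(parent) - sum over feature values of (|child| / |parent|) * H(child)
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        child = [l for l, v in zip(labels, feature_values) if v == value]
        remainder += (len(child) / total) * entropy(child)
    return entropy(labels) - remainder

# Illustrative toy data: should we buy the car, split on whether it is road tested?
buy         = ["yes", "yes", "no", "no", "yes", "no"]
road_tested = ["yes", "yes", "no", "no", "yes", "yes"]

print("IG(buy | road_tested) =", round(information_gain(buy, road_tested), 3))   # 0.459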
Read more · 4 min read

