
Machine Learning | YBI Foundation

Encoding Concept Notes

Categorical variables are usually represented as strings or categories and take on a finite set of possible values. Many machine learning algorithms cannot work with categorical data directly, so the categories must be converted into numbers. This is required for both input and output variables that are categorical.

Further, categorical variables can be divided into two types: Nominal (no particular order) and Ordinal (inherently ordered).

1. Ordinal Data: The categories have an inherent order. While encoding ordinal data, one should retain the information about that order. For example, a person's qualification (schooling, graduate, post graduate, and so on) may decide whether the person is suitable for a post, and these qualifications are ordered from schooling as the least to post graduation as the highest.
2. Nominal Data: The categories do not have an inherent order. While encoding nominal data, we only have to consider the presence or absence of a feature; no notion of order is present. For example, consider the city a person lives in: it is important to retain where the person lives, but there is no order or sequence, and living in Delhi is neither more nor less than living in Bangalore. (A small example of both types follows this list.)
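
To make the distinction concrete, here is a minimal sketch of a toy dataset with one ordinal column and one nominal column. The column names and values are hypothetical, chosen only for illustration:

import pandas as pd

# Hypothetical toy data: 'qualification' is ordinal, 'city' is nominal
df = pd.DataFrame({
    "qualification": ["schooling", "graduate", "post graduate", "graduate"],
    "city": ["Delhi", "Bangalore", "Delhi", "Mumbai"],
})
print(df)

The sections below encode columns like these with ordinal, one-hot and label encoding.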

Ordinal Encoding

We do ordinal encoding to ensure that the encoded variable retains its ordinal nature. If we consider a temperature scale ordered from Cold to Very Hot, ordinal encoding should assign values as Cold(0) < Warm(1) < Hot(2) < Very Hot(3). Usually, ordinal encoding is done starting from 0. By default, however, Scikit-learn's ordinal encoding function assigns codes in alphabetically sorted order, giving Cold(0), Hot(1), Very Hot(2) and Warm(3), so the intended order must be supplied explicitly.
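
A minimal sketch with scikit-learn's OrdinalEncoder, using the temperature example above; passing the categories parameter preserves the intended order instead of the default alphabetical one:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"temperature": ["Cold", "Warm", "Hot", "Very Hot"]})

# Default: categories sorted alphabetically -> Cold(0), Hot(1), Very Hot(2), Warm(3)
default_enc = OrdinalEncoder()
print(default_enc.fit_transform(df[["temperature"]]).ravel())   # [0. 3. 1. 2.]

# Explicit order keeps the ordinal meaning: Cold(0) < Warm(1) < Hot(2) < Very Hot(3)
ordered_enc = OrdinalEncoder(categories=[["Cold", "Warm", "Hot", "Very Hot"]])
print(ordered_enc.fit_transform(df[["temperature"]]).ravel())   # [0. 1. 2. 3.]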

One Hot Encoding or Dummy Variable


In this method, we map each category to a vector of 1s and 0s denoting the presence or absence of that category. The number of new columns depends on the number of categories of the feature.

This method produces many columns, which can slow down learning significantly if the feature has a very high number of categories. Pandas has a get_dummies function, which is quite easy to use. Scikit-learn has OneHotEncoder for the same purpose, but it returns an encoded array rather than labelled feature columns, so some extra code is needed.
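
A minimal sketch of both approaches, reusing the hypothetical city column (the sparse_output argument assumes scikit-learn 1.2 or newer; older versions use sparse=False):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"city": ["Delhi", "Bangalore", "Delhi", "Mumbai"]})

# Pandas: returns a labelled DataFrame of dummy columns directly
print(pd.get_dummies(df["city"], prefix="city"))

# Scikit-learn: returns a plain array; column names must be attached manually
enc = OneHotEncoder(sparse_output=False)
encoded = enc.fit_transform(df[["city"]])
encoded_df = pd.DataFrame(encoded, columns=enc.get_feature_names_out(["city"]))
print(encoded_df)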

One Hot Encoding is very popular. We can represent all categories with N-1 columns (N = number of categories), since the dropped category is implied when all the other columns are 0. Usually, for regression, we use N-1 columns (drop the first or last column of the one-hot-encoded feature). The reason: if the model includes an intercept and all N dummy variables, the dummy columns add up (row-wise) to the intercept column, and this exact linear combination makes the design matrix singular, preventing its inverse from being computed.
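
A minimal sketch of dropping one column, using pandas' drop_first flag and the equivalent drop="first" option of OneHotEncoder. As the next paragraphs explain, the N-1 form suits linear models, while tree-based models are usually given all N columns:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"city": ["Delhi", "Bangalore", "Delhi", "Mumbai"]})

# N-1 dummies for linear models: the dropped category is the all-zeros row
print(pd.get_dummies(df["city"], prefix="city", drop_first=True))

# Same idea in scikit-learn (sparse_output assumes scikit-learn >= 1.2)
enc = OneHotEncoder(drop="first", sparse_output=False)
print(enc.fit_transform(df[["city"]]))

# For tree-based models, keep all N columns (drop=None, the default)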

Still, for tree-based algorithms, the recommendation is to use all N columns without dropping any, because a tree splits on one variable at a time and never sees the full set of dummies together. One-hot encoding with N-1 binary variables should be used in linear regression to ensure the correct number of degrees of freedom (N-1). Linear regression has access to all of the features while it is being trained and therefore examines the whole set of dummy variables together; this is why N-1 binary variables give complete information about (completely represent) the original categorical variable to linear regression. The same approach can be adopted for any machine learning algorithm that looks at ALL the features simultaneously during training, for example support vector machines, neural networks and clustering algorithms.

If we drop a column, tree-based methods can never consider that dropped category when choosing a split. Thus, when using categorical variables in a tree-based learning algorithm, it is good practice to encode them into all N binary variables and not drop any.


Label Encoding
Labels can be words or numbers. Usually, the training data is labeled with words to make it readable.
Label encoding converts word labels into numbers to let algorithms work on them.
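
A minimal sketch with scikit-learn's LabelEncoder, assuming a hypothetical target column of word labels:

from sklearn.preprocessing import LabelEncoder

y = ["cat", "dog", "cat", "bird"]           # hypothetical word labels

le = LabelEncoder()
y_encoded = le.fit_transform(y)             # codes assigned in sorted order
print(y_encoded)                            # [1 2 1 0] -> bird=0, cat=1, dog=2
print(le.inverse_transform(y_encoded))      # back to the original words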


Encoding Interview Preparation

Q. What is the difference between Label Encoding and One-Hot Encoding?

How is the categorical data treated?
- Label Encoding: Labels the data as numbers.
- One-Hot Encoding: Converts the data into dummy variables, i.e., binary columns having 1 or 0 as values.

Example
- Label Encoding: Male: 1, Female: 2
- One-Hot Encoding: Var_Male: 1 and 0 / Var_Female: 0 and 1

How to use it in Python?
- Label Encoding: Via the sklearn package's function called LabelEncoder.
- One-Hot Encoding: Via sklearn's function OneHotEncoder or pandas' built-in function pd.get_dummies.

Limitation of the method
- Label Encoding: Changes nominal data into ordinal; the numbers given to the categories act as weights, so the machine gives those values unintended importance.
- One-Hot Encoding: Creates an extra column for each category, which increases the dimensions of the data.

Solution available
- Label Encoding: Employ dummy-variable creation or the One-Hot Encoding technique.
- One-Hot Encoding: Use the various methods available for dimensionality reduction.

