Encoding Notes
Categorical variables are usually represented as strings or categories and take values from a finite set of possible values. Many machine learning algorithms cannot work with categorical data directly, so the categories must be converted into numbers. This is required for both input and output variables that are categorical.
Further, categorical variables can be divided into two types: Nominal (no particular order) and Ordinal (ordered).
1. Ordinal Data: The categories have an inherent order. While encoding ordinal data, one should retain the information about the order in which the categories appear. For example, a person's qualification (schooling, graduate, post graduate, etc.) may decide whether the person is suitable for a post, and these qualifications are ordered, from schooling as the lowest to post graduation as the highest.
2. Nominal Data: The categories do not have an inherent order. While encoding nominal data, we only have to consider the presence or absence of a feature; no notion of order is present. For example, consider the city a person lives in. It is important to retain where the person lives, but there is no order or sequence: living in Delhi is neither more nor less than living in Bangalore.
Ordinal Encoding
We do ordinal encoding to ensure the encoding of a variable retains its ordinal nature. If we consider a temperature scale as the order, the ordinal values should run from Cold to Very Hot: ordinal encoding assigns values as Cold(0) < Warm(1) < Hot(2) < Very Hot(3). Usually, ordinal encoding starts from 0. By default, however, Scikit-learn's ordinal encoding sorts the categories alphabetically and assigns Cold(0), Hot(1), Very Hot(2), and Warm(3).
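As a minimal sketch (the "Temperature" column name is an illustrative assumption, not from these notes), Scikit-learn's OrdinalEncoder can be given the category order explicitly to avoid the alphabetical default:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical example data; the column name is an assumption.
df = pd.DataFrame({"Temperature": ["Cold", "Warm", "Hot", "Very Hot"]})

# Default behaviour: categories sorted alphabetically ->
# Cold(0), Hot(1), Very Hot(2), Warm(3)
alphabetical = OrdinalEncoder().fit_transform(df[["Temperature"]])

# Passing the order explicitly preserves the ordinal relationship:
# Cold(0) < Warm(1) < Hot(2) < Very Hot(3)
ordered = OrdinalEncoder(
    categories=[["Cold", "Warm", "Hot", "Very Hot"]]
).fit_transform(df[["Temperature"]])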
One Hot Encoding
This method produces many columns, which can slow down learning significantly if the number of categories for the feature is very high. Pandas has the get_dummies function, which is quite easy to use. Scikit-learn has OneHotEncoder for this purpose, but it does not directly create named feature columns (additional code is needed).
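A short sketch comparing the two (the "City" column and its values are illustrative assumptions; the sparse_output parameter assumes scikit-learn 1.2 or newer):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical example data.
df = pd.DataFrame({"City": ["Delhi", "Bangalore", "Mumbai", "Delhi"]})

# pandas: returns a DataFrame with one named column per category.
dummies = pd.get_dummies(df["City"], prefix="City")

# scikit-learn: returns a plain array; the column names must be
# recovered separately with get_feature_names_out.
enc = OneHotEncoder(sparse_output=False)  # sparse_output needs sklearn >= 1.2
arr = enc.fit_transform(df[["City"]])
onehot = pd.DataFrame(arr, columns=enc.get_feature_names_out(["City"]))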
One hot encoding is very popular. We can represent all categories with N-1 columns (N = number of categories), since the all-zeros pattern is sufficient to encode the one category that is not included. Usually, for regression, we use N-1 columns (dropping the first or last column of the one-hot-encoded features). To explain: if the model includes an intercept and contains all N dummy variables, the dummy columns add up (row-wise) to the intercept column, and this linear combination prevents the matrix inverse from being computed (the design matrix is singular).
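The singularity is easy to verify numerically. In this minimal sketch (with an assumed "City" column), an intercept column plus all N dummies is rank-deficient:

import numpy as np
import pandas as pd

df = pd.DataFrame({"City": ["Delhi", "Bangalore", "Mumbai", "Delhi"]})
X = pd.get_dummies(df["City"]).astype(float)
X.insert(0, "intercept", 1.0)

# The N dummy columns sum row-wise to the intercept column, so the
# matrix has linearly dependent columns: rank 3 instead of 4 here.
print(np.linalg.matrix_rank(X.to_numpy()))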
Still, for classification with tree-based algorithms, the recommendation is to use all N columns without dropping any, as most tree-based algorithms build a tree by evaluating the available variables one split at a time. One hot encoding with N-1 binary variables should be used in linear regression to ensure the correct number of degrees of freedom (N-1). Linear regression has access to all of the features as it is being trained and therefore examines the whole set of dummy variables together. This means that N-1 binary variables give complete information about (completely represent) the original categorical variable to the linear regression. This approach can be adopted for any machine learning algorithm that looks at ALL the features simultaneously during training, for example, support vector machines and neural networks, as well as clustering algorithms.
If we drop one of the binary variables, a tree-based method can never consider that dropped category when splitting. Thus, if we use categorical variables in a tree-based learning algorithm, it is good practice to encode them into N binary variables and not drop any.
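The two recommendations side by side, as a hedged sketch (the "City" and "Rent" data are invented for illustration):

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "City": ["Delhi", "Bangalore", "Mumbai", "Delhi"],
    "Rent": [30, 25, 35, 28],  # invented target values
})

# Linear model: N-1 dummies (drop_first=True) avoid the dummy-variable trap.
X_linear = pd.get_dummies(df["City"], drop_first=True)
LinearRegression().fit(X_linear, df["Rent"])

# Tree model: keep all N dummies so every category can be split on directly.
X_tree = pd.get_dummies(df["City"])
DecisionTreeRegressor().fit(X_tree, df["Rent"])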
Label Encoding
Labels can be words or numbers. Usually, the training data is labeled with words to make it readable.
Label encoding converts word labels into numbers to let algorithms work on them.
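A minimal sketch with scikit-learn's LabelEncoder (the labels below are illustrative assumptions); note that it, too, assigns numbers in alphabetical order:

from sklearn.preprocessing import LabelEncoder

y = ["cat", "dog", "bird", "dog"]  # hypothetical word labels

le = LabelEncoder()
encoded = le.fit_transform(y)            # array([1, 2, 0, 2]): alphabetical
decoded = le.inverse_transform(encoded)  # back to the original words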