Data Preparation-All Pds

The document discusses data preparation techniques for encoding categorical values in Python, using the UCI Machine Learning Repository's Automobile Data Set as an example. It outlines various approaches such as find and replace, label encoding, one hot encoding, and custom binary encoding, detailing the steps and code snippets for each method. The document emphasizes the importance of choosing the right encoding method based on the data and analysis goals.

Uploaded by

Atiya Falak
Copyright
© © All Rights Reserved

2/20/2022

Big Data Analytics


Data Preparation
Muhammad Affan Alim

Data Preparation
• Guide to Encoding Categorical Values in Python

• As with many other aspects of the Data Science world, there is no
single answer on how to approach this problem.

• Each approach has trade-offs and a potential impact on the
outcome of the analysis.

The Data Set


• This guide uses a dataset from the UCI Machine Learning Repository. This particular
Automobile Data Set includes a good mix of categorical values as
well as continuous values and serves as a useful example that is
relatively easy to understand.

• Before we get started encoding the various values, we need to
import the data and do some minor cleanups. Fortunately,
pandas makes this straightforward:



>> import pandas as pd
>> import numpy as np

>> # Define the headers since the data does not have any
>> headers = ["symboling", "normalized_losses", "make", "fuel_type",
"aspiration","num_doors", "body_style", "drive_wheels",
"engine_location", "wheel_base", "length", "width", "height",
"curb_weight", "engine_type", "num_cylinders", "engine_size",
"fuel_system", "bore", "stroke", "compression_ratio",
"horsepower", "peak_rpm", "city_mpg", "highway_mpg",
"price"]

>> # Read in the CSV file and convert "?" to NaN
>> df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
header=None, names=headers, na_values="?")

>> df.head()
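
The effect of na_values="?" can be checked without downloading anything. The following is a minimal sketch, assuming a hypothetical three-row excerpt with only four of the columns (the rows and the variable names raw, sample and missing_per_column are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical three-row excerpt; the real file has 26 columns per row.
raw = io.StringIO(
    "3,?,alfa-romero,two\n"
    "2,164,audi,four\n"
    "1,?,dodge,?\n"
)

sample = pd.read_csv(
    raw,
    header=None,  # the file has no header row
    names=["symboling", "normalized_losses", "make", "num_doors"],
    na_values="?",  # treat "?" as a missing value
)

# Count the missing values that replaced each "?"
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Because the "?" entries become NaN, a column such as normalized_losses can be read as numeric instead of text.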



• The final check we want to do is see what data types we have:

>> df.dtypes

• Since this article will only focus on encoding the categorical
variables, we are going to include only the object columns in our
dataframe.

>> obj_df = df.select_dtypes(include=['object']).copy()


>> obj_df.head()
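
A small sketch of what select_dtypes plus copy does, assuming a hypothetical toy frame (toy and obj_only are illustrative names). The text columns are created with an explicit object dtype so the result does not depend on how a given pandas version infers string columns:

```python
import pandas as pd

# Hypothetical toy frame; text columns use an explicit object dtype.
toy = pd.DataFrame({
    "make": pd.Series(["audi", "bmw"], dtype=object),
    "price": [13950.0, 16430.0],
    "fuel_type": pd.Series(["gas", "diesel"], dtype=object),
})

# Keep only the object columns; copy() gives an independent frame.
obj_only = toy.select_dtypes(include=["object"]).copy()

# Changing the copy leaves the original frame untouched.
obj_only["make"] = obj_only["make"].str.upper()
print(list(obj_only.columns))
```

Working on a copy keeps later encoding steps from modifying the original dataframe.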

• Before going any further, there are a couple of null values in the
data that we need to clean up.
>> obj_df[obj_df.isnull().any(axis=1)]

• For the sake of simplicity, just fill in the missing values with “four”
(since that is the most common value):
>> obj_df["num_doors"].value_counts()

>> obj_df = obj_df.fillna({"num_doors": "four"})
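
The two cleanup steps can be tried on a toy frame. This is a sketch under the assumption of made-up rows; it also looks up the most common value with value_counts().idxmax() instead of typing it by hand, which is one possible variation and not the approach in the slides:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing num_doors value.
toy = pd.DataFrame({
    "make": ["dodge", "mazda", "audi", "bmw"],
    "num_doors": ["four", np.nan, "four", "two"],
})

# Rows that contain at least one null value
rows_with_nulls = toy[toy.isnull().any(axis=1)]

# value_counts() ignores NaN, so idxmax() returns the most common label.
most_common = toy["num_doors"].value_counts().idxmax()

# Fill only the num_doors column, leaving other columns alone.
filled = toy.fillna({"num_doors": most_common})
print(filled)
```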


Approach #1 - Find and Replace


• Before we go into some of the more “standard” approaches for
encoding categorical data, this data set highlights one potential
approach I’m calling “find and replace.”

• We have already seen that the num_doors data only includes 2 or 4
doors.
• The number of cylinders only includes 7 values and they are easily
translated to valid numbers:

>> obj_df["num_cylinders"].value_counts()

• For our uses, we are going to create a mapping dictionary that
contains each column to process as well as a dictionary of the
values to translate.

• Here is the complete dictionary for cleaning up the num_doors
and num_cylinders columns:

>> cleanup_nums = {"num_doors": {"four": 4, "two": 2},
"num_cylinders": {"four": 4, "six": 6, "five": 5, "eight": 8,
"two": 2, "twelve": 12, "three": 3}}

• To convert the columns to numbers, use replace:

>> obj_df = obj_df.replace(cleanup_nums)


>> obj_df.head()
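
The same nested-dictionary idea can also be applied one column at a time. The sketch below swaps in Series.map for replace, since map builds each column fresh from the mapped values; the toy rows and the name converted are assumptions for illustration:

```python
import pandas as pd

# Hypothetical toy data using the same word-to-number idea.
toy = pd.DataFrame({
    "num_doors": ["two", "four", "four"],
    "num_cylinders": ["four", "six", "twelve"],
})

# Outer keys are column names, inner dictionaries map text to numbers.
cleanup_nums = {
    "num_doors": {"four": 4, "two": 2},
    "num_cylinders": {"four": 4, "six": 6, "twelve": 12},
}

converted = toy.copy()
for column, mapping in cleanup_nums.items():
    # map() looks up every value of the column in the inner dictionary.
    converted[column] = converted[column].map(mapping)

print(converted.dtypes)
```

One difference worth knowing: map turns any value missing from the dictionary into NaN, whereas replace leaves unmatched values as they are.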

• The nice benefit to this approach is that pandas “knows” the
types of values in the columns, so the former object columns are now int64:

>> obj_df.dtypes

• While this approach may only work in certain scenarios, it is a very
useful demonstration of how to convert text values to numeric
when there is an “easy” human interpretation of the data. This
concept is also useful for more general data cleanup.


Approach #2 - Label Encoding


• Another approach to encoding categorical values is to use a
technique called label encoding. Label encoding is simply
converting each value in a column to a number.

• For example, the body_style column contains 5 different values.

• One trick you can use in pandas is to convert a column to a
category, then use those category values for your label encoding:
>> obj_df["body_style"] = obj_df["body_style"].astype('category')
>> obj_df.dtypes

>> obj_df["body_style_cat"] = obj_df["body_style"].cat.codes
>> obj_df.head()
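
A minimal sketch of the category trick on a hypothetical four-row column (the rows and the helper name code_to_label are made up). When a text column is converted with astype("category"), pandas orders the categories alphabetically, so the codes follow that order:

```python
import pandas as pd

# Hypothetical toy column with three distinct body styles.
toy = pd.DataFrame({"body_style": ["sedan", "hatchback", "sedan", "wagon"]})

# Convert to a categorical, then read the integer code of each row.
toy["body_style"] = toy["body_style"].astype("category")
toy["body_style_cat"] = toy["body_style"].cat.codes

# Lookup table from code back to the original label
code_to_label = dict(enumerate(toy["body_style"].cat.categories))
print(toy)
print(code_to_label)
```

Keeping a lookup like code_to_label around makes it possible to translate the numeric codes back to labels later in the analysis.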


Approach #3 - One Hot Encoding


• Label encoding has the advantage that it is straightforward but it
has the disadvantage that the numeric values can be
“misinterpreted” by the algorithms.
• For example, the value of 0 is obviously less than the value of 4 but
does that really correspond to the data set in real life?

• A common alternative approach is called one hot encoding (it
also goes by several other names).

• Despite the different names, the basic strategy is to convert each
category value into a new column and assign a 1 or 0 (True/False)
value to the column.

• This has the benefit of not weighting a value improperly but does
have the downside of adding more columns to the data set.

• We can look at the column drive_wheels where we have values of
4wd, fwd or rwd.

• By using get_dummies we can convert this to three columns with a
1 or 0 corresponding to the correct value:

>> pd.get_dummies(obj_df, columns=["drive_wheels"]).head()
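
To see the shape of the result, here is a sketch on a hypothetical three-row frame (the rows and the name encoded are illustrative). It passes dtype=int so the new columns hold 1 and 0, since the default dtype of the dummy columns can differ between pandas versions:

```python
import pandas as pd

# Hypothetical toy data with one row per drive_wheels value.
toy = pd.DataFrame({
    "make": ["audi", "subaru", "bmw"],
    "drive_wheels": ["fwd", "4wd", "rwd"],
})

# One new column per distinct value; dtype=int requests 1/0 values.
encoded = pd.get_dummies(toy, columns=["drive_wheels"], dtype=int)
print(encoded)
```

The original drive_wheels column is dropped and each row has exactly one 1 across the three new columns.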

• This function is powerful because you can pass as many category
columns as you would like and choose how to label the columns
using prefix.

• Proper naming will make the rest of the analysis just a little bit
easier.

>> pd.get_dummies(obj_df, columns=["body_style", "drive_wheels"],
prefix=["body", "drive"]).head()
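
A short sketch of how the prefix list lines up with the columns list, assuming a hypothetical two-row frame (the rows and the name encoded are made up). Each prefix is paired with the column in the same position:

```python
import pandas as pd

# Hypothetical toy data with two categorical columns.
toy = pd.DataFrame({
    "body_style": ["sedan", "wagon"],
    "drive_wheels": ["fwd", "rwd"],
})

# "body" labels the body_style dummies, "drive" the drive_wheels dummies.
encoded = pd.get_dummies(
    toy,
    columns=["body_style", "drive_wheels"],
    prefix=["body", "drive"],
)
print(list(encoded.columns))
```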


Approach #4 - Custom Binary Encoding


• In this particular data set, there is a column called engine_type that
contains several different values:
>> obj_df["engine_type"].value_counts()

• For the sake of discussion, maybe all we care about is whether or
not the engine is an Overhead Cam (OHC).
• In other words, the various versions of OHC are all the same for
this analysis.
• If this is the case, then we could use the str accessor plus np.where
to create a new column that indicates whether or not the car has an
OHC engine.

>> obj_df["OHC_Code"] = np.where(obj_df["engine_type"].str.contains("ohc"), 1, 0)
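
A sketch of the substring test on a hypothetical handful of engine types (the toy rows are made up for illustration). Any label containing "ohc" gets a 1 and everything else gets a 0:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column mixing OHC variants with other engine types.
toy = pd.DataFrame({"engine_type": ["dohc", "ohcv", "l", "rotor", "ohc"]})

# str.contains gives True/False per row; np.where turns that into 1/0.
toy["OHC_Code"] = np.where(toy["engine_type"].str.contains("ohc"), 1, 0)
print(toy)
```

If the column could contain missing values, passing na=False to str.contains is one way to have those rows treated as non-matches.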
