Data Preparation-All Pds

The document discusses data preparation techniques for encoding categorical values in Python, using the UCI Machine Learning Repository's Automobile Data Set as an example. It outlines various approaches such as find and replace, label encoding, one hot encoding, and custom binary encoding, detailing the steps and code snippets for each method. The document emphasizes the importance of choosing the right encoding method based on the data and analysis goals.

Uploaded by

Atiya Falak

2/20/2022

Big Data Analytics

Data Preparation
Muhammad Affan Alim

Data Preparation
• Guide to Encoding Categorical Values in Python


• As with many other aspects of the Data Science world, there is no single answer on how to encode categorical values.

• Each approach has trade-offs and a potential impact on the outcome of the analysis.

The Data Set

• The examples use a dataset from the UCI Machine Learning Repository. This particular Automobile Data Set includes a good mix of categorical and continuous values and serves as a useful example that is relatively easy to understand.

• Before we start encoding the various values, we need to import the data and do some minor cleanup. Fortunately, pandas makes this straightforward:


>> import pandas as pd
>> import numpy as np

>> # Define the headers since the data does not have any
>> headers = ["symboling", "normalized_losses", "make", "fuel_type",
              "aspiration", "num_doors", "body_style", "drive_wheels",
              "engine_location", "wheel_base", "length", "width", "height",
              "curb_weight", "engine_type", "num_cylinders", "engine_size",
              "fuel_system", "bore", "stroke", "compression_ratio",
              "horsepower", "peak_rpm", "city_mpg", "highway_mpg",
              "price"]

>> # Read in the CSV file and convert "?" to NaN
>> df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
                    header=None, names=headers, na_values="?")

>> df.head()
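
The effect of na_values="?" can be checked without downloading the file. The following is a minimal sketch on a hypothetical three-row sample in the same headerless layout; the rows, the shortened column list, and the names sample, sample_headers and sample_df are made up for illustration and are not taken from the real data set.

```python
import io

import pandas as pd

# Hypothetical sample in the same headerless, comma-separated layout.
# The values are invented for illustration only.
sample = io.StringIO(
    "3,?,alfa-romero,gas,two\n"
    "2,164,audi,gas,four\n"
    "1,?,dodge,gas,?\n"
)
sample_headers = ["symboling", "normalized_losses", "make", "fuel_type", "num_doors"]

# header=None: the file has no header row; names= supplies the column names;
# na_values="?" tells pandas to treat "?" as a missing value (NaN).
sample_df = pd.read_csv(sample, header=None, names=sample_headers, na_values="?")

print(sample_df)
print(sample_df.isna().sum())  # count of missing values per column
```

Because the "?" markers become NaN, a column such as normalized_losses is read as numeric instead of text, which is what keeps it out of the object columns later on.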


• The final check we want to do is see what data types we have:

>> df.dtypes

• Since this article focuses only on encoding the categorical variables, we include only the object columns in our dataframe:

>> obj_df = df.select_dtypes(include=['object']).copy()

>> obj_df.head()
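
As a small illustration of what select_dtypes keeps and drops, here is a sketch on a made-up three-row frame. The frame demo and its values are hypothetical; dtype=object is set explicitly as an assumption, so that the text columns are object columns regardless of how a given pandas version stores strings by default.

```python
import pandas as pd

# Hypothetical frame mixing text and numeric columns.
demo = pd.DataFrame({
    "make": pd.Series(["audi", "bmw", "dodge"], dtype=object),
    "price": [13950.0, 16430.0, 5572.0],
    "num_doors": pd.Series(["four", "two", "four"], dtype=object),
})

# Keep only the object (text) columns; .copy() gives an independent frame,
# so later edits do not touch the original.
text_cols = demo.select_dtypes(include=["object"]).copy()

print(text_cols.columns.tolist())
```

Only make and num_doors survive the selection; the numeric price column is left out.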

• Before going any further, there are a couple of null values in the
data that we need to clean up.
>> obj_df[obj_df.isnull().any(axis=1)]
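
The null-row filter above combines two steps, which the following sketch spells out on a hypothetical frame (demo and its values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing num_doors value.
demo = pd.DataFrame({
    "make": ["dodge", "mazda", "audi"],
    "num_doors": [np.nan, "four", "two"],
})

# Step 1: isnull() returns a True/False frame of the same shape.
null_mask = demo.isnull()

# Step 2: any(axis=1) collapses each row to a single flag:
# True if at least one column in that row is null.
rows_with_nulls = demo[null_mask.any(axis=1)]

print(rows_with_nulls)
```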

• For the sake of simplicity, just fill in the missing values with "four" (since that is the most common value):
>> obj_df["num_doors"].value_counts()

>> obj_df = obj_df.fillna({"num_doors": "four"})
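
Instead of reading the most common value off the value_counts() output by hand, it can be looked up in code. This is a minimal sketch on hypothetical data; demo, most_common and filled are illustrative names, not part of the original slides.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry.
demo = pd.DataFrame({"num_doors": ["four", "two", "four", np.nan]})

# value_counts() ignores NaN by default; idxmax() returns the label
# with the highest count.
most_common = demo["num_doors"].value_counts().idxmax()

# Passing a dict to fillna restricts the fill to the named column.
filled = demo.fillna({"num_doors": most_common})

print(most_common)
print(filled)
```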

Approach #1 - Find and Replace

• Before we go into some of the more “standard” approaches for encoding categorical data, this data set highlights one potential approach I’m calling “find and replace”.

• We have already seen that the num_doors data only includes 2 or 4 doors.

• The number of cylinders only includes 7 values, and they are easily translated to valid numbers:


>> obj_df["num_cylinders"].value_counts()

• For our uses, we are going to create a mapping dictionary that contains each column to process as well as a dictionary of the values to translate.

• Here is the complete dictionary for cleaning up the num_doors and num_cylinders columns:

>> cleanup_nums = {"num_doors": {"four": 4, "two": 2},
                   "num_cylinders": {"four": 4, "six": 6, "five": 5, "eight": 8,
                                     "two": 2, "twelve": 12, "three": 3}}


• To convert the columns to numbers, use replace:

>> obj_df = obj_df.replace(cleanup_nums)

>> obj_df.head()

• A nice benefit of this approach is that pandas “knows” the types of the values in the columns, so the converted columns show up as int64 rather than object (newer pandas releases may need an explicit infer_objects() call to get the same result):

>> obj_df.dtypes
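
For comparison, the same nested dictionary can be applied one column at a time with Series.map. This is a swapped-in alternative to the replace call used in the slides, sketched on hypothetical data; demo, mapping and converted are illustrative names. One difference to keep in mind: map turns any value that is missing from the dictionary into NaN, whereas replace leaves such values unchanged.

```python
import pandas as pd

# Hypothetical text columns in the same style as the data set.
demo = pd.DataFrame({
    "num_doors": ["four", "two", "four"],
    "num_cylinders": ["six", "four", "twelve"],
})

# Same structure as a cleanup dictionary: column name -> {text: number}.
mapping = {
    "num_doors": {"four": 4, "two": 2},
    "num_cylinders": {"four": 4, "six": 6, "twelve": 12},
}

converted = demo.copy()
for column, lookup in mapping.items():
    # Each value is looked up in the inner dictionary.
    # Values not found in the dictionary would become NaN.
    converted[column] = converted[column].map(lookup)

print(converted)
print(converted.dtypes)
```

When every value is covered by the dictionary, as here, the resulting columns are integer typed.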


• While this approach may only work in certain scenarios, it is a very useful demonstration of how to convert text values to numeric ones when there is an “easy” human interpretation of the data. This concept is also useful for more general data cleanup.

Approach #2 - Label Encoding

• Another approach to encoding categorical values is to use a
technique called label encoding. Label encoding is simply
converting each value in a column to a number.

• For example, the body_style column contains 5 different values.


• One trick you can use in pandas is to convert a column to a
category, then use those category values for your label encoding:
>> obj_df["body_style"] = obj_df["body_style"].astype('category')
>> obj_df.dtypes

>> obj_df["body_style_cat"] = obj_df["body_style"].cat.codes
>> obj_df.head()
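
To see which number ends up attached to which label, the category codes can be inspected alongside the category list. The sketch below uses a hypothetical five-row column; demo and code_to_label are illustrative names.

```python
import pandas as pd

# Hypothetical body_style values.
demo = pd.DataFrame(
    {"body_style": ["sedan", "hatchback", "wagon", "sedan", "convertible"]}
)

# Converting to category stores each distinct label once;
# for text labels the categories are kept in sorted order.
demo["body_style"] = demo["body_style"].astype("category")

# cat.codes is the integer position of each row's label in that list.
demo["body_style_cat"] = demo["body_style"].cat.codes

# Build a lookup so the codes can be translated back to labels.
code_to_label = dict(enumerate(demo["body_style"].cat.categories))

print(demo)
print(code_to_label)
```

Keeping a lookup like code_to_label around makes it possible to report results using the original labels rather than the codes.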

Approach #3 - One Hot Encoding

• Label encoding has the advantage that it is straightforward, but it has the disadvantage that the numeric values can be “misinterpreted” by the algorithms.
• For example, the value 0 is obviously less than the value 4, but does that ordering really correspond to anything in the data set in real life?

• A common alternative approach is called one hot encoding (it also goes by several other names).

• Despite the different names, the basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to the column.

• This has the benefit of not weighting a value improperly but does have the downside of adding more columns to the data set.


• We can look at the column drive_wheels, where we have values of 4wd, fwd or rwd.

• By using get_dummies we can convert this to three columns with a 1 or 0 corresponding to the correct value:

>> pd.get_dummies(obj_df, columns=["drive_wheels"]).head()
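
The shape of the result is easier to see on a tiny frame. The sketch below uses hypothetical rows (demo and encoded are illustrative names). Passing dtype=int is an addition made here: since pandas 2.0 get_dummies returns True/False columns by default, and dtype=int produces the 1/0 values described above.

```python
import pandas as pd

# Hypothetical rows covering the three drive_wheels values.
demo = pd.DataFrame({
    "make": ["audi", "subaru", "bmw"],
    "drive_wheels": ["fwd", "4wd", "rwd"],
})

# The drive_wheels column is replaced by one indicator column per value.
# Columns not listed in columns= (here: make) are passed through unchanged.
encoded = pd.get_dummies(demo, columns=["drive_wheels"], dtype=int)

print(encoded)
```

Each row has exactly one 1 across the three new columns, which is where the name "one hot" comes from.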


• This function is powerful because you can pass as many category columns as you would like and choose how to label the columns using prefix.

• Proper naming will make the rest of the analysis just a little bit easier.

>> pd.get_dummies(obj_df, columns=["body_style", "drive_wheels"],
                  prefix=["body", "drive"]).head()
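
The effect of prefix on the generated column names can be sketched on a hypothetical two-row frame (demo and encoded are illustrative names; dtype=int is again added here only to get 1/0 values instead of True/False):

```python
import pandas as pd

# Hypothetical rows with two categorical columns.
demo = pd.DataFrame({
    "body_style": ["sedan", "wagon"],
    "drive_wheels": ["fwd", "rwd"],
})

# prefix is matched to columns by position:
# body_style -> "body", drive_wheels -> "drive".
encoded = pd.get_dummies(
    demo,
    columns=["body_style", "drive_wheels"],
    prefix=["body", "drive"],
    dtype=int,
)

print(encoded.columns.tolist())
```

Without prefix, the new columns would be named after the full original column names, for example body_style_sedan instead of body_sedan.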

Approach #4 - Custom Binary Encoding

• In this particular data set, there is a column called engine_type that
contains several different values:
>> obj_df["engine_type"].value_counts()

• For the sake of discussion, maybe all we care about is whether or not the engine is an Overhead Cam (OHC).
• In other words, the various versions of OHC are all the same for this analysis.
• If this is the case, then we could use the str accessor plus np.where to create a new column that indicates whether or not the car has an OHC engine.

>> obj_df["OHC_Code"] = np.where(obj_df["engine_type"].str.contains("ohc"), 1, 0)
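
The following sketch shows the same idea on a handful of hypothetical engine_type values (demo and the sample values are invented for illustration). The na=False argument is an assumption added here: it makes missing values count as "no match", which matters only if the column contains nulls.

```python
import numpy as np
import pandas as pd

# Hypothetical engine_type values; three of them contain "ohc".
demo = pd.DataFrame({"engine_type": ["dohc", "ohcv", "rotor", "ohc", "l"]})

# str.contains returns True wherever the substring "ohc" appears.
# na=False treats missing values as not matching (assumption, see above).
is_ohc = demo["engine_type"].str.contains("ohc", na=False)

# np.where picks 1 where the condition is True and 0 elsewhere.
demo["OHC_Code"] = np.where(is_ohc, 1, 0)

print(demo)
```

Because the match is on a substring, variants such as dohc and ohcv are all flagged as OHC, which is exactly the grouping described above.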
