Lec 4

The document outlines a lecture on Data Wrangling and Summarization, focusing on techniques for handling missing values, duplicates, and categorical data. It discusses methods such as using pandas functions like dropna(), fillna(), and get_dummies() to manage data effectively. The importance of addressing these issues in data science and machine learning is emphasized to ensure accurate outcomes.

Uploaded by

opoe14055


Data Wrangling & Summarization

Mr. Asad Abbas

Today’s Lecture Outline
2. Data Wrangling
i. Understanding Data
ii. Filtering Data
iii. Type Casting
iv. Transformation
v. Imputing Missing Values
vi. Handling Duplicates
vii. Handling Categorical Data
viii. Normalization
ix. String Manipulation
3. Summarization
Imputing Missing Values
• Missing values can lead to all sorts of problems when
dealing with Machine Learning and Data Science related
use cases.
• Not only can they cause problems for algorithms, they
can mess up calculations and even final outcomes.
• Missing values may also be interpreted in non-standard
ways, leading to confusion and more errors.
• One of the easiest ways of handling missing values is to
ignore or remove them altogether from the dataset.
• When the dataset is fairly large and we have enough
samples of various types required, this option can be
safely exercised.
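Before removing rows, it can help to check how much of the data is actually missing. The following is a minimal sketch, not the lecture's own code: the small dataframe is hypothetical (the column names date, price and user_type follow the lecture, the values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: column names follow the lecture, values are invented.
df = pd.DataFrame({
    "date": ["2017-01-01", None, "2017-01-03", "2017-01-04"],
    "price": [10.0, 20.0, np.nan, 40.0],
    "user_type": ["a", None, "b", "c"],
})

# Number of missing values in each column
missing_counts = df.isna().sum()

# Share of rows that contain at least one missing value
rows_with_missing = df.isna().any(axis=1).mean()

print(missing_counts)
print("Share of rows with a missing value:", rows_with_missing)
```

If the share of affected rows is small relative to the dataset, dropping them is less likely to distort the remaining samples.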
Imputing Missing Values
• We use the dropna() method from pandas in the following
snippet to remove rows of data where the date of transaction
is missing:
print("Drop rows with missing dates:")
df_dropped = df.dropna(subset=['date'])
print("Shape:", df_dropped.shape)

Dataframe without any missing date information
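The snippet above assumes a dataframe df that already exists. A self-contained sketch of the same idea, using a hypothetical four-row dataframe (only the date column name comes from the lecture), might look like this:

```python
import pandas as pd

# Hypothetical transactions: two of the four rows have no date.
df = pd.DataFrame({
    "date": ["2017-01-01", None, "2017-01-03", None],
    "price": [10.0, 20.0, None, 40.0],
})

# Drop only the rows whose 'date' is missing.
# A missing price on its own does not remove a row, because of subset=['date'].
df_dropped = df.dropna(subset=["date"])

print("Shape before:", df.shape)
print("Shape after:", df_dropped.shape)
```

Note that dropna() returns a new dataframe here, so the original df keeps all four rows.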

Imputing Missing Values
• In many scenarios, missing values are imputed with the help of
other values in the dataframe.
• One commonly used trick is to replace missing values with a
central tendency measure like the mean or median.
• We utilize the fillna() method from pandas to fill missing price
values with the mean price from our dataframe.
• Along the same lines, we use the ffill() and bfill() functions to
impute missing values for the user_type attribute.
• Because user_type is a string type attribute, we use a
proximity-based solution to handle missing values in this case.
• The ffill() function copies the value forward from the previous
row (forward fill); bfill() copies the value back from the next row
(backward fill).

Imputing Missing Values
• Fill missing price values with the mean price:
• Fill missing user_type values with the value from the previous row (forward fill):
• Fill missing user_type values with the value from the next row (backward fill):
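The three fills listed above can be sketched as follows. This is an illustrative example on a hypothetical dataframe (the price and user_type column names come from the lecture, the values are assumptions), not the original slide code.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns.
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, 50.0],
    "user_type": ["a", None, "b", None],
})

# Fill missing price values with the mean of the observed prices (here 30.0).
df["price"] = df["price"].fillna(df["price"].mean())

# Forward fill: copy the value from the previous row into the gap.
user_type_ffill = df["user_type"].ffill()

# Backward fill: copy the value from the next row into the gap.
user_type_bfill = df["user_type"].bfill()

print(df["price"].tolist())
print(user_type_ffill.tolist())
print(user_type_bfill.tolist())
```

A gap in the last row has no next row to copy from, so bfill() leaves it missing; likewise ffill() cannot fill a gap in the first row.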

vi. Handling Duplicates

Handling Duplicates
• Another issue with many datasets is the presence of duplicates.

• To identify duplicates, we have a utility called duplicated() that
can be applied to the whole dataframe as well as to a subset of it.
• We may handle duplicates by using duplicated() to find them and
fixing the underlying errors, although we may also choose to drop
the duplicate data points altogether.
• To drop duplicates, we use the drop_duplicates() method.
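As a rough illustration of both functions, consider the sketch below. The dataframe and its user_id column are hypothetical and chosen only to make the duplicates easy to see.

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 1 exactly; rows 3 and 4 share a user_id only.
df = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 3],
    "user_type": ["a", "b", "b", "c", "d"],
})

# True for each row that repeats an earlier row in every column.
full_dupes = df.duplicated()

# The same check restricted to a subset of columns.
id_dupes = df.duplicated(subset=["user_id"])

# Drop fully duplicated rows, keeping the first occurrence of each.
df_unique = df.drop_duplicates()

print(full_dupes.tolist())
print(id_dupes.tolist())
print("Shape after dropping duplicates:", df_unique.shape)
```

Rows flagged only by the subset check (same user_id, different user_type) may point to a data entry error that is worth fixing rather than dropping.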

vii. Handling Categorical Data

COSC-3107 Machine Learning
Handling Categorical Data
• The attribute user_type is a categorical variable that can
take only a limited number of values from the allowed set
{a, b, c, d}.
• With pandas, we can handle categorical variables in a
couple of different ways.
• The first method is using the map() function, where we
simply map each value from the allowed set to a numeric
value.
• The second method is to convert the categorical variable
into indicator variables using the get_dummies() function.

Handling Categorical Data

• Method I: The first method is using the map() function,
where we simply map each value from the allowed set to a
numeric value.
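A minimal sketch of Method I is shown below. The particular numbers assigned to a, b, c and d, and the encoded_user_type column name, are assumptions made for illustration; the lecture only states that each allowed value is mapped to a numeric value.

```python
import pandas as pd

# Hypothetical user_type values drawn from the allowed set {a, b, c, d}.
df = pd.DataFrame({"user_type": ["a", "c", "b", "d", "a"]})

# Assumed encoding: any one-to-one assignment of numbers would work.
type_map = {"a": 0, "b": 1, "c": 2, "d": 3}

# map() looks up each value in the dictionary.
df["encoded_user_type"] = df["user_type"].map(type_map)

print(df)
```

Values that do not appear in the dictionary become NaN, so the mapping should cover the whole allowed set. The numeric codes also suggest an ordering (a < b < c < d) that may not exist in the data.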

Handling Categorical Data
• Method II: The second method is to convert the categorical variable
into indicator variables using the get_dummies() function.
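Method II could be sketched as follows, again on hypothetical data. The prefix argument and the df_encoded name are illustrative choices, not taken from the lecture.

```python
import pandas as pd

# Hypothetical user_type values drawn from the allowed set {a, b, c, d}.
df = pd.DataFrame({"user_type": ["a", "c", "b", "d", "a"]})

# One indicator column per category: user_type_a, user_type_b, ...
dummies = pd.get_dummies(df["user_type"], prefix="user_type")

# Attach the indicator columns to the original dataframe.
df_encoded = pd.concat([df, dummies], axis=1)

print(df_encoded)
```

Each row has exactly one indicator set, so unlike the numeric mapping this encoding does not imply any order among the categories.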

Shahzad Hussain, Lecturer, Khawaja Fareed University of Engineering and Information
