
Lec 4

The document outlines a lecture on Data Wrangling and Summarization, focusing on techniques for handling missing values, duplicates, and categorical data. It discusses methods such as using pandas functions like dropna(), fillna(), and get_dummies() to manage data effectively. The importance of addressing these issues in data science and machine learning is emphasized to ensure accurate outcomes.


Data Wrangling & Summarization

Mr. Asad Abbas


Today’s Lecture Outline
2. Data Wrangling
i. Understanding Data
ii. Filtering Data
iii. Type Casting
iv. Transformation
v. Imputing Missing Values
vi. Handling Duplicates
vii. Handling Categorical Data
viii. Normalization
ix. String Manipulation
3. Summarization
Imputing Missing Values
• Missing values can lead to all sorts of problems when
dealing with Machine Learning and Data Science related
use cases.
• Not only can they cause problems for algorithms, they
can mess up calculations and even final outcomes.
• Missing values also risk being interpreted in
non-standard ways, leading to confusion and
more errors.
• One of the easiest ways of handling missing values is to
ignore or remove them altogether from the dataset.
• When the dataset is fairly large and we have enough
samples of various types required, this option can be
safely exercised.
• We use the dropna() function from pandas in the following
snippet to remove rows of data where the date of transaction
is missing:
# df: the dataframe being wrangled (assumed to be loaded already)
print("Drop Rows with missing dates::")
df_dropped = df.dropna(subset=['date'])  # keep only rows where 'date' is present
print("Shape::", df_dropped.shape)

Dataframe without any missing date information
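
To try the dropna() call end to end, here is a minimal sketch on a small made-up dataframe. The values and the price column are assumptions for illustration only; the lecture's actual dataset is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not the lecture's dataset).
df = pd.DataFrame({
    "date": ["2016-01-05", None, "2016-02-11", None, "2016-03-20"],
    "price": [10.5, 20.0, np.nan, 7.25, 12.0],
})

# Drop only the rows whose 'date' is missing.
# Missing values in other columns (e.g. price) are left untouched.
df_dropped = df.dropna(subset=["date"])
print("Shape::", df_dropped.shape)
```

Note that dropna() returns a new dataframe and leaves df unchanged unless the result is assigned back.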


• In many scenarios, missing values are imputed with the help of
other values in the dataframe.

• One commonly used trick is to replace missing values with a
central tendency measure such as the mean or median.

• We utilize the fillna() method from pandas to fill missing price
values with the mean price from our dataframe.

• Along the same lines, we use the ffill() and bfill() functions to
impute missing values for the user_type attribute.

• Because user_type is a string attribute, a mean or median does
not apply, so we use a proximity-based solution to handle
missing values in this case.

• The ffill() function copies the value forward from the previous
row (forward fill), while bfill() copies the value back from the
next row (backward fill).

• Fill Missing Price values with mean price::

• Fill Missing user_type values with value from previous row (forward fill) ::

• Fill Missing user_type values with value from next row (backward fill) ::
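
The three fills listed above can be sketched as follows. This is an illustrative example on made-up data, not the lecture's original code; the new column names (price_filled, user_type_ffill, user_type_bfill) are hypothetical and chosen so the results can be compared side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with gaps in both columns.
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, np.nan],
    "user_type": ["a", None, "b", None],
})

# Mean imputation: mean() skips NaN, so the fill value is (10 + 30) / 2 = 20.
df["price_filled"] = df["price"].fillna(df["price"].mean())

# Forward fill: each gap takes the value from the closest earlier row.
df["user_type_ffill"] = df["user_type"].ffill()

# Backward fill: each gap takes the value from the closest later row.
# The last row has no later row, so it stays missing.
df["user_type_bfill"] = df["user_type"].bfill()

print(df)
```

Forward fill leaves a gap in the first row unfilled and backward fill leaves a gap in the last row unfilled, so the two are sometimes chained when the data has gaps at both ends.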

vi. Handling Duplicates


• Another issue with many datasets is the presence of duplicates.

• To identify duplicates, we have a utility called duplicated() that
can be applied on the whole dataframe as well as on a subset of it.

• We may handle duplicates by using the duplicated() function to
locate them and then fixing the underlying errors, although we
may also choose to drop the duplicate data points altogether.

• To drop duplicates, we use the method drop_duplicates().
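
As a rough sketch of both functions, the example below flags duplicates on the whole dataframe and on a subset of columns, then drops them. The data and the user_id and price columns are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data: row 2 repeats row 1 exactly,
# and row 4 repeats the user_id of row 3 with a different price.
df = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 3],
    "price": [10.0, 20.0, 20.0, 5.0, 8.0],
})

# Boolean mask over whole rows; the first occurrence is not flagged.
dup_rows = df.duplicated()

# Boolean mask that compares only the user_id column.
dup_users = df.duplicated(subset=["user_id"])

# Remove fully duplicated rows, keeping the first occurrence of each.
df_unique = df.drop_duplicates()

print(df[dup_users])           # inspect the flagged rows before deciding
print("Shape::", df_unique.shape)
```

Inspecting the flagged rows first helps decide whether a duplicate is a data-entry error to fix or a redundant row to drop.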

vii. Handling Categorical Data


COSC-3107 Machine Learning
Handling Categorical Data
• The attribute user_type is a categorical variable that can
take only a limited number of values from the allowed set
{a,b,c,d}.

• With pandas, we can handle categorical variables in a
couple of different ways.

• The first method is using the map() function, where we
simply map each value from the allowed set to a numeric
value.

• The second method is to convert the categorical variable
into indicator variables using the get_dummies() function.



• Method I: The first method is using the map() function,
where we simply map each value from the allowed set to a
numeric value.
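
A minimal sketch of Method I, assuming the allowed set {a,b,c,d} from the slide. The numeric codes, the type_map name, and the encoded_user_type column are arbitrary choices for illustration, not taken from the lecture.

```python
import pandas as pd

# Hypothetical sample data using the allowed set {a, b, c, d}.
df = pd.DataFrame({"user_type": ["a", "c", "b", "d", "a"]})

# Arbitrary value-to-number mapping (an assumption, any codes would do).
type_map = {"a": 0, "b": 1, "c": 2, "d": 3}

# Series.map() looks each value up in the dictionary.
# Values that are absent from the dictionary become NaN.
df["encoded_user_type"] = df["user_type"].map(type_map)

print(df)
```

Numeric codes imply an ordering (a < b < c < d) that some algorithms will pick up on, which is one reason to consider indicator variables when the categories have no natural order.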

• Method II: The second method is to convert the categorical
variable into indicator variables using the get_dummies() function.
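
A minimal sketch of the get_dummies() approach on made-up data follows. The sample values are assumptions; the generated column names follow the pandas default of prefixing each category with the source column name.

```python
import pandas as pd

# Hypothetical sample data using the allowed set {a, b, c, d}.
df = pd.DataFrame({"user_type": ["a", "c", "b", "d", "a"]})

# Replace user_type with one indicator column per category:
# user_type_a, user_type_b, user_type_c, user_type_d.
dummies = pd.get_dummies(df, columns=["user_type"])

print(dummies)
```

Each row has exactly one indicator set. Depending on the pandas version, the indicator columns hold booleans or 0/1 integers.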

Shahzad Hussain, Lecturer, Khawaja Fareed University of Engineering and Information
