Session 7 - Data Preprocessing and Transformation - 2025

The document outlines the process of data preprocessing and transformation in business management, focusing on data cleaning techniques. It details steps such as removing duplicates, fixing structural errors, filtering outliers, handling missing data, and validating the cleaned data. Additionally, it emphasizes understanding data types and relationships for effective analysis and visualization.

Uploaded by

My Thao

PROBLEM SOLVING IN BUSINESS MANAGEMENT
Session 7
Data Preprocessing and Transformation
M.Sc. Thien Nguyen
Email: thien.nguyen@isb.edu.vn
Phone: 0949088908
Agenda
1. Data Cleaning for analysis
2. Create effective data visualization for insights
3. Read the Data
4. Practicing with Data
5. Q&A
Part I
Data cleaning for analysis
1. What is data cleaning?
2. How to clean data
Source: https://www.tableau.com/learn/articles/what-is-data-cleaning

What is data cleaning?

• Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
• If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct.

How to clean data
Step 1: Remove duplicate or irrelevant observations

• Remove unwanted observations from your dataset, including duplicate observations or irrelevant observations.
• Duplicate observations will happen most often during data collection.
For example, if you want to analyze data regarding millennial customers, but your dataset includes older generations, you might remove those irrelevant observations.
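
Step 1 could be sketched in Python roughly as follows. This is only an illustration: the records, the field names, and the 1981-1996 birth-year range used for "millennial" are assumptions, not part of the course dataset.

```python
# Hypothetical customer records; field names and values are made up for illustration.
rows = [
    {"customer_id": "C1", "birth_year": 1990, "city": "Hanoi"},
    {"customer_id": "C1", "birth_year": 1990, "city": "Hanoi"},    # exact duplicate
    {"customer_id": "C2", "birth_year": 1965, "city": "Da Nang"},  # outside the target group
    {"customer_id": "C3", "birth_year": 1985, "city": "HCMC"},
]

def drop_duplicates(records):
    """Keep the first occurrence of each fully identical record."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the whole record
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def keep_millennials(records, first_year=1981, last_year=1996):
    """Keep records whose birth year falls in the assumed millennial range."""
    return [r for r in records if first_year <= r["birth_year"] <= last_year]

cleaned = keep_millennials(drop_duplicates(rows))
print([r["customer_id"] for r in cleaned])
```

Removing duplicates before filtering keeps the two decisions separate, so each can be reviewed on its own.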

How to clean data

Step 2: Fix structural errors

• Structural errors arise when you measure or transfer data and notice strange naming conventions, typos, or incorrect capitalization.
• These inconsistencies can cause mislabeled categories or classes.
For example, you may find that “N/A” and “Not Applicable” both appear, but they should be analyzed as the same category.
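
A minimal Python sketch of this kind of fix is shown below. The mapping table and the sample values are assumptions chosen to mirror the “N/A” example. In practice the list of variants would come from inspecting the actual column.

```python
from collections import Counter

# Assumed mapping from variant spellings (lowercased) to one canonical label.
CANONICAL = {"n/a": "N/A", "na": "N/A", "not applicable": "N/A"}

def normalize_label(value):
    """Trim and collapse whitespace, then map known variants case-insensitively."""
    cleaned = " ".join(value.split())
    return CANONICAL.get(cleaned.lower(), cleaned)

raw = ["N/A", "Not Applicable", " not applicable ", "Member", "n/a"]
counts = Counter(normalize_label(v) for v in raw)
print(counts)
```

Counting categories after normalization is a quick way to confirm that the variants have collapsed into a single class.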
How to clean data

Step 3: Filter unwanted outliers

• There will be one-off observations that do not appear to fit within the data you are analyzing.
• If you have a legitimate reason to remove an outlier, such as improper data entry, doing so will make the resulting analysis more reliable.
• However, sometimes it is the appearance of an outlier that will prove a theory you are working on.
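
The slides do not prescribe a detection rule. One common rule of thumb, used here purely as an assumption, is to flag values more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch flags candidates for review instead of deleting them, since an outlier may be legitimate.

```python
import statistics

def flag_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical daily sales figures; 500 looks like a data-entry slip.
sales = [52, 48, 50, 51, 49, 47, 53, 500]
print(flag_outliers(sales))
```

Whether a flagged value is removed should still depend on finding a legitimate reason, such as a confirmed entry error.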
How to clean data
Step 4: Handle missing data

You can’t ignore missing data because many algorithms will not accept missing values.
There are a couple of ways to deal with missing data. Neither is optimal, but both can be considered.
1. As a first option, you can drop observations that have missing values, but doing this loses information, so be mindful of this before you remove them.
2. As a second option, you can fill in (impute) missing values based on other observations; again, you may compromise the integrity of the data because you are operating from assumptions and not actual observations.
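
Both options can be sketched in a few lines of Python. The rating values are hypothetical, and filling with the median is just one possible imputation choice, assumed here for illustration.

```python
import statistics

# Hypothetical ratings; None marks a missing value.
ratings = [7.0, None, 8.5, 6.0, None, 9.0]

def drop_missing(values):
    """Option 1: discard observations with missing values."""
    return [v for v in values if v is not None]

def impute_median(values):
    """Option 2: replace missing values with the median of the observed ones."""
    fill = statistics.median(drop_missing(values))
    return [fill if v is None else v for v in values]

print(drop_missing(ratings))
print(impute_median(ratings))
```

Dropping shrinks the sample from six to four observations, while imputing keeps all six but adds two values that were never observed.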

How to clean data
Step 5: Validate and QA

At the end of the data cleaning process, you should be able to answer these
questions as a part of basic validation:
• Does the data make sense?
• Does the data follow the appropriate rules for its field?
• Can you find trends in the data to help you form your next theory?
• If not, is that because of a data quality issue?
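
The question about field rules can be turned into automated checks. The sketch below is hypothetical: the field names, the allowed payment labels, and the 1-10 rating range are assumptions for illustration and would need to be replaced by the real rules of the dataset.

```python
# Hypothetical rules: the allowed payment labels and the 1-10 rating range
# are assumptions for illustration, not taken from the dataset.
ALLOWED_PAYMENTS = {"Cash", "Credit card", "Ewallet"}

def validate(record):
    """Return a list of rule violations for one record (empty means it passes)."""
    problems = []
    if record["payment"] not in ALLOWED_PAYMENTS:
        problems.append("unknown payment method: " + repr(record["payment"]))
    if not 1 <= record["rating"] <= 10:
        problems.append("rating out of range: " + repr(record["rating"]))
    if record["cogs"] < 0:
        problems.append("negative cost of goods sold")
    return problems

records = [
    {"payment": "Cash", "rating": 8.4, "cogs": 120.0},
    {"payment": "cash", "rating": 11, "cogs": -5.0},
]
for rec in records:
    print(validate(rec) or "ok")
```

Returning every violation, instead of stopping at the first, makes it easier to judge whether a problem is isolated or a wider data quality issue.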

Part II
Read The Data
1. Understand each data field: meaning, properties & data type
2. Find duplicates
3. Check missing data
4. Identify common errors
5. Identify relationships/causality

II. Read The Data

● Before being processed & cleaned: data fields (information fields)
● After being processed & cleaned: data features

Data types

Qualitative
● nominal (named categories)
● binary (True/False categories)
● ordinal (ordered categories)
⇒ unstructured data (text, category, datetime)

Quantitative
● discrete
● continuous
● interval
⇒ structured data

Source: https://www.intellspot.com/data-types/
II. Read The Data
1. Understand each column

#   Column         Meaning                  Data type (qualitative)  Data type (quantitative)  Relationships
1   Invoice ID     ID of each order         Nominal
2   City           Name of branch           Nominal
3   Customer type  Name of customer types   Ordinal
4   Gender         Gender type              Nominal
5   Product line   ID of each product       ⇒ Nominal                Number (integer)
6   Cogs           Cost of goods sold                                Continuous
7   Tax 5%         5% of COGS                                        Continuous                = 5% of Col6
8   Total          Total amount of payment                           Continuous                = Col6 + Col7
9   Date           Date of sale             Ordinal
10  Time           Time of sale             Ordinal
11  Payment        Type of payment          Nominal
12  Rating         Score of rating                                   Discrete
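
The two relationships in the table (Tax = 5% of Cogs, Total = Cogs + Tax) can be checked row by row. The following is a minimal sketch with hypothetical rows and lowercase key names chosen for illustration. The tolerance of 0.01 is also an assumption, meant to absorb rounding.

```python
import math

def check_derived(row, rate=0.05, tol=0.01):
    """Return True when tax and total agree with the stated relationships."""
    expected_tax = row["cogs"] * rate
    expected_total = row["cogs"] + row["tax"]
    return (math.isclose(row["tax"], expected_tax, abs_tol=tol)
            and math.isclose(row["total"], expected_total, abs_tol=tol))

# Hypothetical rows; the second one has an inconsistent total.
rows = [
    {"cogs": 100.0, "tax": 5.0, "total": 105.0},
    {"cogs": 200.0, "tax": 10.0, "total": 220.0},
]
print([check_derived(r) for r in rows])
```

Rows that fail such a check are candidates for the error review in the later steps.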

II. Read The Data
2. Find duplicates

➤ Select the column → Conditional Formatting → Highlight Cells Rules → Duplicate Values

Quickly remove duplicates

Source: Filter For Unique Values Or Remove Duplicate Values
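
Outside a spreadsheet, the same two actions (spotting duplicate values in a column, then removing them) could be sketched in Python as follows. The invoice IDs are hypothetical.

```python
from collections import Counter

# Hypothetical Invoice ID column.
invoice_ids = ["A-001", "A-002", "A-001", "A-003", "A-002", "A-001"]

def duplicate_values(column):
    """Return each value that occurs more than once, with its count."""
    counts = Counter(column)
    return {value: n for value, n in counts.items() if n > 1}

def remove_duplicates(column):
    """Keep the first occurrence of each value, preserving the original order."""
    return list(dict.fromkeys(column))

print(duplicate_values(invoice_ids))
print(remove_duplicates(invoice_ids))
```

Listing the duplicates before removing them mirrors the highlight step: it shows what would be lost before anything is deleted.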


II. Read The Data
4. Check common errors

Name of cities:
- Hanoi vs. Ha Noi?
- HCM City vs. HCMC?
- Da Nang vs. Đà Nẵng?

Name of customer type (CT) and Gender:
- Member vs. member?
- Normal vs. Nomal?
- Female vs. female?

Format of date, time:
- 1/5/19 vs. 1-5-19?
- Are all time values correct?

Format of numbers:
- 1.5 or 1,5?
- Are all values numerical?

Name of Payment method:
- Only 3 methods?
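
Several of these checks can be expressed as small normalization helpers. The sketch below makes explicit assumptions: the alias table is hypothetical, dates are read as month/day/year (the slides do not say which order applies), and a comma in a number is treated as a decimal separator.

```python
import unicodedata
from datetime import datetime

def _key(name):
    """Collapse whitespace, normalize Unicode, and lowercase for comparison."""
    return unicodedata.normalize("NFC", " ".join(name.split())).lower()

# Assumed alias table mapping spelling variants to one canonical city name.
CITY_ALIASES = {_key(k): v for k, v in {
    "Hanoi": "Hanoi", "Ha Noi": "Hanoi",
    "HCM City": "HCMC", "HCMC": "HCMC",
    "Da Nang": "Da Nang", "Đà Nẵng": "Da Nang",
}.items()}

def clean_city(name):
    """Return the canonical city name, or the trimmed input if it is unknown."""
    return CITY_ALIASES.get(_key(name), name.strip())

def parse_number(text):
    """Accept '1.5' or '1,5'. Not safe if commas are thousands separators."""
    return float(text.strip().replace(",", "."))

def parse_date(text, formats=("%m/%d/%y", "%m-%d-%y")):
    """Try each assumed month-first format and return a date."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError("unrecognized date format: " + repr(text))

print(clean_city("Ha Noi"), clean_city("Đà Nẵng"))
print(parse_number("1,5"), parse_date("1/5/19"), parse_date("1-5-19"))
```

Values that no helper recognizes are returned unchanged or raise an error, so unknown variants stay visible instead of being silently coerced.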

II. Read The Data
5. Identify relationships/causality

- There are no price and unit quantity fields. Should we collect more data?
- What are the relations between the fields?
- Is there any relation/correlation?
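
One way to start answering the correlation question is the Pearson correlation coefficient between two numeric columns. The sketch below uses hypothetical values and computes the coefficient by hand. Note that a strong correlation indicates a linear relationship, not a causal one.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical columns: total is derived from cogs, rating is not.
cogs = [100.0, 200.0, 300.0, 400.0]
total = [105.0, 210.0, 315.0, 420.0]
rating = [9.0, 4.0, 7.0, 5.0]

print(round(pearson(cogs, total), 3))
print(round(pearson(cogs, rating), 3))
```

A coefficient close to 1 between Cogs and Total is expected here because one column is computed from the other, which is itself a relationship worth recording.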

Individual Activities
THANK YOU
