Session 7 - Data Preprocessing and Transformation - 2025
BUSINESS MANAGEMENT
Session 7
Data Preprocessing and Transformation
M.Sc. Thien Nguyen
Email: thien.nguyen@isb.edu.vn
Phone: 0949088908
Agenda
1. Data Cleaning for analysis
2. Create effective data visualization for insights
3. Read the Data
4. Practicing with Data
5. Q&A
Part I
1. What is data cleaning?
2. How to clean data
What is data cleaning?
How to clean data
Step 1: Remove duplicate or irrelevant observations
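As a minimal sketch of Step 1 outside Excel, the snippet below uses pandas. The sample rows, the column names, and the rule that rows with city "Test" are irrelevant are all assumptions made for illustration; they do not come from the course dataset.

```python
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only.
sales = pd.DataFrame({
    "invoice_id": ["A1", "A2", "A2", "A3", "A4"],
    "city": ["Hanoi", "Hue", "Hue", "Hanoi", "Test"],
    "total": [105.0, 210.0, 210.0, 52.5, 0.0],
})

# Remove rows that are exact copies of an earlier row.
deduped = sales.drop_duplicates()

# Remove observations that are irrelevant to the analysis
# (assumption: rows entered under the city "Test" are test records).
cleaned = deduped[deduped["city"] != "Test"].reset_index(drop=True)

print(len(sales), len(deduped), len(cleaned))
```

What counts as irrelevant depends on the question being analyzed, so the filter condition should be decided before any rows are dropped.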
How to clean data
You can’t ignore missing data, because many algorithms will not accept missing values.
There are a couple of ways to deal with missing data. Neither is optimal, but both can be considered.
1. As a first option, you can drop observations that have missing values. Doing this loses information, so be mindful of that before you remove them.
2. As a second option, you can impute missing values based on other observations. Again, this can compromise the integrity of the data, because you may be operating from assumptions rather than actual observations.
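The two options above can be sketched in pandas as follows. The data and column names are hypothetical, and using the median as the fill value is only one possible assumption, not a recommendation from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing rating.
ratings = pd.DataFrame({
    "invoice_id": ["A1", "A2", "A3", "A4"],
    "rating": [8.0, np.nan, 6.0, 7.0],
})

# First, count how many values are missing in each column.
missing_per_column = ratings.isna().sum()

# Option 1: drop observations whose rating is missing (information is lost).
dropped = ratings.dropna(subset=["rating"])

# Option 2: fill the gap with a value derived from the other observations
# (assumption: the median of the observed ratings is a reasonable stand-in).
median_rating = ratings["rating"].median()
imputed = ratings.assign(rating=ratings["rating"].fillna(median_rating))

print(missing_per_column)
print(dropped)
print(imputed)
```

Keeping the original frame untouched and writing the results to new variables makes it easy to compare both options before committing to one.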
How to clean data
Step 5: Validate and QA
At the end of the data cleaning process, you should be able to answer these
questions as a part of basic validation:
• Does the data make sense?
• Does the data follow the appropriate rules for its field?
• Can you find trends in the data to help you form your next theory?
• If not, is that because of a data quality issue?
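Some of the validation questions above can be turned into automated checks. The sketch below is one way to do that in pandas; the sample rows, the 0 to 10 rating range, and the list of allowed payment types are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data containing two deliberate rule violations.
orders = pd.DataFrame({
    "invoice_id": ["A1", "A2", "A3"],
    "payment": ["Cash", "Ewallet", "Cheque"],
    "rating": [8.5, 11.0, 6.0],
})

# Assumed field rules; replace with the real rules for your dataset.
ALLOWED_PAYMENTS = {"Cash", "Credit card", "Ewallet"}

checks = {
    "invoice_id is unique": bool(orders["invoice_id"].is_unique),
    "rating within 0-10": bool(orders["rating"].between(0, 10).all()),
    "payment in allowed set": bool(orders["payment"].isin(ALLOWED_PAYMENTS).all()),
}

# Collect the names of the checks that did not pass.
failed = [name for name, passed in checks.items() if not passed]
print(failed)
```

Checks like these only cover the field rules. Whether the data makes sense and supports a trend still needs a human look.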
Part II
Read The Data
1. Understand each data field: meaning, properties & data type
2. Find duplicates
3. Check missing data
4. Identify common errors
5. Identify relationships/causal links
II. Read The Data
Data types
Qualitative
● nominal (named categories with no order)
● binary (nominal with two values, True/False)
● ordinal (ordered categories)
⇒ unstructured data (text, category, datetime)
Quantitative
● discrete
● continuous
● interval
⇒ structured data
Source: https://www.intellspot.com/data-types/
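To see how these types might be represented in code, the sketch below maps them onto pandas dtypes. The column names and values are hypothetical, and the mapping (category for nominal, ordered category for ordinal, bool for binary, int for discrete, float for continuous) is one common convention rather than the only option.

```python
import pandas as pd

# Hypothetical sample data, one column per data type.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],   # nominal
    "is_member": [True, False, True],         # binary
    "size": ["Small", "Large", "Medium"],     # ordinal
    "quantity": [3, 1, 7],                    # discrete
    "unit_price": [12.5, 40.0, 8.75],         # continuous
})

# Nominal: categories without any order.
df["gender"] = df["gender"].astype("category")

# Ordinal: categories with an explicit order, declared from smallest to largest.
size_type = pd.CategoricalDtype(["Small", "Medium", "Large"], ordered=True)
df["size"] = df["size"].astype(size_type)

print(df.dtypes)
print(df["size"].max())
```

Declaring the order explicitly matters: without it, sorting or taking a maximum on "size" would either fail or fall back to alphabetical order.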
1. Understand each column
No.  Column         Meaning                  Data type (qualitative)     Data type (quantitative)  Relationships
1    Invoice ID     ID of each order         Nominal                     -                         -
2    City           Name of branch           Nominal                     -                         -
3    Customer type  Name of cus. types       Ordinal                     -                         -
4    Gender         Gender type              Nominal                     -                         -
5    Product line   ID of each product       Number (integer) ⇒ Nominal  -                         -
6    Cogs           Cost of goods sold       -                           Continuous                -
7    Tax 5%         5% of COGS               -                           Continuous                = 5% of Col6
8    Total          Total amount of payment  -                           Continuous                = Col6 + Col7
9    Date           Date of sale             Ordinal                     -                         -
10   Time           Time of sale             Ordinal                     -                         -
11   Payment        Type of payment          Nominal                     -                         -
12   Rating         Score of rating          -                           Discrete                  -
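The two relationships listed in the table (Tax 5% = 5% of Cogs, Total = Cogs + Tax 5%) can also be checked programmatically. The sketch below assumes hypothetical column names and sample values, with the third row deliberately wrong, and compares with a small tolerance to avoid false alarms from floating-point rounding.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the third row has an incorrect tax value.
sales = pd.DataFrame({
    "cogs": [100.0, 200.0, 50.0],
    "tax_5pct": [5.0, 10.0, 3.0],
    "total": [105.0, 210.0, 52.5],
})

# Relationship 1: tax should equal 5% of COGS.
tax_ok = np.isclose(sales["tax_5pct"], 0.05 * sales["cogs"])

# Relationship 2: total should equal COGS plus tax.
total_ok = np.isclose(sales["total"], sales["cogs"] + sales["tax_5pct"])

# Rows that break at least one of the two relationships.
violations = sales[~(tax_ok & total_ok)]
print(violations)
```

A row that fails these checks points to a data entry or calculation error worth investigating before any analysis.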
2. Find duplicates
➤Select the column → Conditional Formatting → Highlight Cells Rules → Duplicate Values
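For readers working outside Excel, a rough pandas counterpart of the Duplicate Values highlight is sketched below. The data and the choice of "invoice_id" as the column to inspect are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data; "A1" and "A2" each appear twice.
invoices = pd.DataFrame({
    "invoice_id": ["A1", "A2", "A2", "A3", "A1"],
    "total": [105.0, 210.0, 210.0, 52.5, 99.0],
})

# keep=False marks every occurrence of a repeated value,
# similar to how Excel highlights all cells that share a value.
is_dup = invoices["invoice_id"].duplicated(keep=False)

# Show the flagged rows side by side for manual review.
flagged = invoices[is_dup].sort_values("invoice_id")
print(flagged)
```

Reviewing the flagged rows before deleting anything is worthwhile: here the two "A1" rows share an ID but have different totals, which suggests an entry error rather than a true duplicate.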
5. Identify relationships/causal links
Individual Activities
THANK YOU