Data Preprocessing

Data preprocessing involves transforming raw data into a clean and consistent format suitable for analysis, ensuring data quality and reliability. Key processes include data cleaning, integration, reduction, and transformation, which address issues like missing values, inconsistent formats, and dimensionality reduction. Techniques such as normalization, standardization, and encoding are used to prepare data for effective analysis and modeling.

Uploaded by

prarit.work

Data Preprocessing

What is Data Preprocessing?
Transforming raw data into a clean, consistent format suitable for analysis
• Ensures data quality, reliability, and analytical accuracy

Process Flow
Data Cleaning

Finding and fixing mistakes in your data
• *Missing Value Handling*: Dealing with blank spaces in your data:
- *Imputation*: Filling in the blanks with educated guesses
- *Mean/Median Substitution*: Replacing blanks with the mean or median of the known values
- *Prediction Models*: Using patterns in other data to guess what should be in the blank
- *Deletion Strategies*: Removing records with too many blanks

• *Outlier Detection*: Finding unusual values that don't fit the pattern:
- Z-score: Measuring how many standard deviations a value lies from the mean
- IQR (Interquartile Range): Checking whether values fall far outside the middle 50% of the data
- Visualization: Creating charts to spot unusual values
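
The two numeric outlier checks above can be sketched as follows. This is a minimal illustration using only Python's standard library. The thresholds (3 standard deviations, 1.5 × IQR) are common conventions assumed here, not values given in the slides, and the sample ages are made up.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds `threshold` (assumed cutoff)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs((v - mean) / stdev) > threshold]

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

ages = [22, 25, 27, 30, 31, 33, 38, 45, 120]  # made-up sample with one suspicious age
print(iqr_outliers(ages))                     # [120]
print(zscore_outliers(ages, threshold=2.0))   # [120]
```

In small samples a single extreme value inflates the standard deviation, so the z-score check may need a lower threshold than the IQR check to flag the same point.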
• *Error Correction*: Fixing obvious mistakes:
- *Fixing Inconsistent Formats*: Making sure dates all look like "MM/DD/YYYY" instead of some being "DD-MM-YYYY"
- *Typos*: Correcting misspellings
- *Duplicates*: Removing repeat entries

• *Data Validation*: Checking if data makes sense according to business rules:
- For example, ensuring ages aren't negative or birthdates aren't in the future

• *Standardization*: Making sure similar information is formatted the same way:
- All phone numbers written as (XXX) XXX-XXXX
- All addresses following the same pattern
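
A rough sketch of the format fixing, validation, and standardization steps above, using only Python's standard library. The list of accepted input date formats, the 10-digit phone assumption, and the record field names (`age`, `birthdate`) are illustrative assumptions rather than anything specified in the slides.

```python
import re
from datetime import date, datetime

# Assumed input formats; they are told apart by their separators and field order.
DATE_FORMATS = ("%m/%d/%Y", "%d-%m-%Y", "%Y-%m-%d")

def standardize_date(text, out_format="%m/%d/%Y"):
    """Parse `text` with the first matching known format and re-emit it in `out_format`."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime(out_format)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

def standardize_phone(text):
    """Format a 10-digit phone number as (XXX) XXX-XXXX."""
    digits = re.sub(r"\D", "", text)  # drop everything that is not a digit
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {text!r}")
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

def validate_record(record, today):
    """Return a list of business-rule violations for one record."""
    problems = []
    if record["age"] < 0:
        problems.append("age is negative")
    if record["birthdate"] > today:
        problems.append("birthdate is in the future")
    return problems

print(standardize_date("25-12-2023"))     # 12/25/2023
print(standardize_phone("555.123.4567"))  # (555) 123-4567
```

Returning a list of violations, rather than raising on the first one, lets a cleaning job report every problem in a record at once.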
Data Integration

• *Schema Integration*: Making sure data tables from different systems fit together:
- Resolving when one system calls it "Customer Name" and another calls it "Client"

• *Entity Identification*: Finding when different records actually refer to the same thing:
- Recognizing "John Smith" and "J. Smith" might be the same person

• *Data Consolidation*: Combining duplicate records into one complete record:
- Taking the phone number from one record and the email from another to create one complete record

• *Conflict Resolution*: Deciding what to do when sources disagree:
- When one system says a customer is "Active" and another says "Inactive"

• *Metadata Management*: Creating a "dictionary" that explains what each piece of data means:
- Documenting that "DOB" means "Date of Birth" and what format it uses
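
To make schema integration, consolidation, and conflict resolution concrete, here is a small sketch in plain Python. The alias table, the field names, and the "first source wins" conflict rule are all hypothetical choices for illustration. Deciding that two records describe the same person (entity identification) is assumed to have happened already.

```python
# Hypothetical mapping from source-specific column names to one shared name.
COLUMN_ALIASES = {"Customer Name": "name", "Client": "name"}

def unify_schema(record):
    """Rename source-specific columns so records from both systems line up."""
    return {COLUMN_ALIASES.get(key, key): value for key, value in record.items()}

def consolidate(records):
    """Merge records for one entity, keeping the first non-empty value per field.

    Listing the most trusted source first is the (assumed) conflict-resolution rule.
    """
    merged = {}
    for record in records:
        for key, value in record.items():
            if value not in (None, "") and key not in merged:
                merged[key] = value
    return merged

crm = unify_schema({"Customer Name": "John Smith", "phone": "(555) 123-4567",
                    "email": None, "status": "Active"})
billing = unify_schema({"Client": "J. Smith", "phone": None,
                        "email": "john@example.com", "status": "Inactive"})

customer = consolidate([crm, billing])  # crm is listed first, so it wins conflicts
print(customer)
```

Here the phone number comes from the first record, the email from the second, and the conflicting "status" is settled in favour of the first source.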
Example: Data Cleaning

Observations of Issues:
• Missing Values: Customer ID 102 has a missing "Age", and Customer ID 105 has a missing "Income".
• Inconsistent Formats: "Income" has different currency notations ($ and USD) and a missing value. "Order Date" has varying formats (MM/DD/YYYY, YYYY-
Explanation of Cleaning Steps:
• Age: The missing age for Customer ID 102 was imputed using the median of the known ages (22, 25, 38, 45): (25 + 38) / 2 = 31.5, rounded to 32.
• Income: The "Income" values were standardized to USD, and the missing value for Customer ID 105 was imputed using the mean income: (55000 + 62000 + 78000 + 90000) / 4 = 71250.
• Order Date: All "Order Date" entries were converted to the YYYY-MM-DD format.
This example visually demonstrates how data cleaning transfor
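
The imputation arithmetic in this example can be checked with a short sketch. It uses the ages and incomes listed in the cleaning steps. The position of each missing value within the lists is an assumption, since the original table is not reproduced here.

```python
import statistics

def impute(values, strategy):
    """Replace None entries with the median or mean of the known values."""
    known = [v for v in values if v is not None]
    fill = statistics.median(known) if strategy == "median" else statistics.mean(known)
    return [fill if v is None else v for v in values]

# Row order is assumed: customer 102 is the second row, customer 105 the fifth.
ages = [22, None, 25, 38, 45]
incomes = [55000, 62000, 78000, 90000, None]

imputed_ages = impute(ages, "median")
imputed_incomes = impute(incomes, "mean")

print(imputed_ages[1])      # 31.5 (median of 22, 25, 38, 45)
print(imputed_incomes[4])   # mean of the four known incomes
```

The median is the usual choice for a skewed column such as age or income, because a single extreme value shifts the mean much more than the median.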
Data Reduction
Techniques to reduce the volume of data while preserving its integrity and analytical value

There are two types of data reduction techniques:

• Dimensionality Reduction
• Numerosity Reduction
Dimensionality Reduction
Decreasing the number of variables (columns) in your data:

• DWT (Discrete Wavelet Transform): Transforming the data into wavelet coefficients and keeping only a subset of them to reduce dimensionality
• PCA (Principal Component Analysis): Combining related variables into new summary variables
• Feature Selection: Keeping only the most important variables
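
As an illustration of how wavelet coefficients can shrink a data vector, the sketch below applies one level of the Haar wavelet, the simplest wavelet and an assumption here, since the slides do not name one. Keeping only the approximation coefficients halves the number of values while preserving the overall shape.

```python
import math

def haar_step(signal):
    """One level of the Haar wavelet transform on an even-length sequence.

    Returns (approximation, detail) coefficient lists, each half as long as the input.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / math.sqrt(2) for a, b in pairs]  # scaled pairwise sums
    detail = [(a - b) / math.sqrt(2) for a, b in pairs]  # scaled pairwise differences
    return approx, detail

approx, detail = haar_step([4.0, 6.0, 10.0, 12.0])
print(len(approx))  # 2: half as many values as the original 4
```

Dropping `detail` is lossy. Keeping both lists allows exact reconstruction, since each original pair can be recovered from its sum and difference.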
Numerosity Reduction
Decreasing the number of records (rows) in your data:

• Sampling: Taking a representative subset of records
• Binning: Grouping similar values together
• Clustering: Grouping similar records and using the group's average

Example of Numerosity Reduction:
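
One possible illustration of sampling and binning, sketched with Python's standard library. The function names, the fixed random seed, and the sample values are assumptions made for this sketch.

```python
import math
import random
import statistics

def sample_records(records, k, seed=0):
    """Simple random sample of k records without replacement (seeded for repeatability)."""
    return random.Random(seed).sample(records, k)

def bin_means(values, n_bins):
    """Equal-frequency binning: sort, split into n_bins groups, return each group's mean."""
    ordered = sorted(values)
    size = math.ceil(len(ordered) / n_bins)
    return [statistics.mean(ordered[i:i + size]) for i in range(0, len(ordered), size)]

records = list(range(100))           # stand-in for 100 rows
subset = sample_records(records, 10)
print(len(subset))                   # 10

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(bin_means(prices, 3))          # nine values summarized by three bin means
```

Sampling keeps real rows, while binning replaces groups of values with a summary, so the choice depends on whether individual records still matter downstream.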
Other Methods

Data Compression
Using techniques to store the same information using less space:
- Like zipping a file on your computer

Discretization
Converting continuous numbers into categories:
- Changing ages (18, 19, 20...) into age groups (18-25, 26-35...)

Redundancy Elimination
Removing information that can be calculated from other data:
- Removing "Age" if you already have "Birth Date"
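
Redundancy elimination and discretization combine naturally: the age can be derived from the birth date when needed and then mapped to a group. A minimal sketch, assuming the 18-25 and 26-35 groups from the slide plus a hypothetical open-ended "36+" group:

```python
import bisect
from datetime import date

def age_on(birth_date, today):
    """Age in whole years, derived from the birth date instead of being stored."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)

LOWER_BOUNDS = [18, 26, 36]           # lower edge of each group
LABELS = ["18-25", "26-35", "36+"]    # the "36+" group is an assumption

def age_group(age):
    """Map a numeric age to its group label, or None if below the first group."""
    index = bisect.bisect_right(LOWER_BOUNDS, age) - 1
    return LABELS[index] if index >= 0 else None

age = age_on(date(2000, 6, 15), today=date(2024, 6, 14))
print(age, age_group(age))  # 23 18-25
```

Passing `today` explicitly keeps the function deterministic and easy to test.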
Data Transformation
Goal: The goal of data transformation is to convert and restructure data into a format that is
more suitable and efficient for analysis and modeling. This often involves scaling,
aggregating, or encoding data to bring it into a consistent and usable range or
representation.

Processes:
• Normalization: Scaling numerical data to a specific range, typically between 0 and 1. This is useful when features have different scales and can prevent features with larger values from dominating the analysis. A common method is Min-Max scaling:
x' = (x - min) / (max - min)

• Standardization: Scaling numerical data to have a mean of 0 and a standard deviation of 1 (also known as Z-score scaling). This is helpful when the data follows a normal distribution or when algorithms are sensitive to feature scaling:
z = (x - mean) / (standard deviation)
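
Both scaling methods can be sketched in a few lines of standard-library Python. The sample values are made up, and using the population standard deviation (rather than the sample one) is an assumption of this sketch.

```python
import statistics

def min_max_scale(values):
    """Min-Max scaling: (x - min) / (max - min), mapping values into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score scaling: (x - mean) / standard deviation."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation (assumed)
    if stdev == 0:
        return [0.0 for _ in values]
    return [(v - mean) / stdev for v in values]

print(min_max_scale([10, 20, 30, 40, 50]))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))
```

Min-Max scaling is sensitive to outliers because a single extreme value sets the range. Standardization is less affected but does not bound the output.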
• Encoding Categorical Data: Converting categorical variables (e.g., colors, city names) into numerical representations that machine learning algorithms can understand. Common techniques include:
- Label Encoding: Assigning a unique numerical label to each category (e.g., Red=0, Blue=1, Green=2).
- One-Hot Encoding: Creating binary (0 or 1) columns for each category. For example, the "Color" column with values "Red", "Blue", "Green" would be transformed into three columns: "Color_Red", "Color_Blue", "Color_Green".
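
The two encodings can be sketched in plain Python as follows. Assigning labels in order of first appearance is an assumption chosen so the result matches the Red=0, Blue=1, Green=2 example. Libraries often sort the categories instead.

```python
def label_encode(values):
    """Assign integer labels in order of first appearance; return (codes, mapping)."""
    mapping = {}
    for value in values:
        mapping.setdefault(value, len(mapping))
    return [mapping[v] for v in values], mapping

def one_hot_encode(values, column="Color"):
    """Return one dict per row with a 0/1 entry for every category."""
    categories = list(dict.fromkeys(values))  # unique values, original order kept
    return [{f"{column}_{c}": int(v == c) for c in categories} for v in values]

colors = ["Red", "Blue", "Green", "Blue"]

codes, mapping = label_encode(colors)
print(codes)                      # [0, 1, 2, 1]
print(one_hot_encode(colors)[0])  # {'Color_Red': 1, 'Color_Blue': 0, 'Color_Green': 0}
```

Label encoding implies an ordering (Green > Blue > Red) that does not exist for colors, which is why one-hot encoding is usually preferred for unordered categories.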

• Aggregation: Summarizing data by grouping it based on certain attributes (e.g., calculating the average sales per region or the total number of customers per city).
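
A small sketch of aggregation, grouping rows by one attribute and averaging another. The rows, field names, and the helper name `average_by` are hypothetical.

```python
import statistics
from collections import defaultdict

def average_by(rows, key, value):
    """Group rows by `key` and return the mean of `value` for each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {group: statistics.mean(vals) for group, vals in groups.items()}

sales = [  # made-up rows
    {"region": "North", "sales": 100},
    {"region": "North", "sales": 300},
    {"region": "South", "sales": 50},
]

averages = average_by(sales, "region", "sales")
print(averages)  # one average per region
```

Swapping `statistics.mean` for `len` or `sum` gives counts or totals per group with the same grouping logic.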
• Concept Hierarchy Generation: Attributes such as street can be generalized to higher-level concepts, like city or country. Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level.
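
Generalizing street to city to country can be sketched with two lookup tables. The streets and mappings below are hypothetical sample data. In practice they would come from the database schema or a reference table.

```python
from collections import Counter

# Hypothetical hierarchy: street -> city -> country.
STREET_TO_CITY = {
    "Baker Street": "London",
    "Oxford Street": "London",
    "Rue de Rivoli": "Paris",
}
CITY_TO_COUNTRY = {"London": "United Kingdom", "Paris": "France"}

def generalize(street, level):
    """Climb the hierarchy from a street up to the requested level."""
    if level == "street":
        return street
    city = STREET_TO_CITY[street]
    if level == "city":
        return city
    if level == "country":
        return CITY_TO_COUNTRY[city]
    raise ValueError(f"unknown level: {level!r}")

orders = ["Baker Street", "Oxford Street", "Rue de Rivoli"]
by_city = Counter(generalize(street, "city") for street in orders)
print(by_city)  # Counter({'London': 2, 'Paris': 1})
```

Counting at the city or country level instead of the street level is what turns many distinct values into a few analyzable groups.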
• Discretization: The raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute.

While discretization is done for numerical continuous data, concept hierarchy generation
Thank You
Payal - 04719051623
