Data Preprocessing

Data preprocessing involves transforming raw data into a clean and consistent format suitable for analysis, ensuring data quality and reliability. Key processes include data cleaning, integration, reduction, and transformation, which address issues like missing values, inconsistent formats, and dimensionality reduction. Techniques such as normalization, standardization, and encoding are used to prepare data for effective analysis and modeling.


Data Preprocessing

What is Data Preprocessing?
• Transforming raw data into a clean, consistent format suitable for analysis
• Ensures data quality, reliability, and analytical accuracy
Process Flow
Data Cleaning → Data Integration → Data Reduction → Data Transformation
Data Cleaning
Finding and fixing mistakes in your data

• *Missing Value Handling*: Dealing with blank spaces in your data (a short sketch follows this list):
  - *Imputation*: Filling in the blanks with educated guesses
  - *Mean/Median Substitution*: Replacing blanks with average values
  - *Prediction Models*: Using patterns in other data to guess what should be in the blank
  - *Deletion Strategies*: Removing records with too many blanks
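A minimal pandas sketch of the mean/median-substitution and deletion strategies above; the table, column names, and values are made up for illustration.

    import pandas as pd

    # Toy records with blanks (NaN); the column names are hypothetical.
    df = pd.DataFrame({"age": [22, None, 38, 45],
                       "income": [55000, 62000, None, 90000]})

    # Deletion strategy: drop rows that have fewer than 2 non-missing values.
    df_deleted = df.dropna(thresh=2)

    # Mean/median substitution: fill blanks with a typical value for each column.
    df_imputed = df.copy()
    df_imputed["age"] = df_imputed["age"].fillna(df_imputed["age"].median())
    df_imputed["income"] = df_imputed["income"].fillna(df_imputed["income"].mean())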

• *Outlier Detection*: Finding unusual values that don't fit the pattern (see the sketch below):
  - Z-score: Measuring how far a value is from the average
  - IQR (Interquartile Range): Checking whether values fall outside the middle range
  - Visualization: Creating charts to spot unusual values
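A minimal sketch of the Z-score and IQR checks on a single numeric column; the values and the cutoff of 2 standard deviations are illustrative assumptions (2 or 3 are common choices).

    import pandas as pd

    values = pd.Series([12, 14, 13, 15, 14, 98])  # 98 is the unusual value

    # Z-score: how many standard deviations each value sits from the average.
    z = (values - values.mean()) / values.std()
    z_outliers = values[z.abs() > 2]

    # IQR: flag values more than 1.5 * IQR outside the middle 50% of the data.
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]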
Data Cleaning
Finding and fixing mistakes in your data

• *Error Correction*: Fixing obvious mistakes:
  - *Fixing Inconsistent Formats*: Making sure dates all look like "MM/DD/YYYY" instead of some being "DD-MM-YYYY"
  - *Typos*: Correcting misspellings
  - *Duplicates*: Removing repeat entries

• *Data Validation*: Checking if data makes sense according to business rules:
  - For example, ensuring ages aren't negative or birthdates aren't in the future

• *Standardization*: Making sure similar information is formatted the same way:
  - All phone numbers written as (XXX) XXX-XXXX
  - All addresses following the same pattern
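A minimal pandas sketch of error correction, validation, and standardization on a hypothetical table; the column names and formats mirror the examples above.

    import pandas as pd

    # Hypothetical messy records.
    df = pd.DataFrame({
        "order_date": ["03/15/2024", "16-03-2024"],   # mixed MM/DD/YYYY and DD-MM-YYYY
        "phone": ["5551234567", "(555) 987-6543"],
        "age": [34, -2],
    })

    # Fixing inconsistent formats: parse each known style, then emit one canonical form.
    parsed = pd.to_datetime(df["order_date"], format="%m/%d/%Y", errors="coerce")
    parsed = parsed.fillna(pd.to_datetime(df["order_date"], format="%d-%m-%Y", errors="coerce"))
    df["order_date"] = parsed.dt.strftime("%Y-%m-%d")

    # Standardization: every phone number as (XXX) XXX-XXXX.
    digits = df["phone"].str.replace(r"\D", "", regex=True)
    df["phone"] = "(" + digits.str[:3] + ") " + digits.str[3:6] + "-" + digits.str[6:]

    # Data validation: flag rows that break a business rule (no negative ages).
    invalid_rows = df[df["age"] < 0]

    # Duplicates: remove repeat entries.
    df = df.drop_duplicates()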
Data Integration

• *Schema Integration*: Making sure data tables from different systems fit together:
  - Resolving when one system calls it "Customer Name" and another calls it "Client"

• *Entity Identification*: Finding when different records actually refer to the same thing:
  - Recognizing "John Smith" and "J. Smith" might be the same person

• *Data Consolidation*: Combining duplicate records into one complete record:
  - Taking the phone number from one record and the email from another to create one complete record
Data Integration

• *Conflict Resolution*: Deciding what to do when sources disagree:
  - When one system says a customer is "Active" and another says "Inactive"

• *Metadata Management*: Creating a "dictionary" that explains what each piece of data means:
  - Documenting that "DOB" means "Date of Birth" and what format it uses
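A minimal pandas sketch of schema integration, entity identification, and consolidation across two hypothetical source systems; the column names, matching rule, and sample customer are simplified assumptions.

    import pandas as pd

    # Two hypothetical systems describing the same customer.
    crm = pd.DataFrame({"customer": ["John Smith"], "phone": ["(555) 123-4567"], "email": [None]})
    billing = pd.DataFrame({"client": ["john smith"], "phone": [None], "email": ["j.smith@example.com"]})

    # Schema integration: agree on one column name for the same concept.
    billing = billing.rename(columns={"client": "customer"})

    # Entity identification: a simple normalized-name key (real systems use fuzzier matching).
    for frame in (crm, billing):
        frame["key"] = frame["customer"].str.lower().str.strip()

    # Data consolidation: merge the records, keeping the first non-missing value per field.
    merged = pd.concat([crm, billing]).groupby("key", as_index=False).first()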
EXAMPLE: Data Cleaning

Observations of Issues:
• Missing Values: Customer ID 102 has a missing "Age", and Customer ID 105 has a missing "Income".
• Inconsistent Formats: "Income" has different currency notations ($ and USD) and a missing value. "Order Date" has varying formats (MM/DD/YYYY, YYYY-MM-DD).

Explanation of Cleaning Steps:
• Age: The missing age for Customer ID 102 was imputed using the median age (22, 25, 38, 45 -> median is (25 + 38) / 2 = 31.5, rounded to 32).
• Income: The "Income" values were standardized to USD, and the missing value for Customer ID 105 was imputed using the mean income: (55000 + 62000 + 78000 + 90000) / 4 = 71250.
• Order Date: All "Order Date" entries were converted to the YYYY-MM-DD format.

This example demonstrates how data cleaning transforms messy raw records into a consistent, analysis-ready table.
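The imputation arithmetic in the example can be checked with a few lines of pandas; the small table below is reconstructed from the values quoted above, so its layout and row order are assumptions.

    import pandas as pd

    # Ages and incomes quoted in the example; None marks the missing cells.
    df = pd.DataFrame({
        "age":    [22, None, 25, 38, 45],
        "income": [55000, 62000, 78000, None, 90000],
    })

    median_age  = df["age"].median()    # (25 + 38) / 2 = 31.5 -> rounded to 32
    mean_income = df["income"].mean()   # (55000 + 62000 + 78000 + 90000) / 4 = 71250.0

    df["age"]    = df["age"].fillna(round(median_age))
    df["income"] = df["income"].fillna(mean_income)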
Data Reduction
Techniques to reduce the volume of data while preserving its integrity and analytical value

THERE ARE TWO TYPES OF DATA REDUCTION TECHNIQUES:
• Dimensionality Reduction
• Numerosity Reduction
DIMENSIONALITY REDUCTION
Decreasing the number of variables (columns) in your data:

• DWT (Discrete Wavelet Transform): Representing the data with a small set of wavelet coefficients to reduce dimensionality
• PCA (Principal Component Analysis): Combining related variables into new summary variables
• Feature Selection: Keeping only the most important variables
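A minimal scikit-learn sketch of PCA and a simple feature-selection step on random data; the shapes, component count, and variance threshold are illustrative, and the wavelet transform (DWT) would need a separate library such as PyWavelets, so it is not shown.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import VarianceThreshold

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))   # 100 rows, 10 hypothetical numeric columns

    # PCA: combine correlated columns into a few summary components.
    X_pca = PCA(n_components=3).fit_transform(X)          # shape (100, 3)

    # Feature selection: keep only columns whose values vary enough to be informative.
    X_selected = VarianceThreshold(threshold=0.1).fit_transform(X)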
Numerosity Reduction
Decreasing the number of records (rows) in your data:

• Sampling: Taking a representative subset of records
• Binning: Grouping similar values together
• Clustering: Grouping similar records and using the group's average

Example of numerosity reduction:
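A minimal sketch of the three techniques on 100 hypothetical records; the column name, bin count, and cluster count are arbitrary choices for the example.

    import pandas as pd
    from sklearn.cluster import KMeans

    df = pd.DataFrame({"amount": range(1, 101)})   # 100 hypothetical records

    # Sampling: keep a representative 10% subset of the rows.
    sample = df.sample(frac=0.1, random_state=0)

    # Binning: replace each value with the average of its group of similar values.
    df["bin"] = pd.cut(df["amount"], bins=5)
    binned = df.groupby("bin", observed=True)["amount"].mean()

    # Clustering: group similar records and keep one average row per cluster.
    df["cluster"] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(df[["amount"]])
    clustered = df.groupby("cluster")["amount"].mean()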
OTHER METHODS

• Data Compression: Using techniques to store the same information in less space:
  - Like zipping a file on your computer

• Discretization: Converting continuous numbers into categories:
  - Changing ages (18, 19, 20...) into age groups (18-25, 26-35...)

• Redundancy Elimination: Removing information that can be calculated from other data:
  - Removing "Age" if you already have "Birth Date"
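A minimal pandas sketch of discretization and redundancy elimination; the bin edges and group labels are assumptions made for the example.

    import pandas as pd

    df = pd.DataFrame({"birth_year": [1990, 2001, 1975], "age": [35, 24, 50]})

    # Discretization: convert continuous ages into labelled groups.
    df["age_group"] = pd.cut(df["age"], bins=[17, 25, 35, 120], labels=["18-25", "26-35", "36+"])

    # Redundancy elimination: drop a column that can be recomputed from another one.
    df = df.drop(columns=["age"])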
Data Transformation
Goal: The goal of data transformation is to convert and restructure data into a format that is more suitable and efficient for analysis and modeling. This often involves scaling, aggregating, or encoding data to bring it into a consistent and usable range or representation.

Processes:
• Normalization: Scaling numerical data to a specific range, typically between 0 and 1. This is useful when features have different scales and can prevent features with larger values from dominating the analysis. A common method is Min-Max scaling:
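Min-Max scaling rescales each value as x' = (x - min) / (max - min); a minimal sketch with made-up values:

    import numpy as np

    x = np.array([10.0, 20.0, 55.0, 100.0])   # hypothetical feature values

    # Min-Max scaling: x' = (x - min) / (max - min); the result lies in [0, 1].
    x_scaled = (x - x.min()) / (x.max() - x.min())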

• Standardization: Scaling numerical data to have a mean of 0 and a standard deviation of 1 (also known as Z-score scaling). This is helpful when the data follows a normal distribution or when algorithms are sensitive to feature scaling:
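Z-score scaling subtracts the mean and divides by the standard deviation, z = (x - mean) / std; a minimal sketch with the same made-up values:

    import numpy as np

    x = np.array([10.0, 20.0, 55.0, 100.0])   # hypothetical feature values

    # Z-score scaling: z = (x - mean) / std, giving mean 0 and standard deviation 1.
    z = (x - x.mean()) / x.std()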
• Encoding Categorical Data: Converting categorical variables (e.g., colors, city names) into numerical representations that machine learning algorithms can understand. Common techniques include:
  - Label Encoding: Assigning a unique numerical label to each category (e.g., Red=0, Blue=1, Green=2).
  - One-Hot Encoding: Creating binary (0 or 1) columns for each category. For example, the "Color" column with values "Red," "Blue," "Green" would be transformed into three columns: "Color_Red," "Color_Blue," "Color_Green."
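A minimal pandas sketch of both encodings for the "Color" example; note that pandas assigns label codes alphabetically, so the exact integers may differ from the Red=0, Blue=1, Green=2 illustration.

    import pandas as pd

    df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})

    # Label encoding: one integer code per category (codes are assigned alphabetically).
    df["Color_label"] = df["Color"].astype("category").cat.codes

    # One-hot encoding: one 0/1 column per category (Color_Red, Color_Blue, Color_Green).
    one_hot = pd.get_dummies(df["Color"], prefix="Color")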

• Aggregation: Summarizing data by grouping it based on certain attributes (e.g., calculating the average sales per region or the total number of customers per city).
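A minimal groupby sketch of both aggregations; the table and its values are hypothetical.

    import pandas as pd

    sales = pd.DataFrame({
        "region":   ["North", "North", "South"],
        "city":     ["Leeds", "York", "Bath"],
        "customer": ["A", "B", "C"],
        "amount":   [120.0, 80.0, 200.0],
    })

    # Average sales per region and number of customers per city.
    avg_sales_per_region = sales.groupby("region")["amount"].mean()
    customers_per_city   = sales.groupby("city")["customer"].nunique()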
• Concept Hierarchy Generation: Attributes such as street can be generalized to higher-level concepts, like city or country. Many hierarchies for nominal attributes are implicit within the database schema and can be automatically defined at the schema definition level.

• Discretization: The raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute.

While discretization is applied to numeric (continuous) data, concept hierarchy generation is applied to nominal (categorical) data.
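A minimal sketch of a hand-written street -> city -> country hierarchy applied with simple lookup tables; the streets and mappings are made up for illustration.

    import pandas as pd

    df = pd.DataFrame({"street": ["Baker Street", "Princes Street"]})

    # Hypothetical lookup tables encoding the hierarchy street -> city -> country.
    street_to_city  = {"Baker Street": "London", "Princes Street": "Edinburgh"}
    city_to_country = {"London": "UK", "Edinburgh": "UK"}

    df["city"]    = df["street"].map(street_to_city)
    df["country"] = df["city"].map(city_to_country)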
Thank You
Payal - 04719051623