Data Preprocessing
What is Data Preprocessing?
Transforming raw data into a clean, consistent format suitable for analysis
• Ensures data quality, reliability, and analytical accuracy
Process Flow
Data Cleaning
• *Outlier Detection*: Finding unusual values that don't fit the pattern:
- Z-score: Measuring how many standard deviations a value lies from the mean
- IQR (Interquartile Range): Checking whether values fall far outside the middle 50% of the data
- Visualization: Creating charts (such as box plots or scatter plots) to spot unusual values
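The two numeric checks above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a prescribed method: the function names, the z-score threshold, the 1.5 × IQR fence, and the sample ages are all assumptions made for the example.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold (assumed default: 3)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs((v - mean) / stdev) > threshold]

def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR (k=1.5 is a common convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample: ten plausible ages plus one data-entry error
ages = [22, 25, 27, 29, 31, 33, 35, 38, 41, 45, 120]

# A lower threshold is used here because z-scores are bounded in small samples
print(zscore_outliers(ages, threshold=2.5))
print(iqr_outliers(ages))
```

Both checks flag only the value 120 in this sample. On small datasets the IQR rule tends to be the more robust of the two, because a single extreme value inflates the standard deviation that the z-score depends on.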
Data Cleaning
Finding and fixing mistakes in your data
*Error Correction* - Fixing obvious mistakes:
- *Fixing Inconsistent Formats*: Making sure dates all look like "MM/DD/YYYY" instead of some being "DD-MM-YYYY"
- *Typos*: Correcting misspellings
- *Duplicates*: Removing repeat entries
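A rough sketch of these three fixes, using only the Python standard library. The record fields, the typo table, and the list of accepted date formats are hypothetical and would need to match the real data.

```python
from datetime import datetime

# Assumed input formats; the target format is MM/DD/YYYY as in the slide
KNOWN_FORMATS = ("%m/%d/%Y", "%d-%m-%Y")

# Hypothetical correction table for known misspellings
TYPO_FIXES = {"Nwe York": "New York"}

def normalize_date(text):
    """Parse a date in any known format and return it as MM/DD/YYYY."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue  # try the next format
    raise ValueError(f"unrecognized date format: {text!r}")

def clean_records(records):
    """Fix formats and typos, then drop records that become exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        fixed = {
            "name": rec["name"].strip(),
            "city": TYPO_FIXES.get(rec["city"], rec["city"]),
            "signup": normalize_date(rec["signup"]),
        }
        key = tuple(fixed.values())
        if key not in seen:  # keep only the first occurrence
            seen.add(key)
            cleaned.append(fixed)
    return cleaned

records = [
    {"name": "Ana Lee", "city": "Nwe York", "signup": "03/15/2024"},
    {"name": "Ana Lee", "city": "New York", "signup": "15-03-2024"},
    {"name": "Bo Chen", "city": "Boston", "signup": "04/01/2024"},
]
cleaned = clean_records(records)
print(cleaned)
```

The order of operations matters: the first two records only become recognizable as duplicates after the typo and the date format have been fixed, so deduplication runs last.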
• *Schema Integration*: Making sure data tables from different systems fit together:
- Resolving cases where one system calls a field "Customer Name" and another calls it "Client"
• *Entity Identification*: Finding when different records actually refer to the same thing:
- Recognizing that "John Smith" and "J. Smith" might be the same person
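Both ideas can be illustrated with a small Python sketch. The alias table, the unified column name `customer_name`, and the matching rule (same first initial and same last name) are assumptions chosen for the example; real entity matching usually needs more evidence than a name.

```python
# Hypothetical mapping from each source system's column name to one shared name
COLUMN_ALIASES = {"Customer Name": "customer_name", "Client": "customer_name"}

def unify_schema(row):
    """Rename known column aliases so rows from different systems line up."""
    return {COLUMN_ALIASES.get(col, col): val for col, val in row.items()}

def name_key(full_name):
    """Reduce a non-empty name to (first initial, last name) for rough matching."""
    parts = full_name.replace(".", "").lower().split()
    return (parts[0][0], parts[-1])

def possible_same_entity(name_a, name_b):
    """True when two names MIGHT refer to the same person (a candidate, not proof)."""
    return name_key(name_a) == name_key(name_b)

row_a = unify_schema({"Customer Name": "John Smith"})
row_b = unify_schema({"Client": "J. Smith"})
print(row_a, row_b)
print(possible_same_entity(row_a["customer_name"], row_b["customer_name"]))
```

This rule deliberately over-matches: "Jane Smith" would also pair with "J. Smith". Treating the result as a list of candidates to confirm with other fields (email, address, date of birth) is the safer design.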
Observations of Issues:
Missing Values: Customer ID 102 has a missing "Age", and Customer ID 105 has a missing "Income".
Inconsistent Formats: "Income" has different currency notations ($ and USD) and a missing value.
"Order Date" has varying formats (MM/DD/YYYY, YYYY-
Explanation of Cleaning Steps:
• Age: The missing age for Customer ID 102 was imputed using the median of the known ages
(22, 25, 38, 45): (25 + 38) / 2 = 31.5, rounded to 32.
• Income: The "Income" values were standardized to USD, and the missing value for Customer
ID 105 was imputed using the mean income: (55000 + 62000 + 78000 + 90000) / 4 = 71250.
• Order Date: All "Order Date" entries were converted to the YYYY-MM-DD format.
This example visually demonstrates how data cleaning transfor
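The imputation arithmetic can be checked with a short Python sketch. It uses the age and income values quoted in the cleaning steps; the `impute` helper and the use of `None` to mark a missing entry are assumptions for illustration.

```python
import statistics

# Known (non-missing) values quoted in the cleaning steps
known_ages = [22, 25, 38, 45]
known_incomes = [55000, 62000, 78000, 90000]

median_age = statistics.median(known_ages)    # (25 + 38) / 2 = 31.5
imputed_age = round(median_age)               # Python rounds halves to even: 31.5 -> 32
mean_income = statistics.mean(known_incomes)  # 285000 / 4 = 71250

def impute(values, fill):
    """Replace missing entries (represented here as None) with a fill value."""
    return [fill if v is None else v for v in values]

# Hypothetical column orderings with the missing entries marked as None
print(impute([22, None, 25, 38, 45], imputed_age))
print(impute([55000, 62000, 78000, None, 90000], mean_income))
```

The median is often preferred for skewed columns such as age or income because it is less affected by extreme values than the mean.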
Data Reduction
Techniques to reduce the volume of data while preserving its integrity and analytical value
Dimensionality Reduction
Numerosity Reduction
DIMENSIONALITY REDUCTION
Decreasing the number of variables (columns) in your data:
Discretization
Converting continuous numbers into categories:
Changing ages (18, 19, 20...) into age groups (18-25, 26-35...)
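A minimal Python sketch of this binning step. The first two groups come from the slide; the remaining groups and the "other" fallback label are assumptions added to make the example complete.

```python
# Inclusive (low, high) bounds; only 18-25 and 26-35 appear in the slide,
# the later bins are assumed for illustration
AGE_BINS = [(18, 25), (26, 35), (36, 45), (46, 60)]

def age_group(age):
    """Map an exact age to its group label, or 'other' if no bin covers it."""
    for low, high in AGE_BINS:
        if low <= age <= high:
            return f"{low}-{high}"
    return "other"

ages = [18, 19, 20, 26, 35, 41, 72]
print([age_group(a) for a in ages])
```

After binning, a column with dozens of distinct ages holds only a handful of distinct labels, which is what makes the data smaller and easier to summarize.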
Redundancy Elimination
Removing information that can be calculated from other data:
- Removing "Age" if you already have "Birth Date"
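The reason a stored age is redundant is that it can be recomputed from the date of birth whenever it is needed. A minimal sketch, assuming a record stored as a dictionary with hypothetical keys `birth_date` and `Age`:

```python
from datetime import date

def age_on(birth_date, today):
    """Compute age in whole years on a given day from the date of birth."""
    years = today.year - birth_date.year
    # Subtract one year if this year's birthday has not happened yet
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

def drop_redundant_age(record):
    """Return a copy of the record without the derivable 'Age' field."""
    return {k: v for k, v in record.items() if k != "Age"}

record = {"name": "Ana Lee", "birth_date": date(1990, 6, 15), "Age": 33}
slim = drop_redundant_age(record)
print(slim)
print(age_on(slim["birth_date"], date(2024, 6, 15)))
```

Dropping the derived column also removes a source of inconsistency, since a stored age goes stale while a birth date does not.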
Data Transformation
Goal: Convert and restructure data into a format that is more suitable and efficient for
analysis and modeling. This often involves scaling, aggregating, or encoding data to bring
it into a consistent and usable range or representation.
Processes:
• Normalization: Scaling numerical data to a specific range, typically between 0 and 1. This is
useful when features have different scales and can prevent features with larger values from
dominating the analysis. A common method is Min-Max scaling: x' = (x - min) / (max - min).
• Standardization: Scaling numerical data to have a mean of 0 and a standard deviation of 1 (also
known as Z-score scaling): z = (x - mean) / standard deviation. This is helpful when the data
roughly follows a normal distribution or when algorithms are sensitive to feature scaling.
• Encoding Categorical Data: Converting categorical variables (e.g., colors, city names) into numerical
representations that machine learning algorithms can understand. Common techniques include:
- Label Encoding: Assigning a unique numerical label to each category (e.g., Red=0, Blue=1, Green=2).
• Aggregation: Summarizing data by grouping it based on certain attributes (e.g., calculating the
average sales per region or the total number of customers per city).
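The four processes above can be sketched with the Python standard library alone. This is an illustrative sketch rather than a reference implementation: the function names and sample data are assumptions, and standardization here uses the population standard deviation.

```python
import statistics
from collections import defaultdict

def min_max_scale(values):
    """Normalization: x' = (x - min) / (max - min), mapping values into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: z = (x - mean) / stdev, giving mean 0 and stdev 1."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation (assumed choice)
    if stdev == 0:
        return [0.0 for _ in values]
    return [(v - mean) / stdev for v in values]

def label_encode(categories):
    """Label encoding: assign 0, 1, 2, ... in order of first appearance."""
    mapping = {}
    for c in categories:
        mapping.setdefault(c, len(mapping))
    return [mapping[c] for c in categories], mapping

def average_by(rows, key, value):
    """Aggregation: average of `value` for each distinct `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: statistics.mean(v) for k, v in groups.items()}

print(min_max_scale([10, 20, 30]))
print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))
print(label_encode(["Red", "Blue", "Green", "Blue"]))

# Hypothetical sales rows
sales = [
    {"region": "East", "sales": 100},
    {"region": "East", "sales": 300},
    {"region": "West", "sales": 50},
]
print(average_by(sales, "region", "sales"))
```

Label encoding implies an ordering (Red < Blue < Green) that the categories do not actually have, so it suits ordinal data or tree-based models better than models that interpret the numbers as magnitudes.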
Concept hierarchy generation: Attributes such as street can be generalized to higher-level
concepts, like city or country. Many hierarchies for nominal attributes are implicit within
the database schema and can be automatically defined at the schema definition level.
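As a rough illustration, a street-to-city-to-country hierarchy can be represented as a chain of lookup tables. The place names and table contents below are invented for the example; in a real database these mappings would typically come from the schema's own address columns.

```python
# Hypothetical lookup tables, one per step up the hierarchy
STREET_TO_CITY = {"Main St": "Springfield", "Elm St": "Springfield", "Bay Rd": "Riverton"}
CITY_TO_COUNTRY = {"Springfield": "Country A", "Riverton": "Country B"}

# Levels ordered from most specific to most general
HIERARCHY = ["street", "city", "country"]

def generalize(street, level):
    """Roll a street value up to the requested level of the hierarchy."""
    if level not in HIERARCHY:
        raise ValueError(f"unknown level: {level!r}")
    if level == "street":
        return street
    city = STREET_TO_CITY[street]
    if level == "city":
        return city
    return CITY_TO_COUNTRY[city]

streets = ["Main St", "Elm St", "Bay Rd"]
print([generalize(s, "city") for s in streets])
print([generalize(s, "country") for s in streets])
```

Each step up the hierarchy reduces the number of distinct values, so the same records can be analyzed at a coarser or finer level without changing the underlying data.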
Discretization: The raw values of a numeric attribute (e.g., age) are replaced by interval
labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels,
in turn, can be recursively organized into higher-level concepts, resulting in a concept
hierarchy for the numeric attribute.