Data Preparation.
By Ahlem Marzouk
What is data preparation?
It's a time-consuming process of
preparing raw data for analysis.
Data is collected from a variety of sources.
✓ Surveys.
✓ Interviews.
✓ Observations
✓ Document Analysis
✓ Scraping
Data discovery and processing.
Explore the data collected to better understand what it contains and what
needs to be done to prepare it for its intended uses.
➢ Discovery of structures
➢ Exploring Connections
We detect connections, similarities, differences and associations
between data sources.
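As a rough illustration of structure discovery, the sketch below profiles a handful of records and reports, for each field, which value types appear and how many values are missing. The `records` sample, its field names, and the `profile` helper are hypothetical and exist only for this example.

```python
from collections import Counter

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"age": 34, "city": "Tunis", "income": 2100.0},
    {"age": None, "city": "Sfax", "income": 1800.0},
    {"age": 29, "city": "Tunis", "income": None},
]

def profile(rows):
    """Summarize each field: value types seen and number of missing values."""
    summary = {}
    for row in rows:
        for field, value in row.items():
            info = summary.setdefault(field, {"types": Counter(), "missing": 0})
            if value is None:
                info["missing"] += 1
            else:
                info["types"][type(value).__name__] += 1
    return summary

report = profile(records)
for field, info in report.items():
    print(field, dict(info["types"]), "missing:", info["missing"])
```

A summary like this is usually enough to decide which fields need cleaning or conversion before analysis.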
Data Cleaning
The process of correcting incorrect, incomplete, duplicate or otherwise erroneous data in a dataset.
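A minimal sketch of one possible cleaning pass is shown below. It tidies a text field, then drops rows that are incomplete, out of range, or duplicated. The `raw` records, the `clean` helper, and the 0 to 120 age range are assumptions made for the example. Dropping unrepairable rows is only one policy; imputing missing values is another.

```python
# Hypothetical raw records; field names and values are illustrative only.
raw = [
    {"id": 1, "name": " Alice ", "age": 30},
    {"id": 1, "name": " Alice ", "age": 30},   # exact duplicate
    {"id": 2, "name": "Bob", "age": None},     # incomplete
    {"id": 3, "name": "Carol", "age": -5},     # erroneous value
    {"id": 4, "name": "dave", "age": 41},
]

def clean(rows):
    """Drop duplicate, incomplete and out-of-range rows; tidy the name field."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # incomplete record
        if not 0 <= row["age"] <= 120:
            continue  # implausible age (assumed valid range)
        # Correct formatting problems: stray whitespace and inconsistent case.
        fixed = dict(row, name=row["name"].strip().title())
        key = tuple(sorted(fixed.items()))
        if key in seen:
            continue  # duplicate record
        seen.add(key)
        cleaned.append(fixed)
    return cleaned

cleaned = clean(raw)
print(cleaned)
```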
Data Transformation
This involves converting the data into a format suitable for the desired analysis techniques.
For example, this may involve converting textual data into numerical data, or normalizing data to a common scale.
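To make the text-to-number conversion concrete, here is a small sketch of label encoding, which assigns an integer code to each distinct text value. The `sizes` column and the `label_encode` helper are hypothetical. Label encoding is only one of several possible encodings, and its integer codes imply an order that may not be meaningful for the data.

```python
# Hypothetical categorical column; the values are illustrative only.
sizes = ["small", "large", "medium", "small", "large"]

def label_encode(values):
    """Map each distinct text value to an integer code, in sorted order."""
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

codes, mapping = label_encode(sizes)
print(mapping)  # {'large': 0, 'medium': 1, 'small': 2}
print(codes)    # [2, 0, 1, 2, 0]
```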
Data standardization: This involves converting data into a common format (for example, a single date format or a single notation for IP addresses) or into a single unit of measurement. Numerical values can also be rescaled to a common range, such as 0 to 1.
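Converting to a common format can be sketched for dates as follows. The example rewrites dates given in several notations as ISO 8601 (`YYYY-MM-DD`). The list of accepted input formats in `KNOWN_FORMATS` and the `to_iso_date` helper are assumptions for illustration; real data may need more formats or a rule for ambiguous day/month order.

```python
from datetime import datetime

# Assumed input formats; extend the tuple for other conventions.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y")

def to_iso_date(text):
    """Return the date as YYYY-MM-DD, trying each known format in turn."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"unrecognized date format: {text!r}")

dates = ["2023-03-07", "07/03/2023", "7 March 2023"]
iso_dates = [to_iso_date(d) for d in dates]
print(iso_dates)  # ['2023-03-07', '2023-03-07', '2023-03-07']
```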
Standardization: Z-Score Normalization
Goal:
The goal of Z-Score Normalization is to rescale the features so that they have the properties of a standard normal distribution with a mean of 0 and a standard deviation of 1.
Method (Z-score normalization):
Z-Score Normalization involves subtracting the mean of the variable from each data point and then dividing by the standard deviation.
Formula:
$z = \dfrac{x - \mu}{\sigma}$, where $x$ is a data point, $\mu$ is the mean of the variable and $\sigma$ is its standard deviation.
Properties:
After Z-Score Normalization, the transformed data has a mean of 0 and a standard deviation of 1.
Less affected by outliers: Standardization is less sensitive to the presence of outliers than Min-Max Scaling.
It is better suited to data with an unknown range or a non-normal distribution.
It is used for algorithms that are sensitive to outliers.
Example
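As a worked instance of the Z-Score method, the sketch below subtracts the mean from each value and divides by the standard deviation. The `data` values and the `z_score_normalize` helper are illustrative. The sketch assumes the population standard deviation (`statistics.pstdev`); using the sample standard deviation instead would give slightly different values.

```python
from statistics import fmean, pstdev

def z_score_normalize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = fmean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(x - mu) / sigma for x in values]

# Illustrative data with mean 5 and population standard deviation 2.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = z_score_normalize(data)
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Each value is now expressed as a number of standard deviations from the mean. For instance, 9.0 lies two standard deviations above the mean of 5.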
Standardization: Min-Max Normalization
Goal:
The goal of Min-Max Normalization is to scale the values of a variable to a specific range, usually between 0 and 1.
Method (Min-Max Normalization):
For each data point, the minimum value of the variable is subtracted, and the result is divided by the range (the difference between the maximum and minimum values).
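Formula:
$x' = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$, where $x$ is a data point and $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the variable.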
Min-Max Normalization is used for data with a known range.
It preserves the relative ordering of data points.
Example
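As a worked instance of Min-Max Normalization, the sketch below subtracts the minimum from each value and divides by the range. The `data` values, the `min_max_normalize` helper, and its optional `new_min` and `new_max` parameters are illustrative and not taken from the slides.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so the smallest maps to new_min, the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; the range is zero")
    return [new_min + (x - lo) / (hi - lo) * (new_max - new_min) for x in values]

# Illustrative data with minimum 10 and maximum 50.
data = [10.0, 20.0, 25.0, 50.0]
scaled = min_max_normalize(data)
print(scaled)  # [0.0, 0.25, 0.375, 1.0]
```

The scaled values keep the same order as the inputs, and the minimum and maximum land exactly on 0 and 1.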