Data Preparation

By Ahlem Marzouk
What is data preparation?
It is the time-consuming process of preparing raw data for analysis.

It is often the most difficult step of an analysis, and also the most important one.
Goal:

Provide a clean, consistent and well-organized data set that can be used to generate accurate and reliable information.

To:

❑ Ensure data quality.
❑ Make data more accessible.
❑ Improve analysis efficiency.
Why is data preparation important?
❑ Reduces the risk of inaccurate information: it allows us to identify and correct data errors, and keeps us from taking bad risks based on faulty data.
❑ Improves model accuracy.
❑ Makes data more accessible.
❑ Improves analysis efficiency.
The steps in data preparation
“ Collect Data
Data is collected from a variety of sources:

✓ Surveys
✓ Interviews
✓ Observations
✓ Document analysis
✓ Scraping
“ Data discovery and processing
Explore the collected data to better understand what it contains and what needs to be done to prepare it for its intended uses.

➢ Discovery of structures

✓ Identify the data type of each attribute.
✓ Ensure that the format of each attribute is correct.
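
The two checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names, the sample rows and the ISO date format (`%Y-%m-%d`) are hypothetical, and `infer_type` and `invalid_values` are helper names made up for this example, not part of any library.

```python
from datetime import datetime

def parses(value, parse):
    """Return True if `parse` accepts the text value."""
    try:
        parse(value)
        return True
    except ValueError:
        return False

# Candidate types, tried from the narrowest to the widest (assumed order).
PARSERS = [
    ("integer", int),
    ("float", float),
    ("date", lambda v: datetime.strptime(v, "%Y-%m-%d")),
]

def infer_type(values):
    """Name the first type that fits every non-empty value, else 'text'."""
    present = [v for v in values if v != ""]
    for name, parse in PARSERS:
        if present and all(parses(v, parse) for v in present):
            return name
    return "text"

def invalid_values(values, parse):
    """List the non-empty values that do not match the expected format."""
    return [v for v in values if v != "" and not parses(v, parse)]

# Hypothetical raw rows, as they might come out of a CSV file.
rows = [
    {"age": "34", "height_m": "1.72", "signup": "2023-01-15", "city": "Tunis"},
    {"age": "27", "height_m": "1.80", "signup": "2023-02-03", "city": "Sfax"},
    {"age": "", "height_m": "1.65", "signup": "2023-03-21", "city": "Sousse"},
]

column_types = {name: infer_type([r[name] for r in rows]) for name in rows[0]}
print(column_types)
```

Once the expected type of an attribute is known, `invalid_values` can list the entries that break the format, for example a date written as `15/01/2023` in a column that should use `2023-01-15`.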




➢ Discovery of the content


✓ Identify the missing values.

✓ Identify the anomalies or the outliers.
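
As a rough sketch of content discovery, the snippet below locates missing values and flags outliers in one numeric attribute. The sample values are hypothetical, and the 1.5 × IQR rule used to flag outliers is one common convention chosen here as an assumption; the source does not prescribe a specific rule.

```python
import statistics

# Hypothetical attribute; None marks a missing value.
ages = [23, 25, None, 31, 28, 27, None, 95, 26, 30]

# Positions of the missing values.
missing_positions = [i for i, v in enumerate(ages) if v is None]
present = [v for v in ages if v is not None]

# Quartiles of the observed values, then the usual 1.5 * IQR fences.
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged for review, not deleted automatically.
outliers = [v for v in present if v < low or v > high]

print("missing at positions:", missing_positions)
print("outliers:", outliers)
```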




➢ Exploring Connections
We detect connections, similarities, differences and associations
between data sources.

“ Data Cleaning
The process of correcting incorrect, incomplete, duplicate or otherwise erroneous data in a dataset.

It involves identifying data errors, then modifying, updating or deleting the affected entries to correct the data.

It also includes correcting spelling and other typographical errors, incorrect numerical entries and syntax errors, and handling missing values such as empty or null fields that should contain data.
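
The cleaning operations described above might look like the following minimal sketch. Everything in it is an assumption made for illustration: the sample records, the `CITY_FIXES` spelling table, the plausible age range of 0 to 120, and the choice to fill missing ages with the median.

```python
import statistics

# Hypothetical raw records with typical problems: stray spaces, inconsistent
# case, a misspelling, a duplicate, an empty field and an impossible value.
raw = [
    {"name": " Alice ", "city": "tunis", "age": "34"},
    {"name": "Bob", "city": "Tuniss", "age": ""},
    {"name": "Alice", "city": "Tunis", "age": "34"},
    {"name": "Chedi", "city": "Sfax", "age": "-5"},
    {"name": "Dina", "city": "Sfax", "age": "41"},
]

CITY_FIXES = {"Tuniss": "Tunis"}  # known misspellings (assumed table)

def clean(rows):
    cleaned, seen = [], set()
    for row in rows:
        name = row["name"].strip()
        city = row["city"].strip().title()
        city = CITY_FIXES.get(city, city)
        age_text = row["age"].strip()
        age = int(age_text) if age_text else None
        if age is not None and not 0 <= age <= 120:
            age = None  # treat an impossible age as missing
        key = (name, city, age)
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        cleaned.append({"name": name, "city": city, "age": age})
    # Fill the remaining gaps with the median of the known ages.
    known = [r["age"] for r in cleaned if r["age"] is not None]
    if known:
        fill = statistics.median(known)
        for r in cleaned:
            if r["age"] is None:
                r["age"] = fill
    return cleaned

cleaned = clean(raw)
for record in cleaned:
    print(record)
```

Whether a missing value should be filled, flagged or dropped depends on the analysis; the median fill here is only one option.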
“ Data Transformation

This involves converting the data into a format suitable for the desired analysis techniques.

For example, this may involve converting textual data into numerical data, or normalizing data to a common scale.
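
One way to convert textual data into numerical data is to encode each category as a number. The sketch below shows two common encodings, label encoding and one-hot encoding, on a hypothetical `colors` attribute; the source does not name a specific encoding, so both are given as illustrations.

```python
# Hypothetical categorical attribute.
colors = ["red", "green", "blue", "green", "red"]

# Label encoding: one integer per category, in alphabetical order.
categories = sorted(set(colors))
index = {category: i for i, category in enumerate(categories)}
labels = [index[c] for c in colors]

# One-hot encoding: one 0/1 column per category.
one_hot = [[1 if c == category else 0 for category in categories] for c in colors]

print(categories)  # ['blue', 'green', 'red']
print(labels)      # [2, 1, 0, 1, 2]
print(one_hot[0])  # [0, 0, 1]
```

Label encoding implies an order between categories, which is rarely meaningful for names or colors; one-hot encoding avoids that at the cost of more columns.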

Data standardization: this involves converting data into a common format (for example a single date format or a single notation for IP addresses), or converting all data into a single unit of measurement. Numerical values can also be rescaled to a common range, such as 0 to 1.
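
A minimal sketch of both ideas, assuming dates arrive in a small set of known formats and lengths in a few known units. The `DATE_FORMATS` list, the `TO_METRES` table and the function names are hypothetical choices for this example.

```python
from datetime import datetime

# Input formats we expect to meet (assumed); the output is always YYYY-MM-DD.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%Y%m%d"]

def to_iso_date(text):
    """Convert a date written in any known format to YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

# Conversion factors to a single unit of measurement, here metres (assumed).
TO_METRES = {"m": 1.0, "cm": 0.01, "km": 1000.0}

def to_metres(value, unit):
    """Express a length in metres, whatever unit it was recorded in."""
    return value * TO_METRES[unit]

print(to_iso_date("05/03/2024"))  # 2024-03-05
print(to_iso_date("20240305"))    # 2024-03-05
print(to_metres(1.5, "km"))       # 1500.0
```

Raising an error on an unknown format, rather than guessing, keeps ambiguous entries visible so they can be handled during cleaning.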
“ Standardization: Z-Score Normalization

Goal:
The goal of Z-Score Normalization is to rescale the features so that they have the mean and standard deviation of a standard normal distribution: a mean of 0 and a standard deviation of 1.

Method (Z-score normalization):
Z-Score Normalization involves subtracting the mean of the variable from each data point and then dividing by the standard deviation.

Formula:
$z = \frac{x - \mu}{\sigma}$

where $x$ is a data point, $\mu$ is the mean of the variable and $\sigma$ is its standard deviation.
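
The method described above (subtract the mean, divide by the standard deviation) can be sketched as follows. The function name `z_scores` and the sample values are hypothetical, and the population standard deviation is used here as an assumption; the source does not say whether the population or the sample version is meant.

```python
import statistics

def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    if std == 0:
        raise ValueError("constant column: standard deviation is 0")
    return [(v - mean) / std for v in values]

# Hypothetical values with mean 5 and standard deviation 2.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
standardized = z_scores(scores)
print(standardized)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

The guard against a zero standard deviation matters in practice: a constant column cannot be standardized and is usually dropped instead.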
Properties:
After Z-Score Normalization, the transformed data has a mean of 0 and a standard deviation of 1.

Less affected by outliers: standardization is less sensitive to the presence of outliers than Min-Max Scaling, although outliers still shift the mean and the standard deviation.

It is the better choice for data with an unknown range or a non-normal distribution.

It is used for algorithms that are sensitive to outliers.
“ Standardization: Min-Max Normalization

Goal:
The goal of Min-Max Normalization is to scale the values of a variable to a specific range, usually between 0 and 1.

Method (Min-Max Normalization):
For each data point, the minimum value of the variable is subtracted, and the result is divided by the range (the difference between the maximum and minimum values).

Formula:
$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
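
A short sketch of Min-Max Normalization as described above. The function name `min_max`, its `new_min` and `new_max` parameters and the sample values are hypothetical; the parameters only illustrate that the target range is "usually", not always, 0 to 1.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min, the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant column: the range is 0")
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

# Hypothetical values.
prices = [10, 20, 25, 50]
scaled = min_max(prices)
print(scaled)  # [0.0, 0.25, 0.375, 1.0]
```

Because the minimum and maximum set the scale, a single extreme value compresses all other points into a narrow band, which is why the method suits data with a known, bounded range.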

We use Min-Max Normalization for data with a known range.
It preserves the relative ordering of data points.