Data Science (BTCS-616-18)

Module 2
Lecture 3
Presented By
Dr. Rini Saxena
Professor (Computer Science & Engineering)
CEC Jhanjeri, Mohali
rini.cgctc@gmail.com
Preview of Last Lecture
Quantitative data collection:
Descriptive
Correlational
Experimental
Quasi-experimental

Qualitative data collection:
Interviews
Questionnaires and surveys
Observations
Documents and records
Focus groups
Oral histories
Today’s Content
Data Preprocessing
Data Pre-processing
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format.
Real-world data is often incomplete, inconsistent, and error-prone; data preprocessing is a proven method of resolving such issues.
Data can come in many different forms: structured tables, images, audio files, videos, etc.
Machines do not understand free text, image, or video data as it is; they understand 1s and 0s.
Data Pre-processing
So simply showing our machine learning model a slideshow of all our images and expecting it to get trained from that will not work; the data must first be brought into a form the machine can work with.

Why use Data Pre-processing?
In the real world, data is generally:
Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data.
Noisy: containing errors or outliers.
Inconsistent: containing discrepancies in codes or names.
Data Pre-processing
In any Machine Learning process, Data Preprocessing is the step in which the data gets transformed, or encoded, to bring it to such a state that the machine can easily parse it.
In other words, the features of the data can now be easily interpreted by the algorithm.
A dataset can be viewed as a collection of data objects, which are often also called records, points, vectors, patterns, events, cases, samples, observations, or entities.
Data Pre-processing
Data objects are described by a number of features that capture the basic characteristics of an object, such as the mass of a physical object or the time at which an event occurred.
Features are often called variables, characteristics, fields, attributes, or dimensions.
For instance, color, mileage, and power can be considered features of a car. There are different types of features that we can come across when we deal with data.
Data Pre-processing

Categorical: Features whose values are taken from a defined set of values.
For instance, days of the week: {Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday} is categorical because its value is always taken from this set.
Another example is the Boolean set: {True, False}.
Data Pre-processing
Numerical: Features whose values are continuous or integer-valued.
They are represented by numbers and possess most of the properties of numbers.
For instance, the number of steps you walk in a day, or the speed at which you are driving your car.
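To make the two feature types concrete, here is a minimal sketch using pandas; the car dataset, its column names, and its values are hypothetical.

```python
import pandas as pd

# A tiny, hypothetical car dataset mixing both feature types.
cars = pd.DataFrame({
    "color":   ["red", "blue", "red"],   # categorical (values from a defined set)
    "mileage": [14.2, 18.5, 12.9],       # numerical (continuous)
    "power":   [110, 95, 150],           # numerical (integer-valued)
})

# pandas infers one dtype per column: object for the categorical
# feature, float64 and int64 for the numerical ones.
print(cars.dtypes)
```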
Data Pre-processing Steps
Not every step is applicable to every problem; which steps are required depends heavily on the data we are working with, so perhaps only a few may be needed for your dataset. Generally, they are:
Data Quality Assessment
Feature Aggregation
Feature Sampling
Dimensionality Reduction
Feature Encoding
Data Pre-processing Steps
1. Data Quality Assessment
Data is often taken from multiple sources, which are normally not too reliable, and in different formats; more than half our time is consumed dealing with data quality issues when working on a machine learning problem.
It is simply unrealistic to expect that the data will be perfect.
There may be problems due to human error, limitations of measuring devices, or flaws in the data collection process.
Data Pre-processing Steps
Missing values:
It is very common to have missing values in your dataset.
They may have appeared during data collection or because of some data validation rule, but regardless, missing values must be taken into consideration.
Eliminate rows with missing data: a simple and sometimes effective strategy, which fails if many objects have missing values. If a feature has mostly missing values, that feature itself can also be eliminated.
Data Pre-processing Steps
Estimate missing values :
If only a reasonable percentage of values are missing,
then we can also run simple interpolation methods to
fill in those values.

However, most common method of dealing with


missing values is by filling them in with the mean,
median or mode value of the respective feature.
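Both strategies, elimination and estimation, can be sketched in a few lines of pandas; the DataFrame below and its columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with gaps in a numeric and a categorical column.
df = pd.DataFrame({
    "age":  [25, np.nan, 31, 29, np.nan],
    "city": ["Delhi", "Mohali", None, "Delhi", "Mohali"],
})

# Strategy 1: eliminate rows that contain missing data.
dropped = df.dropna()

# Strategy 2: estimate missing values instead of dropping them.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())         # mean for numeric
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])   # mode for categorical

# Simple interpolation is another option for ordered numeric data.
interpolated = df["age"].interpolate()
```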
Data Pre-processing Steps
Inconsistent values:
We know that data can contain inconsistent values; for instance, an 'Address' field that contains a phone number.
This may be due to human error, or the information may have been misread while being scanned from a handwritten form.
It is therefore always advisable to perform data assessment, such as checking what the data type of each feature should be and whether it is the same for all the data objects.
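One cheap form of such assessment is to check that every value in a column parses as the expected type; a minimal sketch with pandas, using a hypothetical 'phone' column:

```python
import pandas as pd

# Hypothetical column that should contain phone numbers only.
df = pd.DataFrame({"phone": ["9876543210", "not-a-number", "9123456789"]})

# Coerce to numeric: entries that do not parse become NaN,
# which exposes inconsistent values for manual review.
parsed = pd.to_numeric(df["phone"], errors="coerce")
print(df[parsed.isna()])  # rows whose 'phone' value is inconsistent
```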
Data Pre-processing Steps
Duplicate values:
A dataset may include data objects which are duplicates of one another.
This may happen when, say, the same person submits a form more than once.
The term deduplication is often used to refer to the process of dealing with duplicates.
In most cases the duplicates are removed, so as not to give that particular data object an advantage or bias when running machine learning algorithms.
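A minimal deduplication sketch with pandas; the form-submission DataFrame is hypothetical:

```python
import pandas as pd

# Hypothetical form submissions, one of which is a duplicate.
forms = pd.DataFrame({
    "name":  ["Asha", "Ravi", "Asha"],
    "email": ["asha@x.com", "ravi@x.com", "asha@x.com"],
})

# Keep only the first occurrence of each fully identical row.
deduplicated = forms.drop_duplicates()

# Or deduplicate on a key column only, e.g. one submission per email.
by_email = forms.drop_duplicates(subset="email", keep="first")
```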
Data Pre-processing Steps
2. Feature Aggregation
Feature aggregation combines values into aggregates in order to put the data in a better perspective.
Think of transactional data: suppose we record the daily sales of a product at various store locations over the year.
Aggregating these transactions into single store-wide monthly or yearly totals reduces the hundreds or potentially thousands of transactions that occur daily at a specific store, thereby reducing the number of data objects.
Data Pre-processing Steps
This results in reduction of memory consumption and
processing time

Aggregations provide us with a high-level view of the


data as the behaviour of groups or aggregates is more
stable than individual data objects
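The store-wise monthly aggregation described above might be sketched with pandas as follows; the transactions DataFrame and its columns are assumptions for illustration:

```python
import pandas as pd

# Hypothetical daily transactions for two stores.
transactions = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "date":  pd.to_datetime(["2024-01-03", "2024-01-17",
                             "2024-01-05", "2024-02-09"]),
    "sales": [120.0, 80.0, 60.0, 95.0],
})

# Collapse daily rows into one row per store per month,
# reducing the number of data objects.
monthly = (transactions
           .groupby(["store", pd.Grouper(key="date", freq="MS")])["sales"]
           .sum()
           .reset_index())
print(monthly)
```

pd.Grouper with a monthly frequency does the calendar bucketing, so each (store, month) pair becomes a single aggregated object.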
Data Pre-processing Steps
3. Feature Sampling
Sampling is a very common method for selecting a subset of the dataset we are analyzing.
In most cases, working with the complete dataset can turn out to be too expensive given memory and time constraints.
Using a sampling algorithm can help us reduce the size of the dataset to a point where we can use a better, but more expensive, machine learning algorithm.
Data Pre-processing Steps
The key principle here is that the sampling should be done in such a manner that the generated sample has approximately the same properties as the original dataset, meaning that the sample is representative.
This involves choosing the correct sample size and sampling strategy. Simple Random Sampling dictates that there is an equal probability of selecting any particular entity.
Data Pre-processing Steps
It has two main variations:
Sampling without Replacement: as each item is selected, it is removed from the set of all objects that form the total dataset.
Sampling with Replacement: items are not removed from the total dataset after getting selected, which means they can be selected more than once.
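Both variations are available directly through the replace flag of pandas' DataFrame.sample; a minimal sketch on a hypothetical dataset:

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})  # hypothetical dataset

# Without replacement: each row can appear at most once in the sample.
without = df.sample(n=10, replace=False, random_state=42)

# With replacement: the same row may be selected more than once.
with_repl = df.sample(n=10, replace=True, random_state=42)
```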
Data Pre-processing Steps
Although Simple Random Sampling provides two great sampling techniques, it can fail to output a representative sample when the dataset includes object types whose ratios vary drastically.
This causes problems when the sample needs a proper representation of all object types, for example when we have an imbalanced dataset.
An imbalanced dataset is one where the number of instances of one or more classes is significantly higher than that of the others, leading to an imbalance and creating rarer classes.
Data Pre-processing Steps
It is critical that the rarer classes be adequately represented in the sample.
In these cases there is another sampling technique we can use, called Stratified Sampling, which begins with predefined groups of objects.
There are different versions of Stratified Sampling too, with the simplest version suggesting that an equal number of objects be drawn from all the groups, even though the groups are of different sizes.
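A minimal sketch of stratified sampling with pandas, drawing the same fraction from each class of a hypothetical imbalanced dataset:

```python
import pandas as pd

# Hypothetical imbalanced dataset: 2 'rare' vs. 8 'common' objects.
df = pd.DataFrame({
    "feature": range(10),
    "label":   ["rare"] * 2 + ["common"] * 8,
})

# Stratified sampling: draw from each predefined group separately,
# here the same fraction per class, so the rare class stays represented.
stratified = df.groupby("label").sample(frac=0.5, random_state=0)
print(stratified["label"].value_counts())
```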
Data Pre-processing Steps
4. Dimensionality Reduction
Most real-world datasets have a large number of features.
For example, in an image processing problem we might have to deal with thousands of features, also called dimensions.
As the name suggests, dimensionality reduction aims to reduce the number of features, but not simply by selecting a sample of features from the feature set; that is something else, namely Feature Subset Selection or simply Feature Selection.
Data Pre-processing Steps
Conceptually, dimension refers to the number of geometric planes the dataset lies in, which can be so high that the dataset cannot be visualized with pen and paper.
The greater the number of such planes, the greater the complexity of the dataset.
Data Pre-processing Steps
The Curse of Dimensionality
This refers to the phenomenon that data analysis tasks generally become significantly harder as the dimensionality of the data increases.
As the dimensionality increases, the number of planes occupied by the data increases, adding more and more sparsity to the data, which becomes difficult to model and visualize.
Data Pre-processing Steps
The basic objective of the techniques used for this purpose is to reduce the dimensionality of a dataset by creating new features which are combinations of the old features.
In other words, the higher-dimensional feature space is mapped to a lower-dimensional feature space.
Principal Component Analysis and Singular Value Decomposition are two widely accepted techniques.
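As an illustration, here is a minimal PCA sketch with scikit-learn on randomly generated data; the shapes and component count are arbitrary choices:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 100 objects described by 50 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))

# Map the 50-dimensional feature space to 5 new features,
# each a linear combination of the old ones.
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (100, 5)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```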
Data Pre-processing Steps
A few major benefits of dimensionality reduction are:
Data analysis algorithms work better if the dimensionality of the dataset is lower, mainly because irrelevant features and noise have been eliminated.
The models built on top of lower-dimensional data are more understandable and explainable.
The data may also become easier to visualize! Features can always be taken in pairs or triplets for visualization purposes, which makes more sense when the feature set is not that big.
Data Pre-processing Steps
5. Feature Encoding
The whole purpose of data preprocessing is to encode the data in order to bring it to such a state that the machine can understand it.
Feature encoding is basically performing transformations on the data such that it can easily be accepted as input for machine learning algorithms while still retaining its original meaning.
Data Pre-processing Steps
There are some general norms or rules which are followed when performing feature encoding. For Categorical variables:
Nominal: any one-to-one mapping that retains the meaning can be used. For instance, a permutation of values, as in One-Hot Encoding.
Ordinal: an order-preserving change of values. The notion of small, medium, and large can be represented equally well by a new function, that is, <new_value = f(old_value)>; for example, {0, 1, 2} or maybe {1, 2, 3}.
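A minimal sketch of both encodings with pandas; the 'day' and 'size' columns are hypothetical:

```python
import pandas as pd

# Hypothetical nominal ('day') and ordinal ('size') features.
df = pd.DataFrame({
    "day":  ["Monday", "Wednesday", "Monday"],
    "size": ["small", "large", "medium"],
})

# Nominal: one-hot encoding, a one-to-one mapping onto 0/1 columns.
one_hot = pd.get_dummies(df["day"], prefix="day")

# Ordinal: an order-preserving map, e.g. {small, medium, large} -> {0, 1, 2}.
order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(order)
```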
Data Pre-processing Steps
For Numeric variables:
Interval: a simple mathematical transformation, such as the equation <new_value = a*old_value + b>, with a and b being constants.
For example, the Fahrenheit and Celsius scales, which differ in their zero values and in the size of a unit, can be encoded in this manner.
Data Pre-processing Steps
Ratio : These variables can be scaled to any particular
measures, of course while still maintaining the
meaning and ratio of their values. Simple
mathematical transformations work in this case as
well,

like the transformation <new_value = a*old_value>.


For, length can be measured in meters or feet, money
can be taken in different currencies.
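Both transformation families can be illustrated with two tiny Python functions; the unit conversions are standard, but the functions themselves are just for illustration:

```python
# Interval: new_value = a * old_value + b (zero point and unit size differ).
def celsius_to_fahrenheit(c: float) -> float:
    return 1.8 * c + 32.0  # a = 1.8, b = 32

# Ratio: new_value = a * old_value (only the unit of measure changes).
def meters_to_feet(m: float) -> float:
    return 3.28084 * m     # a = 3.28084

print(celsius_to_fahrenheit(100.0))  # 212.0
print(meters_to_feet(2.0))           # 6.56168
```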
Data Pre-processing Steps
Train / Validation / Test Split
After feature encoding is done, our dataset is ready for the exciting machine learning algorithms!
But before we start deciding which algorithm should be used, it is always advisable to split the dataset into two, or sometimes three, parts.
Machine Learning algorithms, or any algorithms for that matter, first have to be trained on the available data distribution and then validated and tested before they can be deployed to deal with real-world data.
Data Pre-processing Steps
Training data : This is the part on which your
machine learning algorithms are actually trained to
build a model. The model tries to learn the dataset and
its various characteristics and intricacies, which also
raises the issue of Overfitting v/s Underfitting.

Validation data : This is the part of the dataset which


is used to validate our various model fits. In simpler
words, we use validation data to choose and improve
our model hyperparameters. The model does
not learn the validation set but uses it to get to a better
state of hyperparameters.
Data Pre-processing Steps
Test data: this part of the dataset is used to test our model hypothesis.
It is left untouched and unseen until the model and hyperparameters are decided, and only then is the model applied to the test data to get an accurate measure of how it would perform when deployed on real-world data.
Data Pre-processing
Split Ratio: data is split according to a split ratio which is highly dependent on the type of model we are building and on the dataset itself.
If our dataset and model require a lot of training, we use a larger chunk of the data just for training purposes (usually the case); for instance, training on textual, image, or video data usually involves thousands of features!
Data Pre-processing
If the model has a lot of hyperparameters that can be
tuned, then keeping a higher percentage of data for
the validation set is advisable.

Models with less number of hyperparameters are easy
to tune and update, and so we can keep a smaller
validation set.

Like many other things in Machine Learning, the split


ratio is highly dependent on the problem we are trying
to solve and must be decided after taking into account
all the various details about the model and the dataset
in hand.
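One common way to realize a three-way split is to call scikit-learn's train_test_split twice; the 70/15/15 ratio below is only an illustrative choice, and the data is hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # hypothetical features
y = np.arange(100)                  # hypothetical targets

# First carve off the test set, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # roughly 70 / 15 / 15
```

The second test_size is rescaled (0.15 / 0.85) because the second split operates only on the remaining 85% of the data.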
Data Pre-processing
Machine Learning Process
Steps in Data Preprocessing in Practice
Step 1: Import the libraries
Step 2: Import the dataset
Step 3: Check out the missing values
Step 4: See the categorical values
Step 5: Split the dataset into Training and Test sets
Step 6: Feature Scaling
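Put together, these six steps look roughly like the following sketch; the file name data.csv and the column names 'category' and 'target' are hypothetical placeholders:

```python
# Step 1: import the libraries.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Step 2: import the dataset.
df = pd.read_csv("data.csv")

# Step 3: check out the missing values, then fill the numeric gaps.
print(df.isnull().sum())
df = df.fillna(df.median(numeric_only=True))

# Step 4: see (and encode) the categorical values.
df = pd.get_dummies(df, columns=["category"])

# Step 5: split the dataset into training and test sets.
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 6: feature scaling (fit on the training set only to avoid leakage).
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```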
