
MDI4001 Machine Learning for Data Science

Module 1: Preparing the data

Dr. Sunil Kumar


Assistant Professor

School of Computer Science and Engineering


Vellore Institute of Technology
Vellore, Tamil Nadu
India

January 10, 2024


Data Preprocessing

• Raw data is noisy, incomplete, and inconsistent. Data preprocessing is required to make sense of the data.
• Techniques:
• Data Cleaning
• Data Integration
• Data Transformation
• Normalization (Standardization)
• Aggregation
• Discretization
• Data Reduction
• Feature subset selection
• Distance/Similarity Calculation
• Dimensionality Reduction
• Sampling
Why is data preprocessing important?

▶ It improves accuracy and reliability.
▶ It makes data consistent.
▶ It reduces the risk of overfitting.
▶ It saves time and effort in modeling.
Data

Data → Data objects → Attribute → Attribute Values


Attribute Values

▶ Nominal/Categorical
Examples: ID numbers, eye color, zip codes
▶ Ordinal
Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
▶ Interval
Examples: calendar dates
▶ Ratio
Examples: length, time, counts
Discrete and Continuous Attributes

▶ Discrete Attribute
▶ Has only a finite or countably infinite set of values
▶ Examples: zip codes, counts, or the set of words in a collection of documents
▶ Often represented as integer variables
▶ Continuous Attribute
▶ Has real numbers as attribute values
▶ Examples: temperature, height, or weight
▶ Practically, real values can only be measured and represented using a finite number of digits
Data quality

Data quality includes many factors:


▶ accuracy
▶ completeness
▶ consistency
▶ timeliness
▶ believability/trust
▶ interpretability.
Data quality problems

▶ Noise and outliers
▶ Noise refers to the modification of original values
▶ Missing values
▶ Duplicate data
Data Quality: Missing Values

▶ Reasons for missing values
▶ Information is not collected
▶ Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
▶ Handling missing values (several strategies are sketched in the code below)
▶ Ignore the tuple
▶ Fill in the missing value manually
▶ Use a global constant to fill in the missing value
▶ Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value
▶ Use the attribute mean or median for all samples belonging to the same class as the given tuple
▶ Use the most probable value to fill in the missing value
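
A minimal pandas sketch of several of these strategies; the data frame, column names, and values are illustrative assumptions, not from the slides:

import numpy as np
import pandas as pd

# Toy data with missing values (NaN plays the role of NULL)
df = pd.DataFrame({
    "marital_status": ["Single", "Married", np.nan, "Divorced"],
    "taxable_income": [125000.0, 100000.0, np.nan, 60000.0],
    "cheat": ["No", "No", "No", "Yes"],
})

# Ignore the tuple: drop rows containing any missing value
dropped = df.dropna()

# Global constant: fill categorical gaps with a sentinel label
df["marital_status"] = df["marital_status"].fillna("Unknown")

# Central tendency: fill a numeric attribute with its median
median_filled = df["taxable_income"].fillna(df["taxable_income"].median())

# Class-conditional fill: mean income within the same 'cheat' class
df["taxable_income"] = df["taxable_income"].fillna(
    df.groupby("cheat")["taxable_income"].transform("mean")
)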
Data Quality: Missing Values

Example data set with missing values (NULL), an outlier income (10000K), an inconsistent spelling (Maried), and a duplicate record (Tid 9):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Maried          100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        10000K          Yes
6    No      NULL            60K             No
7    Yes     Divorced        220K            NULL
8    No      Single          85K             Yes
9    No      Married         90K             No
9    No      Single          90K             No
Data Quality: Outliers

Outliers are data objects with characteristics that are considerably different from those of most other data objects in the data set.
Data Quality: Handle Noise

Data smoothing techniques:

▶ Binning
▶ smoothing by bin means
▶ smoothing by bin medians
▶ smoothing by bin boundaries
▶ Regression: smooth by fitting a regression function
▶ Clustering: detect and remove outliers
Data Quality: Handle Noise (Binning)

▶ Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
▶ Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
▶ Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
▶ Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
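
A short NumPy sketch of equal-frequency binning with smoothing by bin means and by bin boundaries, reproducing the price example above:

import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])  # already sorted
bins = prices.reshape(3, 3)                             # 3 equal-frequency bins

# Smoothing by bin means: every value becomes its bin's mean
by_means = np.repeat(bins.mean(axis=1), 3).astype(int)

# Smoothing by bin boundaries: snap each value to the nearer bin edge
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi).ravel()

print(by_means)   # [ 9  9  9 22 22 22 29 29 29]
print(by_bounds)  # [ 4  4 15 21 21 24 25 25 34]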
Data Quality: Handle Noise (Regression)

▶ Replace noisy or missing values by predicted values
▶ Requires a model of attribute dependencies
▶ Can be used for data smoothing or for handling missing data
Data Transformation

Data transformation refers to the process of converting raw data into a format that is suitable for analysis and modeling.

▶ Smoothing
▶ Normalization
▶ Aggregation
▶ Discretization
▶ Sampling
▶ Generalization
Data Transformation: Normalization

Data normalization involves converting all data variables into a given range.

▶ Recalculating the values for better comparison
▶ Ensure consistent units (monetary, measurements, temperature):
▶ Metric, British, American weights and lengths
▶ Currency: use a common unit (Euro, USD)
▶ Currency adjusted for inflation: the value of money is not the same as 10 years ago
▶ Normalization gives equal weight/importance to each variable
▶ e.g., algorithms based on distance measures (Euclidean distance)
Data Transformation: Normalization

Techniques that are used for normalization:

▶ Min-Max Normalization:
▶ This transforms the original data linearly.
▶ Suppose that min_A is the minimum and max_A is the maximum of an attribute A.
▶ v_i is the value to be mapped, and v'_i is the new value obtained after normalization.
▶ Min-max normalization maps v_i to v'_i in a new, smaller range [new_min_A, new_max_A]:

v'_i = ((v_i − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
Min-Max Normalization:

Example (income): v = 73,600, min_A = 12,000, max_A = 98,000. Mapping to [0, 1]:

v' = (73,600 − 12,000) / (98,000 − 12,000) = 0.716

▶ Often the desired scale range is [0, 1]
▶ Works when you know the limits (minimum and maximum) of the original values
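
A one-function Python sketch of the formula, checked against the income example above:

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Linearly map v from [min_a, max_a] onto [new_min, new_max]
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(round(min_max(73600, 12000, 98000), 3))  # 0.716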
Z-score Normalization:

▶ Values of an attribute A are normalized based on the mean of A and its standard deviation.
▶ Used when the minimum and maximum are not known (i.e., you expect more data in the future but want consistency):

v'_i = (v_i − Ā) / σ_A

▶ Here Ā and σ_A are the mean and standard deviation of attribute A.
▶ All values won't fall between -1 and 1, but most will.
▶ The average of the scaled values should be near 0.
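
A minimal NumPy sketch of z-score normalization; the income values are made up for illustration:

import numpy as np

def z_score(values):
    # Scale values to zero mean and unit standard deviation
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

incomes = np.array([12000, 54000, 73600, 98000])  # hypothetical
scaled = z_score(incomes)
print(round(scaled.mean(), 10))  # ~0, as the slide notes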
Normalization: Decimal Scaling:

▶ Normalizes the values of an attribute by moving the position of their decimal point:

v'_i = v_i / 10^j

where j is the smallest integer such that max(|v'_i|) < 1.
▶ Divide by a constant that brings all values into the acceptable range.
▶ The number of places the decimal point is moved is determined by the maximum absolute value of attribute A.
▶ e.g., for values in the range -40 to 120, decimal scaling divides by 10^3 = 1000. More general rescalings also work: dividing by 120 gives values between -1 and 1, and subtracting 40 then dividing by 80 maps -40 to -1 and 120 to 1.
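
A small Python sketch of decimal scaling under the definition above:

import numpy as np

def decimal_scale(values):
    # Divide by 10^j, with j the smallest integer making max(|v'|) < 1
    values = np.asarray(values, dtype=float)
    j = int(np.floor(np.log10(np.abs(values).max()))) + 1
    return values / 10 ** j, j

scaled, j = decimal_scale([-40, 120])
print(j, scaled)  # 3 [-0.04  0.12]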


Data Aggregation

Combining two or more attributes (or objects) into a single attribute (or object).

Purpose:
▶ Data reduction
▶ Reduce the number of attributes or objects
▶ Results in simpler models
▶ Faster computation of the models
▶ Change of scale
▶ Cities aggregated into regions, states, countries, etc.
▶ Days aggregated into weeks, months, or years
▶ More "stable" data
▶ Aggregated data tends to have less variability
Data Aggregation

▶ Temporal aggregation: summarizing data over intervals of time, such as hours, days, weeks, or months. It is useful for identifying trends and patterns in time series data.
▶ Spatial aggregation: summarizing data according to spatial criteria such as geographical area, postal code, or IP address. It is useful for analyzing location-based data, such as customer demographics, sales territories, or traffic patterns.
▶ Attribute aggregation: summarizing data based on specific attributes or categories, such as product category, customer segment, or user role. It is useful for identifying patterns and trends in categorical data.
▶ Hierarchical aggregation: summarizing data at different levels of a hierarchy, such as organization level, product hierarchy, or geographic hierarchy. It is useful for analyzing data that has a natural hierarchy.
▶ Statistical aggregation: summarizing data using a statistical measure such as mean, median, mode, standard deviation, or percentile. It is useful for analyzing numerical data and identifying outliers and anomalies.
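
A pandas sketch of temporal, attribute, and statistical aggregation; the sales data is synthetic and purely illustrative:

import numpy as np
import pandas as pd

# Hypothetical daily sales for one year
days = pd.date_range("2023-01-01", "2023-12-31", freq="D")
rng = np.random.default_rng(0)
daily = pd.Series(rng.normal(100, 15, len(days)), index=days)

# Temporal aggregation: daily values -> monthly totals
monthly = daily.resample("MS").sum()        # "MS" = month start

# Statistical aggregation: summary measures per month
stats = daily.resample("MS").agg(["mean", "median", "std"])

# Attribute aggregation: totals per product category
sales = pd.DataFrame({"category": ["A", "B", "A", "B"],
                      "amount": [10, 20, 30, 40]})
per_category = sales.groupby("category")["amount"].sum()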
Data Aggregation

Example: standard deviation of Australian precipitation

▶ The original slides compare two histograms: the standard deviation of average monthly precipitation (left) and of average yearly precipitation (right) for the same locations.
▶ The average yearly precipitation has less variability than the average monthly precipitation, illustrating that aggregated data tends to be more stable.
Data Discretization

Raw values of a numeric attribute are replaced by interval labels.

Purpose:
▶ Some ML algorithms only accept discrete attributes
▶ May improve understandability of patterns
▶ For example, the values of the age attribute can be replaced by interval labels such as (0-10, 11-20, ...) or (kid, youth, adult, senior).

Methods:
▶ Using binning
▶ Using histogram analysis
▶ Using cluster analysis
▶ Using decision trees
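
A pandas sketch of the age example; the cut points and labels are one possible choice, not prescribed by the slides:

import pandas as pd

ages = pd.Series([3, 9, 15, 27, 44, 67, 81])

# Binning into hand-chosen intervals with ordinal labels
labels = pd.cut(ages, bins=[0, 10, 20, 60, 100],
                labels=["kid", "youth", "adult", "senior"])

# Equal-frequency (quantile) binning as an alternative
quartiles = pd.qcut(ages, q=4)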
Data Discretization: Decision Tree

▶ Use a decision tree to identify the optimal splitting points that determine the bins or contiguous intervals:
▶ A decision tree evaluates all possible values of a feature and selects the cut-point that maximizes the class separation, using a performance metric like entropy or Gini impurity.
▶ It then repeats the process for each node of the first data split and for each node of the subsequent splits, until a stopping criterion is reached.
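
A scikit-learn sketch of this idea on synthetic data: fit a shallow tree on one feature, then read its split thresholds back as cut points. The feature and class labels are assumptions for illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, 300).reshape(-1, 1)            # one numeric feature
y = ((x.ravel() > 35) ^ (x.ravel() > 70)).astype(int)  # synthetic classes

# Shallow tree: each leaf corresponds to one interval of the feature
tree = DecisionTreeClassifier(max_depth=2, criterion="entropy").fit(x, y)

# Internal-node thresholds are the discretization cut points
# (leaves carry the sentinel threshold -2 and are skipped)
cuts = sorted(t for t in tree.tree_.threshold if t != -2)
print(cuts)  # two cut points, near 35 and 70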
Data Sampling:

▶ Data may be very large (Big Data)
▶ Sampling is a data reduction technique
▶ It allows a large data set to be represented by a much smaller random sample
Data Sampling:

The key principle for effective sampling:

▶ Using a sample will work almost as well as using the entire data set, if the sample is representative
▶ A sample is representative if it has approximately the same properties (of interest) as the original set of data
▶ Otherwise we say that the sample introduces some bias
▶ What happens if we take a sample from the university campus for the analysis?
Data Sampling:

Types of sampling:
▶ Sampling without replacement
▶ Sampling with replacement
▶ Stratified sampling
▶ Split the data into several partitions (strata), then draw random samples from each partition
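
A pandas sketch of the three sampling types on a made-up imbalanced data frame:

import pandas as pd

df = pd.DataFrame({"income": range(1000),
                   "cheat": ["Yes"] * 100 + ["No"] * 900})

# Sampling without replacement (each row drawn at most once)
no_repl = df.sample(n=100, replace=False, random_state=42)

# Sampling with replacement (rows may repeat)
with_repl = df.sample(n=100, replace=True, random_state=42)

# Stratified sampling: draw 10% from each 'cheat' stratum
stratified = df.groupby("cheat", group_keys=False).sample(frac=0.1,
                                                          random_state=42)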
Feature Subset Selection:

Redundant features
▶ Duplicate much or all of the information contained in one or more other attributes
▶ Example: purchase price of a product and the amount of sales tax paid

Irrelevant features
▶ Contain no information that is useful for the data mining task at hand
▶ Example: students' ID is often irrelevant to the task of predicting students' GPA
Heuristic Feature Selection Methods:

▶ There are 2^d possible feature subsets of d features
▶ Several heuristic feature selection methods:
▶ Best single features under the feature independence assumption: choose by significance tests (information gain, entropy)
▶ Step-wise forward selection (see the sketch after this list):
▶ The best single feature is picked first
▶ Then the next best feature conditioned on the first, and so on
▶ Step-wise backward elimination:
▶ Repeatedly eliminate the worst feature
▶ Combined forward selection and backward elimination
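
A scikit-learn sketch of step-wise forward selection (one of several tools implementing it); the iris data set and logistic-regression scorer are arbitrary choices for illustration:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Greedily add the feature that most improves cross-validated score;
# direction="backward" gives step-wise backward elimination instead
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=2,
                                direction="forward").fit(X, y)
print(sfs.get_support())  # boolean mask over the 4 iris features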
Heuristic Feature Selection Methods:

Decision tree induction:

▶ Decision tree induction constructs a flowchart-like structure
▶ where each internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction
▶ At each node, the algorithm chooses the "best" attribute to partition the data into individual classes
▶ A tree is constructed from the given data
▶ All attributes that do not appear in the tree are assumed to be irrelevant
Decision tree induction: the weather data example (figure from the original slides not reproduced).
Attribute Creation (Feature Generation):

▶ Create new attributes (features) that can capture the important information in a data set more effectively than the original ones
▶ Improve accuracy
▶ Improve understanding of the structure of high-dimensional data
▶ For example, add the attribute area based on the attributes height and width
Attribute Creation (Feature Generation):

Three general methodologies:

▶ Attribute extraction
▶ Domain-specific
▶ Example: extracting edges from images
▶ Mapping data to a new space
▶ E.g., Fourier transformation, wavelet transformation, manifold approaches
▶ Attribute construction
▶ Combining features
▶ Data discretization
▶ Example: dividing mass by volume to get density
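
A two-line pandas sketch of attribute construction using the slide's own examples (area from height and width, density from mass and volume); the numbers are placeholders:

import pandas as pd

df = pd.DataFrame({"height": [2.0, 3.0], "width": [4.0, 5.0],
                   "mass": [10.0, 24.0], "volume": [2.0, 4.0]})

df["area"] = df["height"] * df["width"]    # constructed attribute
df["density"] = df["mass"] / df["volume"]  # mass divided by volume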
Conclusion: Data processing

▶ Data Quality
▶ Data Quality Problems
▶ Data Cleaning
▶ Data Transformation
▶ Data Reduction
References

1. T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003.
2. https://builtin.com/data-science/step-step-explanation-principal-component-analysis
3. Ethem Alpaydin. Introduction to Machine Learning, Fourth Edition. MIT Press, 2020.
4. Hadley Wickham and Garrett Grolemund. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly, 2017.
5. J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2011.
6. Carl Shan, Henry Wang, William Chen, and Max Song. The Data Science Handbook: Advice and Insights from 25 Amazing Data Scientists. 2016.
7. G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning with Applications in R. Springer, 2013.
