3-Preparing The Data-10-01-2024
Data
Attribute Values
▶ Nominal / Categorical
Examples: ID numbers, eye color, zip codes
▶ Ordinal
Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
▶ Interval
Examples: calendar dates
▶ Ratio
Examples: length, time, counts
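The four attribute types can be illustrated with a short sketch of which operations are meaningful on each (the variable names and values below are illustrative, not from a real dataset):

```python
from datetime import date

# Nominal/categorical: only equality tests are meaningful.
eye_color = "brown"
assert eye_color == "brown"
# A comparison like eye_color < "blue" would be meaningless.

# Ordinal: order is meaningful, but differences between ranks are not.
height_order = {"short": 0, "medium": 1, "tall": 2}
assert height_order["short"] < height_order["tall"]

# Interval: differences are meaningful, but there is no true zero,
# so ratios are not (the year 2000 is not "twice" the year 1000).
days_between = (date(2024, 1, 10) - date(2024, 1, 1)).days

# Ratio: a true zero exists, so ratios make sense.
length_a, length_b = 10.0, 5.0
assert length_a / length_b == 2.0   # "twice as long" is meaningful
```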
Discrete and Continuous Attributes
▶ Discrete Attribute
▶ Has only a finite or countably infinite set of values
▶ Examples: zip codes, counts, or the set of words in a collection of documents
▶ Often represented as integer variables.
▶ Continuous Attribute
▶ Has real numbers as attribute values
▶ Examples: temperature, height, or weight.
▶ Practically, real values can only be measured and represented using a finite number of digits.
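A minimal sketch of the distinction, following the slide's rule of thumb that discrete attributes are integer-valued and continuous attributes take real values (this simple type check is an assumption for illustration, not a general-purpose detector):

```python
def attribute_kind(values):
    """Label a column 'discrete' or 'continuous' by its value types."""
    # Integer-valued columns (zip codes, counts) are treated as discrete;
    # anything with real-valued entries is treated as continuous.
    if all(isinstance(v, int) for v in values):
        return "discrete"
    return "continuous"

zip_codes = [90210, 10001, 60601]
temperatures = [21.5, 22.1, 19.8, 20.0]
print(attribute_kind(zip_codes))      # discrete
print(attribute_kind(temperatures))   # continuous
```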
Data Quality: Handle Noise (Binning)
▶ Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
▶ Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
▶ Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
▶ Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
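The worked example above can be reproduced with a short sketch: equal-frequency binning, then smoothing by bin means and by bin boundaries (each value moves to the closer of its bin's minimum or maximum):

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted data into n_bins bins of (roughly) equal size."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value with its bin's mean."""
    return [[round(sum(b) / len(b)) for _ in b] for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min or max."""
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
            for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```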
Data Quality: Handle Noise (Regression)
▶ Replace noisy or missing values by predicted values
▶ Requires a model of attribute dependencies
▶ Can be used for data smoothing or for handling missing data
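As a sketch of this idea, simple linear regression of one attribute on another can supply predicted values for missing entries. The attribute names and numbers below are illustrative assumptions, not from the slides:

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return a, mean_y - a * mean_x

# Known (area, price) pairs provide the model of attribute dependencies...
areas  = [50, 60, 80, 100]
prices = [150, 180, 240, 300]
a, b = fit_line(areas, prices)

# ...and predicted values then fill in the missing entries (None).
raw = [(70, None), (90, 275)]
filled = [(x, a * x + b if y is None else y) for x, y in raw]
print(filled)   # the missing price at area 70 is replaced by its prediction
```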
Data Transformation
Min-Max Normalization:
▶ Works when you know the limits (minimum and maximum) of the original values.
▶ v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA, mapping values of attribute A into a new range [new_minA, new_maxA].
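A minimal sketch of min-max normalization, mapping an attribute into a target range once its minimum and maximum are known (the income values are illustrative):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly from [min, max] to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

incomes = [12000, 73600, 98000]       # illustrative values
normalized = min_max_normalize(incomes)
print(normalized)                     # all results fall in [0, 1]
```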
Z-score Normalization:
▶ Values of an attribute A are normalized based on the mean of A and its standard deviation: v' = (v − Ā) / σA
▶ Here Ā and σA are the mean and standard deviation of attribute A.
▶ Used when the minimums and maximums are not known (i.e., you expect new data in the future but want consistency).
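A short sketch of z-score normalization, v' = (v − Ā) / σA, using the population standard deviation (the score values are illustrative):

```python
def z_score_normalize(values):
    """Normalize values to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5  # population std
    return [(v - mean) / std for v in values]

scores = [54000, 16000, 9500, 73600]    # illustrative values
normalized = z_score_normalize(scores)
# The normalized attribute has mean 0 and standard deviation 1.
print([round(z, 2) for z in normalized])
```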
Data Sampling: Types of Sampling
▶ Sampling without replacement
▶ Sampling with replacement
▶ Stratified sampling
▶ Split the data into several partitions (strata), then draw random samples from each partition
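The three sampling schemes can be sketched on a toy dataset; the class labels and sizes below are illustrative assumptions:

```python
import random

random.seed(0)
data = [("A", i) for i in range(8)] + [("B", i) for i in range(2)]

# Sampling without replacement: each record is drawn at most once.
without = random.sample(data, 5)

# Sampling with replacement: the same record may be drawn repeatedly.
with_repl = [random.choice(data) for _ in range(5)]

# Stratified sampling: partition into strata by class label, then draw
# a proportional random sample from each partition.
def stratified_sample(records, fraction):
    strata = {}
    for label, value in records:
        strata.setdefault(label, []).append((label, value))
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(random.sample(group, k))
    return sample

strat = stratified_sample(data, 0.5)
print(strat)   # proportional: ~4 records from class A, 1 from class B
```

Stratified sampling matters when a class is rare: a plain random sample of a skewed dataset can miss the minority class entirely, while sampling per stratum guarantees it is represented.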
Feature Subset Selection:
Redundant features
▶ Duplicate much or all of the information contained in one or more other attributes
▶ Example: the purchase price of a product and the amount of sales tax paid
Irrelevant features
▶ Contain no information that is useful for the data mining task at hand
▶ Example: a student's ID is often irrelevant to the task of predicting the student's GPA
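One common way to flag a redundant feature is a correlation check: if two attributes are (nearly) perfectly correlated, one of them carries no extra information. The sketch below uses the slide's price/sales-tax example with an assumed 8% tax rate:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

price = [100.0, 250.0, 40.0, 180.0]   # illustrative purchase prices
tax   = [p * 0.08 for p in price]     # sales tax, fully determined by price

# A correlation near 1 suggests one of the two features is redundant.
print(round(pearson(price, tax), 4))
```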
Heuristic Feature Selection Methods:
Summary
▶ Data Quality
▶ Data Quality Problems
▶ Data Cleaning
▶ Data Transformation
▶ Data Reduction