15CSE401
Machine Learning and Data Mining
Lecture 10: Data Discretization
July 29, 2020
MLDM Team: Dr. Bagavathi Sivakumar P, Sabarish B A, K. Nalinadevi, Bindu K R
Department of CSE, Amrita School of Engineering, Coimbatore

Data Discretization
Data discretization is the process of putting values into buckets so that there are a limited number of possible states.
⦿ The buckets themselves are treated as ordered and discrete values.
⦿ Data binning, or bucketing, is a data pre-processing method used to minimize the effects of small observation errors.
⦿ The original data values are divided into small intervals known as bins, and each value is then replaced by a representative value calculated for its bin.
⦿ This has a smoothing effect on the input data and may also reduce the chance of overfitting on small datasets.
Dept. of CSE, Amrita School of Engineering, Coimbatore July 2020 2
Data Discretization
⦿ Binning
› Top-down split, unsupervised
⦿ Histogram analysis
› Top-down split, unsupervised
⦿ Clustering analysis
› Unsupervised, top-down split or bottom-up merge
⦿ Decision-tree analysis
› Supervised, top-down split
⦿ Correlation (e.g., χ²) analysis
› Unsupervised, bottom-up merge
Note: All the methods can be applied recursively.

Data Discretization / Binning
⦿ Equal-width (distance) partitioning
› Divides the range into N intervals of equal size (a uniform grid)
› If A and B are the lowest and highest values of the attribute, the width of each interval is W = (B − A)/N
› The most straightforward approach, but outliers may dominate the presentation
› Skewed data is not handled well
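
As an illustration of the W = (B − A)/N rule, here is a minimal Python sketch. The function name `equal_width_bins` and the decision to place the maximum value B in the last bin are assumptions made for this example, not something the lecture specifies. The sample data is the sorted price list used in the smoothing examples of this lecture.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals.

    Returns (width, bin_indices). Bin i covers [A + i*W, A + (i+1)*W);
    the last bin additionally includes the maximum value B (an assumption).
    """
    lo, hi = min(values), max(values)   # A and B in the slide's notation
    width = (hi - lo) / n_bins          # W = (B - A) / N
    if width == 0:                      # all values identical: a single bin
        return 0.0, [0] * len(values)
    indices = []
    for v in values:
        i = int((v - lo) / width)
        indices.append(min(i, n_bins - 1))  # clamp B into the last bin
    return width, indices


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width, idx = equal_width_bins(prices, 3)
print(width)  # 10.0
print(idx)    # [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2]
```

With N = 3 the intervals are 10 units wide, and the last interval receives six of the twelve values. Equal-width bins do not guarantee evenly filled bins, which is why skewed data and outliers are a concern.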
Data Discretization / Binning
⦿ Equal-depth (frequency) partitioning
› Divides the range into N intervals, each containing approximately the same number of samples
› Good data scaling
› Managing categorical attributes can be tricky
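
Equal-depth partitioning can be sketched in a few lines of Python. The function name `equal_depth_bins` and the index-based splitting rule are assumptions for illustration; other implementations choose cut points from quantiles and may treat tied values differently. The sample data is the sorted price list used in the smoothing examples of this lecture.

```python
def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    bins = []
    for i in range(n_bins):
        start = i * n // n_bins        # integer cut points spread the
        end = (i + 1) * n // n_bins    # remainder across the bins
        bins.append(ordered[start:end])
    return bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 3)
print(bins)  # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
```

Note that this simple rule splits purely by position, so equal values can end up in different bins when a cut point falls between them.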
Data Binning
Three approaches to perform smoothing:
⦿ Smoothing by bin means: each value in a bin is replaced by the mean value of the bin.
⦿ Smoothing by bin median: each value in a bin is replaced by the median value of the bin.
⦿ Smoothing by bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries, and each value in the bin is replaced by the closest boundary value.
Data Smoothing: Binning Approach
⦿ Sort the given data set.
⦿ Divide the range into N intervals, each containing approximately the same number of samples (equal-depth partitioning).
⦿ Store the mean, median, or boundary values for each bin (one bin per row).
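Smoothing by bin means
Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into equal-depth bins of four values each:
❑ Bin 1: 4, 8, 9, 15
❑ Bin 2: 21, 21, 24, 25
❑ Bin 3: 26, 28, 29, 34
Smoothing by bin means (means 9, 22.75 and 29.25, rounded to the nearest integer):
❑ Bin 1: 9, 9, 9, 9
❑ Bin 2: 23, 23, 23, 23
❑ Bin 3: 29, 29, 29, 29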
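
The bin-means result can be reproduced with a short Python sketch. The function name `smooth_by_bin_means` is hypothetical, and the three bins are assumed to be the equal-depth partition of the price data into groups of four, which is consistent with the smoothed values shown. Rounding to the nearest integer is also an assumption, made to match the slide.

```python
def smooth_by_bin_means(bins):
    """Replace every value in a bin with that bin's arithmetic mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


# Assumed equal-depth partition of the sorted price data (4 values per bin)
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

smoothed = smooth_by_bin_means(bins)
rounded = [[round(v) for v in b] for b in smoothed]
print(smoothed[1])  # [22.75, 22.75, 22.75, 22.75]
print(rounded)      # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
```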
Smoothing by bin boundaries
Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Smoothing by bin boundaries (equal-depth bins of four values each):
❑ Bin 1: 4, 4, 4, 15
❑ Bin 2: 21, 21, 25, 25
❑ Bin 3: 26, 26, 26, 34
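
A minimal Python sketch of smoothing by bin boundaries follows. The function name `smooth_by_bin_boundaries` is hypothetical, the bins are assumed to be the equal-depth partition of the price data into groups of four, and sending a value that is equally close to both boundaries to the lower one is an arbitrary choice that the lecture does not specify.

```python
def smooth_by_bin_boundaries(bins):
    """Replace every value in a bin with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Ties go to the lower boundary (an assumption, not from the lecture)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Assumed equal-depth partition of the sorted price data (4 values per bin)
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

print(smooth_by_bin_boundaries(bins))
# [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

For example, 9 in Bin 1 is 5 away from the lower boundary 4 and 6 away from the upper boundary 15, so it is replaced by 4.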
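Smoothing by bin median
Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Smoothing by bin median (equal-depth bins of four values each; for these even-sized bins the values shown are the upper of the two middle values):
❑ Bin 1: 9, 9, 9, 9
❑ Bin 2: 24, 24, 24, 24
❑ Bin 3: 29, 29, 29, 29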
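
The sketch below compares two median conventions using Python's standard `statistics` module. The bins are assumed to be the equal-depth partition of the price data into groups of four. With an even number of values per bin, `median` averages the two middle values, while `median_high` returns the larger of the two; the latter reproduces the values 9, 24 and 29 shown for this example.

```python
from statistics import median, median_high

# Assumed equal-depth partition of the sorted price data (4 values per bin)
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

# Conventional median: average of the two middle values in an even-sized bin
averaged = [[median(b)] * len(b) for b in bins]

# Upper median: the larger of the two middle values
upper = [[median_high(b)] * len(b) for b in bins]

print([b[0] for b in averaged])  # [8.5, 22.5, 28.5]
print([b[0] for b in upper])     # [9, 24, 29]
```

Which convention to use is a design choice. The upper (or lower) median keeps every smoothed value equal to an actual observation, whereas the averaged median may introduce values that never occurred in the data.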
Discretization Without Supervision: Binning vs. Clustering
Figure panels: Data; Equal width (distance) binning; Equal depth (frequency); K-means clustering leads to