
15CSE401

Machine Learning and Data Mining


Lecture 10 Data Discretization
July 29, 2020
MLDM Team
Dr. Bagavathi Sivakumar P
Sabarish B A
K. Nalinadevi
Bindu K R
Department of CSE
Amrita School of Engineering
Coimbatore
Data Discretization
Data discretization is the process of putting values into buckets so that there
is a limited number of possible states.
⦿ The buckets themselves are treated as ordered, discrete values.
⦿ Data binning (or bucketing) is a data pre-processing method used to
minimize the effects of small observation errors.
⦿ The original data values are divided into small intervals known as
bins, and each value is then replaced by a representative value calculated
for its bin.
⦿ This has a smoothing effect on the input data and may also
reduce the chance of overfitting on small datasets.

Dept. of CSE, Amrita School of Engineering, Coimbatore July 2020 2


Data Discretization
⦿ Binning
› Top-down split, unsupervised
⦿ Histogram analysis
› Top-down split, unsupervised
⦿ Clustering analysis
› Unsupervised, top-down split or bottom-up merge
⦿ Decision-tree analysis
› Supervised, top-down split
⦿ Correlation (e.g., χ²) analysis
› Supervised (uses class labels), bottom-up merge
Note: All the methods can be applied recursively
Data Discretization /Binning
⦿ Equal-width (distance) partitioning
› Divides the range into N intervals of equal size (a uniform grid)
› If A and B are the lowest and highest values of the attribute,
the width of each interval is W = (B – A)/N
› The most straightforward approach, but outliers may dominate
the presentation
› Skewed data is not handled well
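
The width formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `equal_width_bins` is hypothetical, the intervals are assumed to be closed on the left, and the maximum value B is assumed to belong to the last bin. The price values are the ones used in the smoothing examples of this lecture.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of width W = (B - A) / N."""
    low, high = min(values), max(values)
    if high == low:
        # Degenerate range: every value falls into bin 0.
        return [0 for _ in values], 0.0
    width = (high - low) / n_bins
    indices = []
    for v in values:
        # Clamp so that the maximum value B lands in the last bin (assumption).
        indices.append(min(int((v - low) / width), n_bins - 1))
    return indices, width


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
indices, width = equal_width_bins(prices, 3)
print(width)    # 10.0
print(indices)  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2]
```

With A = 4, B = 34 and N = 3 the intervals are [4, 14), [14, 24) and [24, 34], so the bins hold 3, 3 and 6 values. The uneven counts show why skewed data is handled poorly by a uniform grid.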



Data Discretization /Binning
⦿ Equal-depth (frequency) partitioning
› Divides the range into N intervals, each containing
approximately the same number of samples
› Good data scaling
› Managing categorical attributes can be tricky
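
Equal-depth partitioning can be sketched as slicing the sorted values into groups of nearly equal size. The function name `equal_depth_bins` is hypothetical, and giving any leftover values to the first bins is an assumption of this sketch; other conventions are possible. The price values are the ones used in the smoothing examples of this lecture.

```python
def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins groups of nearly equal size."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value when the
        # sample count is not a multiple of n_bins (assumption).
        size = base + (1 if i < extra else 0)
        bins.append(ordered[start:start + size])
        start += size
    return bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(prices, 3)
print(bins)  # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
```

Because the split is purely by count, equal values can end up in different bins when a cut falls between them; a practical implementation may want to handle such ties explicitly.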



Data binning
Three approaches to smoothing:

⦿ Smoothing by bin means: each value in a bin is replaced by the mean
value of the bin.

⦿ Smoothing by bin medians: each value in a bin is replaced by the
median value of the bin.

⦿ Smoothing by bin boundaries: the minimum and maximum values in a given
bin are identified as the bin boundaries. Each value in the bin is
then replaced by the closest boundary value.



Data Smoothing-Binning Approach
⦿ Sort the values of the given data set.
⦿ Divide the range into N intervals, each containing approximately
the same number of samples (equal-depth partitioning).
⦿ Store the mean, median, or boundary values for each bin, one bin per row.
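
The three smoothing rules can be sketched as follows, starting from bins that have already been produced by equal-depth partitioning of the sorted price data. The function names are hypothetical. Two details are assumptions of this sketch: `statistics.median` averages the two middle values of an even-sized bin, and a value exactly halfway between the two boundaries is assigned to the lower one. The means are left unrounded here.

```python
from statistics import mean, median


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value in a bin with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Ties go to the lower boundary (assumption of this sketch).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Equal-depth bins of the sorted price data: 12 values, 3 bins of 4.
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

print(smooth_by_means(bins)[1])       # [22.75, 22.75, 22.75, 22.75]
print(smooth_by_medians(bins)[1])     # [22.5, 22.5, 22.5, 22.5]
print(smooth_by_boundaries(bins))     # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```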



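Smoothing by bin means
Sorted data for price
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition into equal-depth bins of four values
❑ Bin 1: 4, 8, 9, 15
❑ Bin 2: 21, 21, 24, 25
❑ Bin 3: 26, 28, 29, 34

Smoothing by bin means (means of 9, 22.75 and 29.25, rounded to the nearest integer)
❑ Bin 1: 9, 9, 9, 9
❑ Bin 2: 23, 23, 23, 23
❑ Bin 3: 29, 29, 29, 29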



Smoothing by bin boundaries
Sorted data for price
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Smoothing by bin boundaries


❑ Bin 1: 4, 4, 4, 15
❑ Bin 2: 21, 21, 25, 25
❑ Bin 3: 26, 26, 26, 34



Smoothing by bin median
Sorted data for price
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

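Smoothing by bin median (each bin holds four values, so the median is the mean of the two middle values)
❑ Bin 1 (4, 8, 9, 15): 8.5, 8.5, 8.5, 8.5
❑ Bin 2 (21, 21, 24, 25): 22.5, 22.5, 22.5, 22.5
❑ Bin 3 (26, 28, 29, 34): 28.5, 28.5, 28.5, 28.5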



Discretization Without Supervision: Binning vs. Clustering

[Figure: four panels showing the data, equal-width (distance) binning, equal-depth (frequency) binning, and K-means clustering; K-means clustering leads to better results]
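
For comparison with the two binning schemes, clustering-based discretization can be tried on the same price data with a small one-dimensional K-means (Lloyd's algorithm). This is only a sketch: the function name `kmeans_1d` is hypothetical, and the deterministic choice of starting centres is an assumption made so that the result is reproducible. It does not by itself show that clustering is better on any given data set.

```python
def kmeans_1d(values, k, max_iter=100):
    """Cluster 1-D values with Lloyd's algorithm; returns (clusters, centres)."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start (assumption): one centre from the middle of
    # each equal-depth slice of the sorted data.
    centres = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centre.
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda j: abs(v - centres[j]))
            clusters[nearest].append(v)
        # Update step: an empty cluster keeps its previous centre.
        updated = [sum(c) / len(c) if c else centres[j]
                   for j, c in enumerate(clusters)]
        if updated == centres:
            break
        centres = updated
    return clusters, centres


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
clusters, centres = kmeans_1d(prices, 3)
print(clusters)  # [[4, 8, 9, 15], [21, 21, 24, 25, 26], [28, 29, 34]]
```

Unlike equal-depth binning, the groups here are not forced to the same size: 26 joins the 21 to 25 group because it lies closer to that group's centre than to the next one.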
Assignment 3
⦿ Perform binning using equal-width partitioning and equal-depth (frequency)
partitioning
Temperature values:
64 65 68 69 70 71 72 75 75 80 81 83 85 87



Demo of AnswerMiner

https://www.answerminer.com/calculators/histogram/
Summary
⦿ Data Discretization



Next Session
⦿ Correlation (e.g., χ²) analysis



Thank You

References

http://hanj.cs.illinois.edu/bk3/
