Unit-1
Preprocessing
Summary
Dr R.Singh, MJP R.U., Bly
Why Data Preprocessing?
Data in the real world is dirty
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
- noisy: containing errors or outliers
- inconsistent: containing discrepancies in codes or names
Data cleaning
Data integration
Data transformation
Data reduction
Summary
Equi-width binning: 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80
Equi-depth binning: 0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80
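The two binning schemes above can be sketched in pure Python. This is a minimal illustration, not code from the slides; the function names and the 0-80 example are assumptions. Equi-width bins split the *range* evenly, while equi-depth bins split the *data* evenly.

```python
def equal_width_edges(lo, hi, k):
    """k bins of equal width over [lo, hi] -> k+1 edges."""
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]

def equal_depth_edges(sorted_values, k):
    """k bins holding (roughly) equal numbers of points; edges are data values."""
    n = len(sorted_values)
    return ([sorted_values[0]]
            + [sorted_values[i * n // k] for i in range(1, k)]
            + [sorted_values[-1]])

print(equal_width_edges(0, 80, 8))
# [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
```

Note how the equi-depth edges depend on the data itself, which is why the slide's equi-depth intervals are unequal in width.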
Smoothing using Binning Methods
First, sort the given data.
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
• Smoothing by bin means: mean = (w1 + w2 + ... + wN)/N
- Bin 1: (4 + 8 + 9 + 15)/4 = 36/4 = 9
- Bin 2: (21 + 21 + 24 + 25)/4 = 91/4 ≈ 23
- Bin 3: (26 + 28 + 29 + 34)/4 = 117/4 ≈ 29
Each value is replaced by its (rounded) bin mean:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
Smoothing using Binning Methods
Smoothing by bin boundaries: [4,15], [21,25], [26,34]
Bin 1: 4, 8, 9, 15
The second element, 8, is nearer to 4 (8 - 4 = 4) than to 15 (15 - 8 = 7), so 8 is replaced by 4.
The third element, 9, is nearer to 4 (9 - 4 = 5) than to 15 (15 - 9 = 6), so 9 is also replaced by 4.
* Result of the smoothing operation for Bin 1: 4, 4, 4, 15
Smoothing using Binning Methods
* Result of smoothing by bin boundaries: [4,15], [21,25], [26,34]
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
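The full procedure above (equi-depth partitioning, then smoothing by means or by boundaries) can be sketched in Python. The function names are illustrative; means are rounded to whole dollars to match the slide's results.

```python
def equi_depth_bins(values, k):
    """Sort the data, then split it into k bins of equal size."""
    data = sorted(values)
    size = len(data) // k
    return [data[i * size:(i + 1) * size] for i in range(k)]

def smooth_by_means(bins):
    """Replace every value with its bin's (rounded) mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with whichever bin boundary it is nearer to
    (ties go to the lower boundary)."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```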
[Figure: regression line y = x + 1 fitted over x (age), with clusters and an outlier marked]
Regression is used to smooth the data and helps handle noisy or unnecessary values.
For analysis purposes, regression also helps decide which variables are suitable for our analysis.
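A line such as y = x + 1 can be fitted by ordinary least squares; the sketch below uses the plain textbook formulas (illustrative names, no libraries), fitting points that lie exactly on y = x + 1 so the slope and intercept of 1 are recovered.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b using the closed-form formulas."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Points lying on y = x + 1 recover slope 1 and intercept 1.
xs = [1, 2, 3, 4]
ys = [2, 3, 4, 5]
print(fit_line(xs, ys))  # (1.0, 1.0)
```

Smoothing then means replacing each observed y with its predicted value a*x + b.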
STEP 2.5: INVALID DATA
Like corrupted data, invalid data is illogical.
For example, users who spend -2 hours on our app, or a person
whose age is 170.
Unlike corrupted data, invalid data does not result from faulty
collection processes, but from issues with data processing
(usually during feature preparation or data cleaning).
For example: You are preparing a report for your CEO about the
average time spent in your recently launched mobile app.
Everything works fine and the activity times look great, except for a
couple of rogue examples.
You notice some users spent -22 hours in the app. Digging
deeper, you go to the source of this anomaly.
In-app time is calculated as finish_hour - start_hour, so a session
running from 23:00 to 01:00 yields 1 - 23 = -22.
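The -22-hour anomaly comes from subtracting across midnight. One common fix, sketched below under the assumption that no session lasts 24 hours or more (the function name is illustrative, not from the source), is to wrap the difference modulo 24:

```python
def session_hours(start_hour, finish_hour):
    """Session duration in hours; the % 24 wrap handles sessions
    that cross midnight (assumes sessions shorter than 24h)."""
    return (finish_hour - start_hour) % 24

print(session_hours(23, 1))   # 2, instead of the invalid -22
print(session_hours(9, 17))   # 8
```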
STEP 2.6: DUPLICATE DATA
Duplicate data means the same observation appears more than once in the data set.
e.g., we count more customers than there actually are, or an average shifts because some values are represented more often than others.
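A simple order-preserving de-duplication can be sketched as follows (the customer rows are made-up illustration data, not from the source):

```python
# Hypothetical example data: one customer row is recorded twice.
customers = [
    ("C001", "Alice"),
    ("C002", "Bob"),
    ("C001", "Alice"),   # exact duplicate row
]

def drop_duplicates(rows):
    """Keep only the first occurrence of each row, preserving order."""
    seen = set()
    out = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out

print(len(customers), len(drop_duplicates(customers)))  # 3 2
```

Without this step we would count three customers instead of two, exactly the distortion described above.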
Data cleaning
Data integration
Data transformation
Data reduction
[Figure: attribute selection (A1? ... A6?); original data vs. its approximated, reduced representation]
Histograms
A popular data reduction technique.
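A histogram reduces the raw values to a handful of (bin, count) pairs. The sketch below builds an equal-width histogram for the price data used earlier; the function name and the choice of four bins over 0-40 are assumptions for illustration.

```python
from collections import Counter

def equal_width_histogram(values, lo, hi, k):
    """Reduce raw values to k (bin_start, count) pairs over [lo, hi]."""
    width = (hi - lo) / k
    # min(..., k - 1) keeps the maximum value inside the last bin.
    counts = Counter(min(int((v - lo) / width), k - 1) for v in values)
    return [(lo + i * width, counts.get(i, 0)) for i in range(k)]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_histogram(prices, 0, 40, 4))
# [(0.0, 3), (10.0, 1), (20.0, 7), (30.0, 1)]
```

Twelve raw values are reduced to four pairs, which is the whole point of the technique.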