
Data Mining

Preprocessing

Dr R. Singh, MJP R.U., Bly


Data Preprocessing

 Why preprocess the data?


 Major Tasks for preprocessing
 Data cleaning
 Data integration
 Data transformation
 Data reduction

 Summary
Why Data Preprocessing ?
 Data in the real world is dirty
 incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
 noisy: containing errors or outliers
 inconsistent: containing discrepancies in codes or names

 No quality data, no quality mining results!


 Quality decisions must be based on quality data
 Data warehouse needs consistent integration of quality
data
 Required for both OLAP and Data Mining!



Major Tasks for preprocessing



Data Preprocessing

 Data cleaning
 Data integration
 Data transformation
 Data reduction
 Summary



Why clean your data?

 A few reasons for pre-processing the data:


 It prevents you from wasting time on wobbly or even
faulty analysis
 It prevents you from making the wrong conclusions,
which would make you look bad!
 It makes your analysis run faster. Correct, properly
cleaned and formatted data speed up computation in
advanced algorithms



Data Cleaning Process
 DATA CLEANING IS A 3-STEP PROCESS

 STEP 1: FIND THE DIRT


 Start by determining what is wrong with your data

 STEP 2: SCRUB THE DIRT


 Depending on the type of dirt, choose the appropriate
data cleaning technique.

 STEP 3: RINSE AND REPEAT


 Once cleaned, repeat steps 1 and 2



STEP 1: FIND THE DIRT
 Start data cleaning by determining what is wrong with your
data.
 Are there rows with empty values?
 Entire columns with no data? Which data is missing and why?
 How is data distributed?
 Remember, visualizations are your friends. Plot outliers. Check
distributions to see which groups or ranges are more heavily
represented in your dataset.
 Keep an eye out for the weird:
 are there impossible values? Like “date of birth: male”, “address: -
1234”.
 Is your data consistent?
 Why are the same product names sometimes written in uppercase and other
times in camelCase?
 Wear your detective hat and jot down everything interesting,
surprising or even weird.
STEP 2: SCRUB THE DIRT
 Knowing the problem is half the battle. The other
half is solving it.
 How do you solve it, though?
 One ring might rule them all, but one approach is not going
to cut it with all your data cleaning problems.
 Depending on the type of data dirt you’re facing, you’ll
need different cleaning techniques.
 Step 2 is broken down into eight parts:
 Missing Data
 Outliers
 Contaminated Data
 Inconsistent Data
 Invalid Data
 Duplicate Data
 Data Type Issues
 Structural Errors
STEP 2.1: MISSING DATA
 Sometimes, the rows of the data have missing values.
Sometimes, almost entire columns will be empty.
 What to do with missing data?
 Ignoring it is like ignoring the holes in your boat while at sea
- you’ll sink.
 Start by spotting all the different disguises missing
data wears.
 It appears in values such as 0, “0”, empty strings, “Not
Applicable”, “NA”, “#NA”, None, NaN, NULL or Inf etc.
 When you have a general idea of what your missing data
looks like, it is time to answer the crucial question:
 “Is missing data telling me something valuable?”
STEP 2.1: MISSING DATA
 There are 3 main approaches to cleaning missing
data:
1)Drop rows and/or columns with missing data.
 If the missing data is not valuable, just drop the rows (i.e.
specific customers, sensor reading, or other individual
exemplars) from your analysis.
 If entire columns are filled with missing data, drop them as
well.
2)Recode missing data into a different format.
 Numerical computations can break down with missing data.
Transforming or Recoding missing values into a different
column saves the day. For example, the column
“payment_date” with empty rows can be recoded into a
column “payed_yet” with 0 for “no” and 1 for “yes”.
STEP 2.1: MISSING DATA
3) Fill in missing values with “best guesses.”
 Use moving averages and backfilling to estimate the most
probable values of data at that point.
 This is especially crucial for time-series analyses, where
missing data can distort your conclusions.
Age Income Religion Gender
23 24,200 Muslim M
39 ? Christian F
45 45,390 ? F

Fill missing values using aggregate functions (e.g., the average) or
probabilistic estimates based on the global value distribution:
E.g., put the average income here, or put the most probable income
based on the fact that the person is 39 years old.
E.g., put the most frequent religion here.

Other methods to handle Missing Data?


 Ignore the tuple: usually done when class label is missing
 not effective when the percentage of missing values per attribute varies
considerably.

 Fill in the missing value manually: tedious + infeasible?


 Use a global constant to fill in the missing value: e.g.,
“unknown”, a new class?!
 Use the measure of central tendency for the attribute (e.g.
mean or median) to fill in the missing value
 Use the attribute mean or median for all samples belonging
to the same class to fill in the missing value: smarter
 Use the most probable value to fill in the missing value:
inference-based such as Bayesian formula or decision tree
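
A minimal pandas sketch of a few of these fill strategies, using the Age/Income/Religion/Gender table above (the code and column names are illustrative, not part of the original slides):

    import pandas as pd

    df = pd.DataFrame({
        "age":      [23, 39, 45],
        "income":   [24200, None, 45390],
        "religion": ["Muslim", "Christian", None],
        "gender":   ["M", "F", "F"],
    })

    # Ignore (drop) tuples that are missing a value we cannot do without
    dropped = df.dropna(subset=["income"])

    # Fill a categorical attribute with a global constant or its most frequent value
    df["religion"] = df["religion"].fillna(df["religion"].mode()[0])

    # Fill a numeric attribute with a measure of central tendency (mean/median) ...
    df["income_global_fill"] = df["income"].fillna(df["income"].median())

    # ... or, smarter, with the mean of samples belonging to the same class (here: gender)
    df["income_class_fill"] = df["income"].fillna(
        df.groupby("gender")["income"].transform("mean")
    )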
Simple Discretization Methods: Binning
 Binning method:
 first sort data and partition into (equi-depth) bins
 then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
 Equal-width (distance) partitioning:
 It divides the range into N intervals of equal size: uniform grid
 if A and B are the lowest and highest values of the attribute,
the width of intervals will be: W = (B-A)/N.
 The most straightforward
 But outliers may dominate presentation
 Skewed data is not handled well.
 Equal-depth (frequency) partitioning:
 It divides the range into N intervals, each containing
approximately same number of samples
 Good data scaling – good handling of skewed data
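A small sketch of the two partitioning schemes, assuming pandas is available (pd.cut gives equal-width bins, pd.qcut gives equal-depth bins; the ages are illustrative):

    import pandas as pd

    ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 30, 33, 35, 36, 40, 45, 46, 52, 70])

    # Equal-width: N intervals of equal size, width W = (B - A) / N
    equal_width = pd.cut(ages, bins=4)

    # Equal-depth: N intervals containing approximately the same number of samples
    equal_depth = pd.qcut(ages, q=4)

    print(equal_width.value_counts().sort_index())
    print(equal_depth.value_counts().sort_index())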
Simple Discretization Methods: Binning
Example: binning customer ages (the original slide shows the number of values in each bin as a bar chart):

Equi-width binning: 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80

Equi-depth binning: 0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80
Smoothing using Binning Methods
 First sort the unsorted given data
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24,
25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
• Smoothing by bin means: replace each value with its bin mean = (v1 + v2 + … + vN) / N
- Bin 1: (4 + 8 + 9 + 15) / 4 = 36/4 = 9
- Bin 2: (21 + 21 + 24 + 25) / 4 = 92/4 = 23
- Bin 3: (26 + 28 + 29 + 34) / 4 = 116/4 = 29

- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
Smoothing using Binning Methods
 Smoothing by bin boundaries: [4,15], [21,25], [26,34]

First, do the smoothing operation on Bin 1: 4, 8, 9, 15


 Fix the boundary values first:
 the boundary values for Bin 1 are: [4, --, --, 15] (i.e., the first and last elements)
 Now compare the second element, 8, with the boundaries 4 and 15:
 8 is closer to 4 (8 - 4 = 4) than to 15 (15 - 8 = 7)

 Bin 1: 4, 8, 9, 15
 Therefore the second element, 8, is replaced by 4.
 The third element, 9, is closer to 4 (9 - 4 = 5) than to 15 (15 - 9 = 6).
Therefore, 9 is also replaced by 4.

 Bin 1: 4, 8, 9, 15
Result of the smoothing operation for Bin 1: 4, 4, 4, 15
Smoothing using Binning Methods
 Smoothing by bin boundaries: [4,15], [21,25], [26,34]

First, do the smoothing operation on Bin 2: 21, 21, 24, 25


 Fix the boundary values first:
 the boundary values for Bin 2 are: [21, --, --, 25] (i.e., the first and last elements)
 Now compare the second element, 21, with the boundaries 21 and 25:
 21 is closer to 21 (21 - 21 = 0) than to 25 (25 - 21 = 4)

 Bin 2: 21, 21, 24, 25


 Therefore the second element, 21, stays 21.
 The third element, 24, is closer to 25 (25 - 24 = 1) than to 21 (24 - 21 = 3).
Therefore, 24 is replaced by 25.

 Bin 2: 21, 21, 24, 25


Result of the smoothing operation for Bin 2: 21, 21, 25, 25
Smoothing using Binning Methods
 Smoothing by bin boundaries: [4,15], [21,25], [26,34]

First, do the smoothing operation on Bin 3: 26, 28, 29, 34


 Fix the boundary values first:
 the boundary values for Bin 3 are: [26, --, --, 34] (i.e., the first and last elements)
 Now compare the second element, 28, with the boundaries 26 and 34:
 28 is closer to 26 (28 - 26 = 2) than to 34 (34 - 28 = 6)

 Bin 3: 26, 28, 29, 34


 Therefore the second element, 28, is replaced by 26.
 The third element, 29, is closer to 26 (29 - 26 = 3) than to 34 (34 - 29 = 5).
Therefore, 29 is also replaced by 26.

 Bin 3: 26, 28, 29, 34


Result of the smoothing operation for Bin 3: 26, 26, 26, 34
Smoothing using Binning Methods
Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

Therefore, the result of smoothing by bin boundaries ([4,15], [21,25], [26,34]) is:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34

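The worked example above can be reproduced with a short sketch in plain Python (equi-depth bins of size 4; the variable names are illustrative):

    prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
    bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

    # Smoothing by bin means: replace every value with the mean of its bin
    by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

    # Smoothing by bin boundaries: replace every value with the closer boundary
    by_bounds = [[b[0] if (v - b[0]) <= (b[-1] - v) else b[-1] for v in b] for b in bins]

    print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
    print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]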


Home Assignment of Binning Method
 Unsorted data for price in dollars:
 8 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34



STEP 2.2: OUTLIERS
 Outliers are data points which are at an extreme.
 They usually have very high or very low values:
 An Antarctic sensor reading the temperature of 100º
 A customer who buys $0.01 worth of merchandise per year

 How to interpret those?


 Outliers usually signify either very interesting behavior
or a broken collection process.
 Both are valuable information (hey, check your
sensors, before checking your outliers), but proceed
with cleaning only if the behavior is actually
interesting.
STEP 2.2: OUTLIERS
 There are three approaches to dealing with outliers:
1. Remove outliers from the analysis. Having outliers can
mess up your analysis by bringing the averages up or down
and in general distorting your statistics. Remove them by
removing the upper and lower X-percentile of your data.
2. Segment data so outliers are in a separate group. Put all
the “normal-looking” data in one group, and outliers in
another.
3. Keep outliers, but use different statistical methods for
analysis. Weighted means (which put more weight on the
“normal” part of the distribution) and trimmed means are two
common approaches of analyzing datasets with outliers,
without suffering the negative consequences of outliers.
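A hedged sketch of approach 1 (removing the upper and lower X-percentile), assuming a numeric pandas Series; the data are illustrative:

    import pandas as pd

    values = pd.Series([3, 5, 6, 7, 7, 8, 9, 10, 11, 250])  # 250 is an obvious outlier

    # Drop everything below the 5th and above the 95th percentile
    low, high = values.quantile(0.05), values.quantile(0.95)
    trimmed = values[(values >= low) & (values <= high)]

    print(values.mean(), trimmed.mean())  # the mean drops sharply once the outlier is gone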
Cluster Analysis
Cluster analysis is used for finding outliers and also for grouping the data.
Clustering is generally used in unsupervised learning.

(Figure: scatter plot of salary vs. age; the dense groups of points form clusters, and the isolated point far from any cluster is an outlier.)


STEP 2.3: CONTAMINATED DATA

 Contaminated data is another red flag for your


collection process.
 Examples of contaminated data include:
 Wind turbine data in your water plant dataset.
 Purchase information in your customer address dataset.
 Future data in your current event time-series data.

 The last one is particularly sneaky.


STEP 2.3: CONTAMINATED DATA
 Example: Imagine having a row of financial trading information for each
day.
 Columns (or features) would include the date, asset type, asking price,
selling price, the difference in asking price from yesterday, the average
asking price for this quarter.
 The average asking price for this quarter is the source of contamination.
 You can only compute the averages once the quarter is over, but that
information would not be given to you on the trading date - thus
introducing future data, which contaminates the present data.
 With corrupted data, there is not much you can do except for removing
it. This requires a lot of domain expertise.
 When lacking domain knowledge, consult non-analytical members of
your team. Make sure to also fix any leakages your data collection
pipeline has so that the data corruption does not repeat with future data
collection.
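For the trading example, a hedged sketch of recomputing a leaking feature so that it only uses information available on the trading date (pandas assumed; column names are illustrative):

    import pandas as pd

    trades = pd.DataFrame({
        "date":      pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
        "ask_price": [100.0, 102.0, 101.0],
    })
    quarter = trades["date"].dt.to_period("Q")

    # Leaky feature: the full-quarter average is not known on the trading date
    trades["quarter_avg_leaky"] = trades.groupby(quarter)["ask_price"].transform("mean")

    # Leak-free feature: an expanding average over past rows only (shift excludes "today")
    trades["quarter_avg_to_date"] = (
        trades.groupby(quarter)["ask_price"]
              .transform(lambda s: s.shift(1).expanding().mean())
    )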
STEP 2.4: INCONSISTENT DATA
 “Wait, did we sell ‘Apples’, ‘apples’, or ‘APPLES’ this month?”
 You have to expect inconsistency in your data. Especially when
there is a higher possibility of human error.
 The best way to spot inconsistent representations of the same
elements in your database is to visualize them.
 Plot bar charts per product category.
 Do a count of rows by category if this is easier.
 When you spot the inconsistency, standardize all elements into the
same format.
 Humans might understand that ‘apples’ is the same as ‘Apples’
(capitalization) which is the same as ‘appels’ (misspelling), but
computers think those three refer to three different things
altogether.
 Lowercasing as default and correcting typos are your friends here.
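A minimal sketch of standardizing inconsistent labels with pandas (the typo map is illustrative):

    import pandas as pd

    products = pd.Series(["Apples", "apples", "APPLES", "appels", "Bananas"])

    # Lowercase as the default representation, then correct known typos
    cleaned = products.str.strip().str.lower().replace({"appels": "apples"})

    print(cleaned.value_counts())  # apples: 4, bananas: 1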
Regression

(Figure: example of linear regression, fitting the line y = x + 1 to salary (y) as a function of age (x).)

 Regression is used to smooth the data and helps handle data when unnecessary (noisy) values are present.
 For analysis purposes, regression also helps decide which variables are suitable for our analysis.
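A tiny sketch of smoothing one variable with a fitted line, assuming NumPy (the numbers are illustrative):

    import numpy as np

    age = np.array([20, 25, 30, 35, 40, 45])
    salary = np.array([21, 27, 30, 37, 40, 47])        # noisy observations

    slope, intercept = np.polyfit(age, salary, deg=1)  # fit y = slope * x + intercept
    smoothed = slope * age + intercept                 # replace values with the fitted line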
STEP 2.5: INVALID DATA
 Similar to corrupted data, invalid data is illogical.
 For example, users who spend -2 hours on our app, or a person
whose age is 170.
 Unlike corrupted data, invalid data does not result from faulty
collection processes, but from issues with data processing
(usually during feature preparation or data cleaning).
 For example: You are preparing a report for your CEO about the
average time spent in your recently launched mobile app.
 Everything works fine, the activities time looks great, except for a
couple of rogue examples.
 You notice some users spent -22 hours in the app. Digging
deeper, you go to the source of this anomaly.
 In-app time is calculated as finish_hour - start_hour, so a session that starts at 23:00 and ends at 01:00 gives 1 - 23 = -22.
STEP 2.6: DUPLICATE DATA
 Duplicate data means the same values repeating for an observation
point.
 e.g. we count more customers than there actually are, or the
average changes because some values are more often
represented.

 There are different sources of duplicate data:


 Data are combined from different sources, and each source brings
in the same data to our database.
 The user might submit information twice by clicking on the submit
button.
 Our data collection code is off and inserts the same records
multiple times.
STEP 2.6: DUPLICATE DATA
There are three ways to eliminate duplicates:
 Find the same records and delete all but one.
 Pairwise match records, compare them and take the most
relevant one (e.g. the most recent one)
 Combine the records into entities via clustering (e.g. the
cluster of information about customer Harpreet Sahota,
which has all the data associated with it).
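A short pandas sketch of the first two strategies (column names are illustrative):

    import pandas as pd

    customers = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "email":       ["a@x.com", "a@x.com", "b@x.com"],
        "updated_at":  pd.to_datetime(["2024-01-01", "2024-03-01", "2024-02-01"]),
    })

    # 1) Exact duplicates: keep only one copy of identical records
    deduped = customers.drop_duplicates()

    # 2) Near-duplicates of the same entity: keep the most recent record per customer
    latest = (customers.sort_values("updated_at")
                       .drop_duplicates(subset="customer_id", keep="last"))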
STEP 2.7: DATA TYPE ISSUES
 Depending on which data type you work with (DateTime objects, strings, integers, decimals or floats), you can encounter problems specific to that data type.
STEP 2.7: DATA TYPE ISSUES
2.7.1 Cleaning string
 Strings are usually the messiest part of data cleaning
because they are often human-generated and hence
prone to errors.
 The common cleaning techniques for strings involve:
 Standardizing casing across the strings
 Removing whitespace and newlines
 Removing stop words (for some linguistic analyses)
 One-hot encoding categorical variables represented as strings
 Correcting typos
 Standardizing encodings
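A short sketch of a few of these string-cleaning steps, assuming pandas (the stop-word list and typo map are illustrative):

    import pandas as pd

    comments = pd.Series(["  Grate product!\n", "great PRODUCT", "the product is great"])

    cleaned = (comments
               .str.strip()    # remove surrounding whitespace and newlines
               .str.lower()    # standardize casing
               .replace({"grate product!": "great product"}))  # correct a known typo

    # Remove (illustrative) stop words, then one-hot encode the remaining categories
    stop_words = {"the", "is"}
    tokens = cleaned.str.split().apply(lambda ws: " ".join(w for w in ws if w not in stop_words))
    encoded = pd.get_dummies(tokens)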
STEP 2.7: DATA TYPE ISSUES
2.7.1 Cleaning string
 Especially the last one can cause a lot of problems.
Encodings are the way of translating between the 0’s and
1’s of computers and the human-readable representation
of text.
 And as there are different languages, there are different
encodings.
 Everyone has seen strings of the type �����. Which
meant our browser or computer could not decode the
string. It is the same as trying to play a cassette on your
gramophone. Both are made for music, but they
represent it in different ways.
 When in doubt, go for UTF-8 as your encoding standard.
STEP 2.7: DATA TYPE ISSUES
2.7.2 Cleaning date and time
 Dates and times can be tricky. Sometimes the error is not
apparent until doing computations (like the activity-duration
example above) on dates and times. The cleaning process
involves:
 Making sure that all your dates and times are either a
DateTime object or a Unix timestamp (via type coercion).
 Do not be tricked by strings pretending to be a DateTime object,
like “24 Oct 2019”. Check for data type and coerce where necessary.
 Internationalization and time zones.
 DateTime objects are often recorded with the time zone or without
one. Either of those can cause problems. If you are doing region-
specific analysis, make sure to have DateTime in the correct
timezone. If you do not care about internationalization, convert all
DateTime objects to your timezone.
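A hedged sketch of coercing date strings and normalizing time zones with pandas (the format string and time zones are assumptions for the example):

    import pandas as pd

    events = pd.Series(["24 Oct 2019", "25 Oct 2019", "not a date"])

    # Coerce strings to real DateTime objects; unparseable values become NaT
    timestamps = pd.to_datetime(events, format="%d %b %Y", errors="coerce")

    # Attach a time zone, then convert everything to one reference zone
    localized = timestamps.dt.tz_localize("UTC").dt.tz_convert("Asia/Kolkata")
    print(localized)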
STEP 2.8: STRUCTURAL ERRORS
 Even though we treated data issues comprehensively, there is a
class of problems with data, which arise due to structural errors.
 Structural errors arise during measurement, data transfer, or
other situations
 structural errors can lead to inconsistent data, data
duplication or contamination
 But unlike the treatment advised above, you are not going to
solve structural errors by applying cleaning techniques to
them
 Because you can clean the data all you want, but at the next
import, the structural errors will produce unreliable data again.
 Structural errors are given special treatment to emphasize that
a lot of data cleaning is about preventing data issues rather
than resolving data issues
So you need to review your engineering best practices. Check your ETL pipeline
and how you collect and transform data from their raw data sources to
identify where the source of structural errors is and remove it.
STEP 3: RINSE AND REPEAT
Once cleaned, you repeat steps 1 and 2.

This is helpful for three reasons:


1. You might have missed something. Repeating the cleaning
process helps you catch those pesky hidden issues.
2. Through cleaning, you discover new issues. For example,
once you removed outliers from your dataset, you noticed
that data is not bell-shaped anymore and needs reshaping
before you can analyze it.
3. You learn more about your data. Every time you sweep
through your dataset and look at the distributions of values,
you learn more about your data, which gives you hunches as
to what to analyze.
STEP 3: RINSE AND REPEAT
 Data scientists spend 80% of their time cleaning and
organizing data because of the associated benefits.
 Or as the old machine learning wisdom goes:

 Garbage in, garbage out.


 All algorithms can do is spot patterns.

 And if they need to spot patterns in a mess, they are


going to return “mess” as the governing pattern.
 Clean data beats fancy algorithms any day.
STEP 3: RINSE AND REPEAT

 But cleaning data is not in the sole domain of data


science. High-quality data are necessary for any type
of decision-making.

 From startups launching the next Google search


algorithm to business enterprises relying on
Microsoft Excel for their business intelligence
 clean data is the pillar upon which data-driven decision-
making rests.
AUTOMATE YOUR DATA CLEANING
 By now it is clear how important data cleaning is.
 But it still takes way too long. And it is not the most intellectually
stimulating challenge.
 To avoid losing time, while not neglecting the data cleaning
process, data practitioners automate a lot of repetitive cleaning
tasks.
 Mainly there are two branches of data cleaning that you can
automate:
 Problem discovery. Use any visualization tools that allow you to
quickly visualize missing values and different data distributions.
 Transforming data into the desired form. The majority of data
cleaning is running reusable scripts, which perform the same
sequence of actions. For example: 1) lowercase all strings, 2)
remove whitespace, 3) break down strings into words.
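A minimal sketch of such a reusable script, covering exactly the three steps listed above (pandas assumed; the function name is illustrative):

    import pandas as pd

    def clean_strings(col: pd.Series) -> pd.Series:
        """Reusable cleaning: lowercase, trim whitespace, split into words."""
        return (col.str.lower()   # 1) lowercase all strings
                   .str.strip()   # 2) remove surrounding whitespace
                   .str.split())  # 3) break strings down into words

    print(clean_strings(pd.Series(["  Fresh Apples ", "RED apples"])))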
AUTOMATE YOUR DATA CLEANING
 Whether automation is your cup of tea or not, remember the main
steps when cleaning data:
 Identify the problematic data
 Clean the data
 Remove, encode, fill in any missing data
 Remove outliers or analyze them separately
 Purge contaminated data and correct leaking pipelines
 Standardize inconsistent data
 Check if your data makes sense (is valid)
 Deduplicate multiple records of the same data
 Foresee and prevent type issues (string issues, DateTime issues)
 Remove engineering errors (aka structural errors)
 Rinse and repeat
 Keep a list of those steps by your side and make sure your data
gives you the valuable insights you need.
Data Preprocessing

 Data cleaning
 Data integration
 Data transformation
 Data reduction



Data Integration
 Data integration:
 combines data from multiple sources into a coherent store
 Schema integration
 integrate metadata from different sources
 metadata: data about the data (i.e., data descriptors)
 Entity identification problem: identify real-world entities
from multiple data sources, e.g., A.cust-id ≡ B.cust-#
 Detecting and resolving data value conflicts
 for the same real-world entity, attribute values from
different sources may differ (e.g., J.D. Smith and John
Smith may refer to the same person)
 possible reasons: different representations, different
scales, e.g., metric vs. British units (inches vs. cm)



Handling Redundant
Data in Data Integration
 Redundant data often occur when integrating
multiple databases
 The same attribute may have different names in different
databases
 One attribute may be a “derived” attribute in another
table, e.g., annual revenue
 Redundant data may be detected
by correlation analysis
 Careful integration of the data from multiple
sources may help reduce/avoid redundancies
and inconsistencies and improve mining speed
and quality
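A hedged sketch of detecting a likely redundant (derived) attribute via correlation analysis, assuming numeric attributes and pandas (the data are illustrative):

    import pandas as pd

    sales = pd.DataFrame({
        "monthly_revenue": [10, 12, 9, 15, 11],
        "annual_revenue":  [120, 144, 108, 180, 132],  # derived: 12 * monthly_revenue
        "num_customers":   [50, 48, 55, 60, 52],
    })

    # Pearson correlation matrix; a coefficient near +1 or -1 flags a likely redundant attribute
    print(sales.corr())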
Data Preprocessing

 Data cleaning
 Data integration
 Data transformation
 Data reduction



Data Transformation
 Smoothing: remove noise from data
 Aggregation: summarization, data cube
construction
 Generalization: concept hierarchy climbing
 Normalization: scaled to fall within a small,
specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 Attribute/feature construction
 New attributes constructed from the given ones
Normalization: Why normalization?
 Speeds up some learning techniques (e.g.,
neural networks)
 Helps prevent attributes with large
ranges from outweighing ones with small ranges
 Example:
 income has range 3000-200000
 age has range 10-80
 gender has domain M/F



Data Transformation: Normalization
 Min-max normalization:
   v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
   e.g., convert age = 30 to the range 0-1, when min = 10 and max = 80:
   new_age = (30 - 10) / (80 - 10) = 2/7 ≈ 0.29
 Z-score normalization:
   v' = (v - mean_A) / stand_dev_A
 Normalization by decimal scaling:
   v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
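A small sketch of the three normalization formulas, reusing the age example above (NumPy assumed; the z-score mean and standard deviation are illustrative):

    import numpy as np

    def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
        return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

    def z_score(v, mean_a, std_a):
        return (v - mean_a) / std_a

    def decimal_scaling(values):
        j = int(np.floor(np.log10(np.max(np.abs(values))))) + 1  # smallest j with max(|v'|) < 1
        return np.asarray(values) / 10 ** j

    print(min_max(30, 10, 80))           # 0.2857... = 2/7
    print(z_score(30, 40, 12))           # -0.833... (illustrative mean=40, std=12)
    print(decimal_scaling([-991, 250]))  # [-0.991, 0.25]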
Data Preprocessing
 Data cleaning
 Data integration
 Data transformation
 Data reduction



Data Reduction Strategies
 Warehouse may store terabytes of data: Complex
data analysis/mining may take a very long time to
run on the complete data set
 Data reduction
 Obtains a reduced representation of the data set that is
much smaller in volume but yet produces the same (or
almost the same) analytical results
 Data reduction strategies
 Data cube aggregation
 Dimensionality reduction
 Data compression
 Numerosity reduction
 Discretization and concept hierarchy generation
Data Cube Aggregation
 The lowest level of a data cube
 the aggregated data for an individual entity of interest
 e.g., a customer in a phone calling data warehouse.
 Multiple levels of aggregation in data cubes
 Further reduce the size of data to deal with
 Reference appropriate levels
 Use the smallest representation which is enough to solve
the task
 Queries regarding aggregated information should be
answered using data cube, when possible
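A small pandas sketch of moving from the lowest cube level (individual calls) to a higher aggregation level (per customer per year); column names are illustrative:

    import pandas as pd

    calls = pd.DataFrame({
        "customer": ["A", "A", "B", "B", "B"],
        "year":     [2023, 2023, 2023, 2024, 2024],
        "minutes":  [10, 25, 5, 40, 15],
    })

    # Aggregate away the per-call detail: one row per (customer, year)
    cube_level = calls.groupby(["customer", "year"], as_index=False)["minutes"].sum()
    print(cube_level)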
Dimensionality Reduction
 Feature selection (i.e., attribute subset selection):
 Select a minimum set of features such that the probability
distribution of different classes given the values for those
features is as close as possible to the original distribution
given the values of all features
 reduces the number of attributes appearing in the discovered patterns, making them easier to understand
 Heuristic methods (due to exponential # of
choices):
 step-wise forward selection
 step-wise backward elimination
 combining forward selection and backward elimination
 decision-tree induction
Hierarchical Reduction
 Use multi-resolution structure with different
degrees of reduction
 Hierarchical clustering is often performed but tends
to define partitions of data sets rather than
“clusters”
 Parametric methods are usually not amenable to
hierarchical representation
 Hierarchical aggregation
 An index tree hierarchically divides a data set into
partitions by value range of some attributes
 Each partition can be considered as a bucket
 Thus an index tree with aggregates stored at each node is
a hierarchical histogram
Heuristic Feature Selection Methods
 There are 2^d possible feature subsets of d
features
 Several heuristic feature selection methods:
 Best single features under the feature independence
assumption: choose by significance tests.
 Best step-wise feature selection:
 The best single-feature is picked first
 Then next best feature condition to the first, ...
 Step-wise feature elimination:
 Repeatedly eliminate the worst feature
 Best combined feature selection and elimination:
 Optimal branch and bound:
 Use feature elimination and backtracking
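A hedged sketch of best step-wise (forward) selection; the scoring criterion here (mean absolute correlation with the class) is purely illustrative, and any model-based score could be substituted:

    import pandas as pd

    def forward_selection(X: pd.DataFrame, y: pd.Series, k: int) -> list:
        """Greedily pick k features; each step adds the single best remaining feature."""
        selected = []
        while len(selected) < k:
            def score(feature):
                subset = X[selected + [feature]]
                return subset.corrwith(y).abs().mean()
            best = max((f for f in X.columns if f not in selected), key=score)
            selected.append(best)
        return selected

    X = pd.DataFrame({"A1": [1, 2, 3, 4], "A2": [4, 3, 2, 1], "A3": [1, 1, 2, 2]})
    y = pd.Series([0, 0, 1, 1])
    print(forward_selection(X, y, k=2))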
Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

(Figure: induced decision tree with A4 at the root, A1 and A6 as internal nodes, and Class 1 / Class 2 leaves; only the attributes tested in the tree are retained.)

> Reduced attribute set: {A1, A4, A6}
Data Compression

(Figure: the original data can be reduced to compressed data and reconstructed either exactly (lossless compression) or only approximately (lossy compression).)
Histograms
 A popular data reduction technique
 Divide data into buckets and store the average (or sum) for each bucket
 Can be constructed optimally in one dimension using dynamic programming
 Related to quantization problems.

(Figure: example histogram of prices, with buckets at 10000, 30000, 50000, 70000 and 90000 on the x-axis and counts on the y-axis.)
Histogram types
 Equal-width histograms:
 It divides the range into N intervals of equal size
 Equal-depth (frequency) partitioning:
 It divides the range into N intervals, each containing
approximately same number of samples
 V-optimal:
 It considers all histogram types for a given number of
buckets and chooses the one with the least variance.
 MaxDiff:
 After sorting the data to be approximated, it defines the
borders of the buckets at points where the adjacent
values have the maximum difference
 Example: split 1,1,4,5,5,7,9,14,16,18,27,30,30,32 into three
buckets. The two largest adjacent differences are 27-18 = 9 and 14-9 = 5,
so the buckets are (1,1,4,5,5,7,9), (14,16,18) and (27,30,30,32).
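A small sketch of the MaxDiff rule for the example above: to get B buckets, cut at the B-1 largest gaps between adjacent sorted values (plain Python):

    values = sorted([1, 1, 4, 5, 5, 7, 9, 14, 16, 18, 27, 30, 30, 32])
    n_buckets = 3

    # Indices just after the (n_buckets - 1) largest adjacent differences
    gaps = [(values[i + 1] - values[i], i + 1) for i in range(len(values) - 1)]
    cuts = sorted(i for _, i in sorted(gaps, reverse=True)[:n_buckets - 1])

    # Split the sorted data at those borders
    buckets = [values[a:b] for a, b in zip([0] + cuts, cuts + [len(values)])]
    print(buckets)  # [[1, 1, 4, 5, 5, 7, 9], [14, 16, 18], [27, 30, 30, 32]]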