DWDM LS3 Fall 24 25

Data Mining:

Concepts and Techniques

(3rd ed.)

— Chapter 3 —
© Jiawei Han, Micheline Kamber, and Jian Pei

1
Chapter 3: Data Preprocessing

◼ Data Preprocessing: An Overview

◼ Data Quality

◼ Major Tasks in Data Preprocessing

◼ Data Cleaning

◼ Data Integration

◼ Data Reduction

◼ Data Transformation and Data Discretization

◼ Summary

2
Data Quality: Why Preprocess the Data?

◼ Measures for data quality: A multidimensional view

◼ Accuracy: correct or wrong, accurate or not
◼ Completeness: not recorded, unavailable, …
◼ Consistency: some modified but some not, …
◼ Timeliness: timely update?
◼ Believability: how much can the data be trusted to be correct?
◼ Interpretability: how easily the data can be understood?

3
Major Tasks in Data Preprocessing

4
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation

5
Data Cleaning
◼ Data in the real world is dirty: lots of potentially incorrect data, e.g., due to faulty instruments, human or computer error, or transmission error
◼ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
◼ e.g., Occupation=“ ” (missing data)
◼ Noisy: containing noise, errors, or outliers
◼ e.g., Salary=“−10” (an error)
◼ Inconsistent: containing discrepancies in codes or names, e.g.,
◼ Age=“42”, Birthday=“03/07/2010”
◼ Was rating “1, 2, 3”, now rating “A, B, C”
◼ Discrepancy between duplicate records
◼ Intentional (e.g., disguised missing data)
◼ Jan. 1 as everyone’s birthday?
7
Incomplete (Missing) Data

◼ Data is not always available

◼ E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
◼ Missing data may be due to
◼ equipment malfunction
◼ data that was inconsistent with other recorded data and thus deleted
◼ data not entered due to misunderstanding
◼ certain data not considered important at the time of entry
◼ history or changes of the data not registered
◼ Missing data may need to be inferred
8
How to Handle Missing Data?
◼ Ignore the tuple
◼ usually done when class label is missing (when doing classification)
◼ Fill in the missing value manually: tedious + infeasible?
◼ Fill it in automatically with
◼ a global constant: e.g., “unknown”, a new class?!
◼ the attribute mean
◼ the attribute mean for all samples belonging to the same class: smarter
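The automatic fill-in strategies above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed implementation: the record layout (a list of dicts with None marking a missing value), the helper names `fill_with_mean` and `fill_with_class_mean`, and the sample attributes `risk` and `income` are all hypothetical.

```python
from collections import defaultdict


def fill_with_mean(rows, attr):
    """Replace missing (None) values of attr with the overall attribute mean."""
    known = [r[attr] for r in rows if r[attr] is not None]
    mean = sum(known) / len(known)
    return [{**r, attr: mean if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows, attr, label):
    """Replace missing values of attr with the mean over rows of the same class."""
    by_class = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            by_class[r[label]].append(r[attr])
    means = {c: sum(v) / len(v) for c, v in by_class.items()}
    # A class with no known values keeps None: there is nothing to infer from.
    return [
        {**r, attr: means.get(r[label]) if r[attr] is None else r[attr]}
        for r in rows
    ]


# Hypothetical sales records; None marks a missing income.
customers = [
    {"risk": "low", "income": 40000},
    {"risk": "low", "income": None},
    {"risk": "low", "income": 60000},
    {"risk": "high", "income": 20000},
    {"risk": "high", "income": None},
]

global_filled = fill_with_mean(customers, "income")
class_filled = fill_with_class_mean(customers, "income", "risk")
print(global_filled[1]["income"], class_filled[1]["income"])
```

With this sample, the overall mean pulls the missing low-risk income toward the high-risk customers, while the class mean uses only the other low-risk records, which is why the slide calls the class-based variant smarter.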

9
Noisy Data
◼ Noise: random error or variance in a measured variable
◼ Incorrect attribute values may be due to
◼ faulty data collection instruments

◼ data entry problems

◼ data transmission problems

◼ technology limitation

◼ inconsistency in naming convention

◼ Other data problems which require data cleaning

◼ duplicate records

◼ incomplete data

◼ inconsistent data

10
How to Handle Noisy Data?

◼ Binning
◼ First sort the data and partition it into (equal-frequency) bins
◼ Then smooth by bin means, bin medians, bin boundaries, etc.
◼ Regression
◼ Smooth by fitting the data to regression functions
◼ Clustering
◼ Detect and remove outliers
◼ Combined computer and human inspection
◼ Detect suspicious values and have a human check them (e.g., deal with possible outliers)

11
Data Cleaning as a Process
◼ Data discrepancy detection
◼ Use metadata (e.g., domain, range, dependency, distribution)
◼ Check the uniqueness rule, consecutive rule, and null rule
◼ Use commercial tools
◼ Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
◼ Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
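A rule-based discrepancy check of the kind listed above might look like the following sketch. It covers only the null rule, the uniqueness rule, and range metadata (the consecutive rule is omitted). The function name `find_discrepancies`, the record layout, and the salary range are assumptions made for this example.

```python
def find_discrepancies(rows, key, required, ranges):
    """Return (row_index, message) pairs for every rule violation found."""
    problems = []
    seen = {}
    for i, row in enumerate(rows):
        # Null rule: required attributes must carry a value.
        for attr in required:
            if row.get(attr) in (None, ""):
                problems.append((i, f"{attr} is missing"))
        # Range metadata: values must lie inside the declared domain.
        for attr, (low, high) in ranges.items():
            value = row.get(attr)
            if value not in (None, "") and not (low <= value <= high):
                problems.append((i, f"{attr}={value} outside [{low}, {high}]"))
        # Uniqueness rule: the key value must not repeat.
        k = row.get(key)
        if k in seen:
            problems.append((i, f"{key}={k} duplicates row {seen[k]}"))
        else:
            seen[k] = i
    return problems


# Hypothetical employee records with one violation of each rule.
records = [
    {"id": 1, "occupation": "engineer", "salary": 52000},
    {"id": 2, "occupation": "", "salary": 48000},
    {"id": 3, "occupation": "teacher", "salary": -10},
    {"id": 1, "occupation": "clerk", "salary": 30000},
]

issues = find_discrepancies(
    records, key="id", required=["occupation"], ranges={"salary": (0, 1_000_000)}
)
for index, message in issues:
    print(index, message)
```

The check only flags suspicious rows. Deciding how to correct them is the scrubbing step, which usually needs domain knowledge or human inspection.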

12
Data Integration
◼ Data integration:
◼ Combines data from multiple sources into a coherent store
◼ Schema integration: e.g., A.cust-id ≡ B.cust-#
◼ Integrate metadata from different sources
◼ Entity identification problem:
◼ Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
◼ Detecting and resolving data value conflicts
◼ For the same real-world entity, attribute values from different sources are
different
◼ Possible reasons: different representations, different scales, e.g., metric vs.
British units
14
Data Integration Challenges
| Challenge | Description | Solution | Relevant Tools |
|---|---|---|---|
| Data Quality Issues | Incomplete, incorrect, or redundant data from different sources. | Implement data quality checks like cleansing, deduplication, and normalization. | Informatica Data Quality, Talend Data Quality |
| Data Format and Schema Incompatibility | Different data formats and schemas across systems (e.g., JSON vs. CSV, different field names). | Use ETL tools to map fields, standardize formats, and reconcile schema differences. | Informatica, Talend, MuleSoft |
| Data Latency | Delays in data availability affect real-time analytics and decision-making. | Use data streaming solutions or micro-batching to reduce latency. | Apache Kafka, AWS Kinesis, Google Dataflow |
| Data Security and Privacy | Risk of exposure when sharing sensitive data across systems. | Apply data encryption, access control, data masking, and comply with data protection regulations. | Informatica, IBM InfoSphere, Talend |
| Data Consistency and Synchronization | Inconsistent data across sources, leading to conflicting information. | Use MDM to enforce data consistency and CDC to synchronize changes in real-time. | SAP Master Data Governance, Informatica, IBM InfoSphere |
| Scalability Issues | Large datasets slow down integration processes, impacting performance. | Use scalable cloud platforms and distributed processing frameworks. | Google BigQuery, Amazon Redshift, Apache Spark |
| Heterogeneous Data Sources | Combining structured, semi-structured, and unstructured data is complex. | Use integration tools that support multiple data types and establish ETL workflows to transform data. | Talend, Informatica, MuleSoft |
Handling Redundancy in Data Integration

◼ Redundant data often occur when integrating multiple databases
◼ Object identification: The same attribute or object may have different names in different databases
◼ Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue
◼ Redundant attributes can often be detected by correlation analysis and covariance analysis
◼ Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies and improve mining speed and quality
16
Correlation Analysis (Numeric Data)
◼ Correlation coefficient (also called Pearson’s product-moment coefficient)

\[
r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} (a_i b_i) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}
\]

where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(ai bi) is the sum of the AB cross-product.
◼ If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
◼ rA,B = 0: no linear correlation (zero correlation alone does not establish independence)
◼ rA,B < 0: negatively correlated
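As a rough sketch, the coefficient can be computed directly from its definition. The function name `pearson_r` and the sample lists are illustrative assumptions. The standard deviations use n − 1 in the denominator to match the (n − 1) factor in the slide's formula.

```python
import math


def pearson_r(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample standard deviations (n - 1 denominator), matching the formula.
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / ((n - 1) * sd_a * sd_b)


positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])        # close to +1
negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])        # close to -1
# b = (a - 3)^2 depends on a entirely, yet the linear correlation is zero.
curved = pearson_r([1, 2, 3, 4, 5], [4, 1, 0, 1, 4])
print(positive, negative, curved)
```

The third case is a reminder that a coefficient of zero rules out a linear relationship only. During redundancy detection, an attribute pair like this would not be flagged even though one attribute is fully derivable from the other.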
17
Visually Evaluating Correlation

18
Visually Evaluating Correlation
Scatter plots showing the similarity from –1 to 1.

19
Data Reduction Strategies
◼ Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but produces the same (or almost the same) analytical results
◼ Why data reduction?
A database/data warehouse may store terabytes of data. Complex data analysis
may take a very long time to run on the complete data set.
◼ Data reduction strategies
◼ Dimensionality reduction, e.g., remove unimportant attributes

◼ Principal Components Analysis (PCA)

◼ Feature subset selection, feature creation

◼ Numerosity reduction (some simply call it: Data Reduction)

◼ Regression

◼ Histograms, clustering, sampling

◼ Data cube aggregation

◼ Data compression

21
Data Reduction 1: Dimensionality Reduction
◼ Curse of dimensionality
◼ When dimensionality increases, data becomes increasingly sparse
◼ Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
◼ Dimensionality reduction
◼ Avoid the curse of dimensionality
◼ Help eliminate irrelevant features and reduce noise
◼ Reduce time and space required in data mining
◼ Allow easier visualization
◼ Dimensionality reduction techniques
◼ Principal Component Analysis (PCA)
◼ Supervised and nonlinear techniques (e.g., feature selection)

22
Principal Component Analysis (PCA)
◼ Find a projection that captures the largest amount of variation in data
◼ The original data are projected onto a much smaller space, resulting in
dimensionality reduction. We find the eigenvectors of the covariance matrix,
and these eigenvectors define the new space
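A minimal sketch of the idea, assuming small in-memory data: it builds the covariance matrix and finds only the first principal component by power iteration, a swapped-in technique chosen to avoid a linear algebra dependency. A full PCA would compute all eigenvectors of the covariance matrix. The function names and sample points are hypothetical.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (means, unit vector) for the direction of largest variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeated multiplication converges to the dominant
    # eigenvector, provided the start vector is not orthogonal to it.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return means, v


def project(rows, means, v):
    """Map each row to its single coordinate along the component v."""
    return [sum((r[j] - means[j]) * v[j] for j in range(len(v))) for r in rows]


# Hypothetical 2-D points lying exactly on the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
means, component = first_principal_component(points)
reduced = project(points, means, component)
print(component, reduced)
```

Because the sample points lie on a single line, one coordinate per point preserves all of the variation, so the two attributes reduce to one without loss.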

24
Attribute Subset Selection
◼ Another way to reduce dimensionality of data
◼ Redundant attributes
◼ Duplicate much or all of the information contained in one or more other attributes
◼ E.g., purchase price of a product and the amount of sales tax
paid
◼ Irrelevant attributes
◼ Contain no information that is useful for the data mining
task at hand
◼ E.g., students' ID is often irrelevant to the task of predicting
students' GPA

25
Data Reduction 2: Numerosity Reduction
◼ Reduce data volume by choosing alternative, smaller forms of
data representation
◼ Parametric methods (e.g., regression)
◼ Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the data
◼ Non-parametric methods
◼ Do not assume models

◼ Major families: histograms, clustering, sampling, …

26
Parametric Data Reduction: Regression and Log-Linear Models
◼ Linear regression: y = mx + c
◼ Data modeled to fit a straight line
◼ Often uses the least-squares method to fit the line
◼ Multiple regression: y = b0 + b1 X1 + b2 X2 + … + bn Xn
◼ Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
[Figures: fitted regression lines; x-axis: independent variable, y-axis: dependent variable]

27
Regression Analysis
◼ Regression analysis: A collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (Y) (also called response variable or measurement) and of one or more independent variables (X) (also called explanatory variables or predictors)
◼ The parameters are estimated to give a "best fit" of the data
◼ Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used
◼ Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships
[Figure: observed point (X1, Y1) and its fitted value Y1’ on the line y = x + 1]

28
Regression Analysis Models
◼ Linear regression: Y = w X + b
◼ Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
◼ Apply the least squares criterion to the known values of Y1, Y2, …, X1, X2, …
◼ Multiple regression: Y = b0 + b1 X1 + b2 X2
◼ Many nonlinear functions can be transformed into the above

Follow Slide 27
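The least squares estimates for the simple linear model Y = w X + b have a closed form, sketched below. The function name `fit_line` and the sample points are assumptions for illustration. As a parametric reduction, only the pair (w, b) would need to be stored in place of the raw points.

```python
def fit_line(xs, ys):
    """Least squares estimates (w, b) for the model y = w * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - w * mean_x
    return w, b


# Hypothetical points lying exactly on y = 2x + 1.
w, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
predicted = w * 10 + b  # estimate y at x = 10 from the two stored parameters
print(w, b, predicted)
```

The sketch assumes the x values are not all identical; otherwise the variance term is zero and the slope is undefined.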

29
Histogram Analysis
◼ Divide data into buckets and store the average (sum) for each bucket
◼ Partitioning rules:
◼ Equal-width: equal bucket range
◼ Equal-frequency (or equal-depth)
[Figure: histogram with bucket values from 10,000 to 100,000 on the x-axis and counts from 0 to 40 on the y-axis]
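A small sketch of histogram-based reduction with equal-width buckets: each bucket keeps only its range, a count, and a sum, from which the average follows. The function name, the bucket layout, and the choice of three buckets are assumptions. The values are the sorted price list used later in the binning slide.

```python
def equal_width_histogram(values, n_buckets):
    """Summarize values as equal-width buckets holding a count and a sum."""
    low, high = min(values), max(values)
    width = (high - low) / n_buckets  # assumes the values are not all equal
    buckets = [
        {"low": low + i * width, "high": low + (i + 1) * width, "count": 0, "sum": 0}
        for i in range(n_buckets)
    ]
    for v in values:
        # The maximum value is placed in the last bucket.
        i = min(int((v - low) / width), n_buckets - 1)
        buckets[i]["count"] += 1
        buckets[i]["sum"] += v
    return buckets


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
histogram = equal_width_histogram(prices, 3)
for bucket in histogram:
    average = bucket["sum"] / bucket["count"] if bucket["count"] else None
    print(bucket["low"], bucket["high"], bucket["count"], average)
```

Twelve values shrink to three bucket summaries. The uneven counts show that equal-width buckets do not balance the number of values per bucket.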

30
Clustering
◼ Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
◼ Can be very effective if data is clustered
◼ Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
◼ There are many choices of clustering definitions and clustering
algorithms

31
Sampling

◼ Sampling: obtaining a small sample s to represent the whole data set N

◼ Allow a mining algorithm to run in complexity that is potentially sub-
linear to the size of the data
◼ Key principle: Choose a representative subset of the data
◼ Simple random sampling may have very poor performance in the
presence of skew
◼ Develop adaptive sampling methods, e.g., stratified sampling

32
Types of Sampling

◼ Simple random sampling
◼ There is an equal probability of selecting any item
◼ Sampling without replacement
◼ Once an object is selected, it is removed from the population
◼ Sampling with replacement
◼ A selected object is not removed from the population
◼ Stratified sampling:
◼ Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
◼ Used in conjunction with skewed data
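The three sampling schemes can be sketched with Python's standard `random` module. The helper names, the fixed seed, the 90/10 population split, and the rule of keeping at least one item per stratum are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def srswor(data, n, rng):
    """Simple random sample without replacement: no item is drawn twice."""
    return rng.sample(data, n)


def srswr(data, n, rng):
    """Simple random sample with replacement: items may repeat."""
    return rng.choices(data, k=n)


def stratified_sample(data, stratum_of, fraction, rng):
    """Draw roughly the same fraction from every stratum."""
    strata = defaultdict(list)
    for item in data:
        strata[stratum_of(item)].append(item)
    sample = []
    for members in strata.values():
        # Keep at least one item so that small strata are not lost.
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


rng = random.Random(42)  # fixed seed so the sketch is repeatable
# Hypothetical skewed population: 90 "young" customers and 10 "senior" ones.
population = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]

without = srswor(population, 10, rng)
with_repl = srswr(population, 10, rng)
stratified = stratified_sample(population, lambda item: item[0], 0.1, rng)
print(Counter(group for group, _ in stratified))
```

A simple random sample of 10 may contain no senior customers at all, while the stratified sample always keeps the 9:1 proportion. That is the motivation for using stratification on skewed data.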

33
Sampling: With or without Replacement

Raw Data
34
Sampling: Cluster or Stratified Sampling

Raw Data Cluster/Stratified Sample

35
Data Cube Aggregation

◼ Data Cube
◼ A data cube is a multidimensional array or structure used to represent data in a way that
supports fast querying and analysis, typically in the context of Online Analytical
Processing (OLAP). It is designed to help with the aggregation, summarization, and
analysis of large datasets by organizing them across multiple dimensions. These
dimensions could represent different attributes or categories of the data, like time,
geography, products, etc. For example, Customer Location dimension, Product
dimension, and Time dimension.

36
37

Data Cube Aggregation

◼ The lowest level of a data cube (base cuboid)

◼ The aggregated data for an individual entity of interest
◼ E.g., a customer in the context of a phone calling data
warehouse
◼ Multiple levels of aggregation in data cubes
◼ Further reduce the size of data to deal with
◼ Reference appropriate levels
◼ Use the smallest representation which is enough to solve the
task
◼ Queries regarding aggregated information should be answered
using data cube, when possible
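Aggregation at different levels of a cube can be sketched as grouping by a chosen subset of dimensions. The function name `aggregate`, the record fields, and the call records themselves are hypothetical; a real OLAP engine would precompute and index these cuboids.

```python
from collections import defaultdict


def aggregate(records, dims, measure):
    """Sum a measure over every combination of the chosen dimensions."""
    cells = defaultdict(int)
    for r in records:
        cells[tuple(r[d] for d in dims)] += r[measure]
    return dict(cells)


# Hypothetical phone call records (the detailed data behind the cube).
calls = [
    {"customer": "c1", "quarter": "Q1", "minutes": 120},
    {"customer": "c1", "quarter": "Q2", "minutes": 80},
    {"customer": "c2", "quarter": "Q1", "minutes": 200},
    {"customer": "c2", "quarter": "Q1", "minutes": 50},
]

per_customer_quarter = aggregate(calls, ["customer", "quarter"], "minutes")
per_customer = aggregate(calls, ["customer"], "minutes")  # roll up over quarter
grand_total = aggregate(calls, [], "minutes")             # most aggregated level
print(per_customer_quarter, per_customer, grand_total)
```

A query about total minutes per customer can be answered from the two-cell `per_customer` result, without touching the detailed records. This is the sense in which the smallest sufficient representation should be used.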
38

Data Cube Aggregation

Data Reduction 3: Data Compression
◼ String compression
◼ There are extensive theories and well-tuned algorithms
◼ Typically lossless, but only limited manipulation is possible without expansion
◼ Audio/video compression
◼ Typically lossy compression, with progressive refinement
◼ Sometimes small fragments of signal can be reconstructed without reconstructing the whole
◼ Dimensionality and numerosity reduction may also be considered as forms of data compression

39
Data Compression

[Figure: Original Data and Compressed Data, linked by lossless compression; Original Data Approximated]

40
Data Transformation
◼ A function that maps the entire set of values of a given attribute to a new set of
replacement values s.t. each old value can be identified with one of the new values
◼ Methods
◼ Smoothing: Remove noise from data
◼ Attribute/feature construction
◼ New attributes constructed from the given ones
◼ Aggregation: Summarization, data cube construction
◼ Normalization: Scaled to fall within a smaller, specified range
◼ min-max normalization
◼ z-score normalization
◼ normalization by decimal scaling
◼ Discretization: Concept hierarchy climbing

42
Normalization
◼ Min-max normalization: to [new_minA, new_maxA]
\[
v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A
\]
◼ Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
\[
\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716
\]
◼ Z-score normalization (μ: mean, σ: standard deviation):
\[
v' = \frac{v - \mu_A}{\sigma_A}
\]
◼ Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to
\[
\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225
\]
◼ Normalization by decimal scaling
\[
v' = \frac{v}{10^{j}}
\]
where j is the smallest integer such that max(|v'|) < 1
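The three normalization methods translate almost line for line into code. The sketch below reuses the income figures from the slide's arithmetic (73,600 with range 12,000 to 98,000, mean 54,000, standard deviation 16,000). The function names and the decimal scaling sample values are assumptions.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide by 10**j for the smallest j that makes every |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


income_min_max = min_max(73_600, 12_000, 98_000)   # about 0.716
income_z = z_score(73_600, 54_000, 16_000)         # about 1.225
scaled, j = decimal_scaling([-986, 917])           # j is 3 for these values
print(round(income_min_max, 3), income_z, scaled, j)
```

Min-max normalization needs the attribute's minimum and maximum in advance, so a later value outside that range falls outside the target interval. Z-score normalization avoids this, which makes it the usual choice when the true bounds are unknown or outliers dominate.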
43
Discretization
◼ Three types of attributes
◼ Nominal: values from an unordered set, e.g., color, profession
◼ Ordinal: values from an ordered set, e.g., military or academic rank
◼ Numeric: quantitative values, e.g., integer or real numbers
◼ Discretization: Divide the range of a continuous attribute into intervals
◼ Interval labels can then be used to replace actual data values
◼ Reduce data size by discretization

44
Data Discretization Methods
◼ Typical methods: All the methods can be applied recursively
◼ Binning
◼ Top-down split, unsupervised
◼ Histogram analysis
◼ Top-down split, unsupervised
◼ Clustering analysis (unsupervised, top-down split or bottom-
up merge)
◼ Decision-tree analysis (supervised, top-down split)
◼ Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)

45
Simple Discretization: Binning

◼ Equal-width (distance) partitioning

◼ Divides the range into N intervals of equal size: uniform grid
◼ if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B –A)/N.
◼ The most straightforward, but outliers may dominate presentation
◼ Skewed data is not handled well

◼ Equal-depth (frequency) partitioning

◼ Divides the range into N intervals, each containing approximately the same number of samples
◼ Good data scaling
◼ Managing categorical attributes can be tricky
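The contrast between the two partitioning schemes shows up clearly on skewed data. In this sketch, the function names and the sample values (with a single outlier of 100) are assumptions chosen to illustrate the point about outliers.

```python
def equal_width_bins(values, n):
    """Split the value range into n intervals of width W = (B - A) / n."""
    low, high = min(values), max(values)
    width = (high - low) / n  # assumes the values are not all equal
    bins = [[] for _ in range(n)]
    for v in sorted(values):
        # The maximum value is placed in the last interval.
        bins[min(int((v - low) / width), n - 1)].append(v)
    return bins


def equal_depth_bins(values, n):
    """Split the sorted values into n bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


skewed = [1, 2, 2, 3, 3, 4, 5, 100]
by_width = equal_width_bins(skewed, 4)
by_depth = equal_depth_bins(skewed, 4)
print(by_width)
print(by_depth)
```

With equal width, the outlier stretches the range so that seven of the eight values land in the first interval and two intervals stay empty. Equal depth keeps two values per bin regardless of the skew.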
46
Simple Discretization: Binning

47
Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
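The worked example above can be reproduced with a short sketch. The function names are hypothetical, and two details are assumptions not stated on the slide: bin means are rounded to whole dollars, and a value equally close to both boundaries goes to the lower one.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Partition sorted values into n_bins bins of equal size.

    Assumes the number of values is divisible by n_bins.
    """
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value with the rounded mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(bins)
print(by_means)
print(by_boundaries)
```

The bin 2 mean is 22.75 and the bin 3 mean is 29.25, which is why the slide shows 23 and 29 after rounding.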

48
Summary
◼ Data quality: accuracy, completeness, consistency, timeliness, believability,
interpretability
◼ Data cleaning: e.g., missing/noisy values, outliers
◼ Data integration from multiple sources:
◼ Entity identification problem

◼ Remove redundancies

◼ Detect inconsistencies

◼ Data reduction
◼ Dimensionality reduction

◼ Numerosity reduction

◼ Data compression

◼ Data transformation and data discretization

◼ Normalization

50

