2-Data Fundamentals For BI - Part1
BI in a Business
Agenda
Data Quality: Why Preprocess the Data?
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation
Data Cleaning
Data Cleaning: Incomplete (Missing) Data
How to Handle Missing Data?
◼ Ignore the tuple: delete the entire record (row) if it has missing data; not effective when the percentage of missing values per attribute varies considerably
◼ Fill in the missing value manually: examine each missing value and look up the correct information. This is accurate but very time-consuming, and often infeasible when you have a lot of data.
◼ Fill it in automatically with
◼ a global constant, e.g., "unknown" (beware: this effectively creates a new class!)
◼ the attribute mean (average)
◼ the attribute mean for all samples belonging to the same class (smarter)
◼ the most probable value: predict the most likely value from the other attributes using inference-based methods such as a Bayesian formula or a decision tree
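A minimal sketch of the automatic fill-in strategies in pandas, assuming a numeric income attribute and a cls class label (both hypothetical names chosen for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cls":    ["a", "a", "b", "b", "b"],
    "income": [50.0, np.nan, 30.0, 35.0, np.nan],
})

# Global constant: mark the gap with a sentinel value
by_constant = df["income"].fillna(-1)

# Attribute mean over all tuples
by_mean = df["income"].fillna(df["income"].mean())

# Attribute mean within each class (the "smarter" variant)
by_class_mean = df["income"].fillna(
    df.groupby("cls")["income"].transform("mean")
)
```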
Data Cleaning: Noisy Data
◼ Noise: random error or variance in measured data; one common cause is technology limitations
How to Handle Noisy Data?
◼ Binning
◼ first sort the data, then partition it into bins, either equal-frequency bins or value ranges as in the example below
◼ Example ages: 10, 12, 15, 18, 20, 22, 25, 28, 30. You could create three bins: (10-19), (20-29), (30+).
◼ Smoothing by bin means: Replace each age in the 10-19 bin
with the average age of that bin (which would be around 14).
Do the same for the other bins.
◼ Smoothing by bin median: Replace each age in the 10-19
bin with the middle age of that bin.
◼ Smoothing by bin boundaries: Replace each age in the 10-
19 bin with the closest bin boundary (either 10 or 19).
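A minimal sketch of the three smoothing variants on the ages above (plain Python; the open-ended 30+ bin is treated here as the single interval [30, 30]):

```python
from statistics import mean, median

ages  = [10, 12, 15, 18, 20, 22, 25, 28, 30]   # already sorted
edges = [(10, 19), (20, 29), (30, 30)]         # the three bins above

for lo, hi in edges:
    b = [a for a in ages if lo <= a <= hi]
    by_mean     = [round(mean(b), 2)] * len(b)                 # bin means
    by_median   = [median(b)] * len(b)                         # bin medians
    by_boundary = [lo if a - lo <= hi - a else hi for a in b]  # closest boundary
    print(b, "->", by_mean, by_median, by_boundary)
```

For the (10-19) bin this prints a mean of 13.75, matching the "around 14" figure above.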
How to Handle Noisy Data?
◼ Regression
◼ smooth by fitting the data to a regression function (see the sketch after this list)
◼ Clustering
◼ groups similar data points together
◼ detects outliers, which typically lie far away from all clusters, so they can be removed
◼ Combined computer and human inspection
◼ detect suspicious values automatically and have a human check them (e.g., deal with possible outliers)
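A minimal sketch of regression-based smoothing with residual-based outlier flagging, using only NumPy; the two-standard-deviation threshold is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=10)
y[4] += 8.0                                  # inject one noisy point

slope, intercept = np.polyfit(x, y, deg=1)   # fit a regression line
smoothed = slope * x + intercept             # smoothed values on that line

residuals = y - smoothed
outliers = np.abs(residuals) > 2 * residuals.std()   # far from the fit
print("suspicious points:", np.where(outliers)[0])
```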
Data Cleaning as a Process
◼ It's like detective work: first detecting errors and discrepancies, then fixing them.
Data Integration
Clinton" in another
◼ Detecting and resolving data value conflicts: For the same real-
world entity, attribute values from different sources are different.
◼ E.g., One database might say a customer's age is 30, while another
15
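A minimal sketch of detecting such value conflicts after joining two sources on a shared key, in pandas; all table and column names here are hypothetical:

```python
import pandas as pd

crm     = pd.DataFrame({"cust_id": [1, 2, 3], "age": [30, 45, 27]})
billing = pd.DataFrame({"cust_id": [1, 2, 3], "age": [32, 45, 27]})

merged = crm.merge(billing, on="cust_id", suffixes=("_crm", "_billing"))
conflicts = merged[merged["age_crm"] != merged["age_billing"]]
print(conflicts)   # rows where the two sources disagree on age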
Handling Redundancy: Correlation Analysis (Nominal Data)
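The body of this slide is not preserved here. For nominal attributes, the usual redundancy check is a chi-squared (χ²) test of independence on a contingency table; a minimal sketch assuming SciPy is available, with illustrative counts:

```python
from scipy.stats import chi2_contingency

# Illustrative contingency table: rows = plays chess / doesn't,
# columns = likes science fiction / doesn't
observed = [[250, 200],
            [50, 1000]]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, p = {p:.3g}")
# A large chi2 (tiny p) rejects independence: the attributes are correlated
```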
Handling Redundancy: Correlation Analysis (Numeric Data)
$$ r_{A,B} \;=\; \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A\,\sigma_B} \;=\; \frac{\sum_{i=1}^{n} a_i b_i \;-\; n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A\,\sigma_B} $$

◼ where n is the number of tuples,
◼ $\bar{A}$ and $\bar{B}$ are the respective means of A and B,
◼ $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B,
◼ $\sum a_i b_i$ is the sum of the AB cross-products.
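A minimal sketch of the shortcut form of the formula in NumPy, checked against the built-in np.corrcoef (the sample values are the stock prices used in the covariance example later in this deck):

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])
n = len(a)

# (sum of cross-products - n * mean_A * mean_B) / ((n-1) * sd_A * sd_B)
r = (np.sum(a * b) - n * a.mean() * b.mean()) / (
    (n - 1) * a.std(ddof=1) * b.std(ddof=1)
)

assert np.isclose(r, np.corrcoef(a, b)[0, 1])
print(r)   # about 0.94: a strong positive linear relationship
```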
Handling Redundancy: Correlation Analysis (Numeric Data)
◼ Scatter plots illustrate r values ranging from −1 to 1.
◼ The closer r is to −1 or 1, the stronger the linear relationship.
◼ The closer r is to 0, the weaker the linear relationship.
Correlation (viewed as linear relationship)
◼ Correlation measures the linear relationship between
objects
◼ To compute correlation, we standardize data objects, A
and B, and then take their dot product
correlation(A, B) = A′ • B′
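A minimal sketch of this view in NumPy. Note that for the result to land in [−1, 1] the dot product has to be averaged over the n tuples (equivalently, fold 1/√n into the standardization):

```python
import numpy as np

def standardize(v):
    return (v - v.mean()) / v.std()   # z-scores, population std

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

r = np.dot(standardize(A), standardize(B)) / len(A)
assert np.isclose(r, np.corrcoef(A, B)[0, 1])   # same as Pearson's r
```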
Handling Redundancy: Covariance (Numeric Data)
◼ Covariance: $\mathrm{Cov}(A,B) = E[(A-\bar{A})(B-\bar{B})] = \frac{1}{n}\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})$
◼ Correlation coefficient: $r_{A,B} = \dfrac{\mathrm{Cov}(A,B)}{\sigma_A\,\sigma_B}$
◼ A positive covariance means A and B tend to rise above (or fall below) their means together; a negative covariance means they tend to move in opposite directions.
Handling Redundancy: Covariance (Numeric Data)
◼ Suppose two stocks A and B have the following prices in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
◼ Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
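A minimal check of this example in NumPy, using Cov(A, B) = E(AB) − E(A)E(B):

```python
import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])     # stock A's prices
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])  # stock B's prices

cov = np.mean(A * B) - A.mean() * B.mean()  # E(AB) - E(A)E(B)
print(cov)  # 4.0
```

Since the covariance is positive (4.0 > 0), the answer is yes: the two prices tend to rise and fall together.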
Data Reduction
Data Reduction Strategies
◼ Dimensionality reduction, e.g., wavelet transforms
◼ Numerosity reduction
◼ Data compression
Data Reduction: Dimensionality Reduction
◼ Curse of dimensionality
◼ When dimensionality increases, data becomes increasingly sparse
◼ Density and distances between points, which are critical to clustering and outlier analysis, become less meaningful (see the sketch after this list)
◼ The possible combinations of subspaces will grow exponentially
◼ It becomes difficult to find meaningful patterns in the data.
◼ It requires more storage space and processing power.
◼ Dimensionality reduction benefits:
◼ Avoid the curse of dimensionality
◼ Help eliminate irrelevant features and reduce noise
◼ Reduce time and space required in data mining
◼ Allow easier visualization
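A minimal sketch illustrating the sparsity point above: for uniform random data with a fixed sample size, the relative gap between the nearest and farthest neighbor of a point shrinks as the number of dimensions grows, so distances carry less and less information:

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000):
    X = rng.random((500, d))                        # 500 random points
    dists = np.linalg.norm(X[1:] - X[0], axis=1)    # distances from point 0
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d = {d:4d}   relative distance contrast = {contrast:.2f}")
```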
Wavelet Transformation
◼ Wavelet transform: decomposes a signal into components at different scales, which makes localized changes easy to detect and to compress
◼ Examples:
◼ Image: Wavelets are excellent for image compression (like JPEG 2000). They can identify sharp changes in an image (like edges) very efficiently.
◼ Audio: If you have a recording with a sudden loud noise, wavelets can pinpoint exactly when that noise occurred.
◼ Medical: Wavelets are used in analyzing EEG data, where sudden spikes or changes in the signal can be important.
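A minimal sketch of a one-level Haar transform, assuming the PyWavelets package (pywt) is installed. Most of the signal's energy lands in a few coefficients, which is what makes wavelet-based compression work:

```python
import numpy as np
import pywt

signal = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])

cA, cD = pywt.dwt(signal, "haar")   # approximation and detail coefficients
print("approximation:", cA)         # the smooth trend
print("detail:       ", cD)         # localized changes (edges, spikes)

# The transform is lossless: the original signal is perfectly reconstructed
assert np.allclose(pywt.idwt(cA, cD, "haar"), signal)
```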
[Figure: the Haar-2 and Daubechies-4 wavelet functions]