2-Data Fundamentals For BI - Part 1

The document outlines the fundamentals of data in business intelligence, focusing on data types, sources, quality, governance, and compliance. It emphasizes the importance of data preprocessing tasks such as cleaning, integration, reduction, and transformation to ensure data quality and reliability. Key techniques for handling missing and noisy data, as well as methods for data integration and reduction, are discussed to improve data analysis outcomes.

Data Fundamentals for

BI in a Business

1
Agenda

• Understanding data in the context of business


operations
• Data types, sources, and quality
• Data governance and compliance considerations
for business

2
Data Quality: Why Preprocess the Data?

◼ Measures for data quality (checking whether the data are good or bad):
A multidimensional view
◼ Accuracy: correct or wrong, accurate or not
◼ Completeness: not recorded, unavailable, …
◼ Consistency: some values modified but others not, dangling references, …
◼ Timeliness: is the data updated in a timely manner?
◼ Believability: how much the data are trusted to be correct
◼ Interpretability: how easily the data can be understood

3
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation

4
Data Cleaning

◼ Data in the Real World Is Dirty: Lots of potentially incorrect data,


e.g., faulty instruments, human or computer error, transmission errors
◼ incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
◼ e.g., Occupation=“ ” (missing data)
◼ noisy: containing noise, errors, or outliers
◼ e.g., Salary=“−10” (an error)
◼ inconsistent: containing discrepancies in codes or names, e.g.,
◼ Age=“42”, Birthday=“03/07/2010”
◼ Was rating “1, 2, 3”, now rating “A, B, C”
◼ Intentional (e.g., disguised missing data)
◼ Jan. 1 as everyone’s birthday?

5
Data Cleaning: Incomplete (Missing) Data

◼ Missing data may be due to


◼ equipment malfunction
◼ data deleted because it was inconsistent with other recorded data
◼ data not entered due to misunderstanding
◼ certain data may not have been considered important at the
time of entry
◼ history or changes of the data were not recorded
◼ Missing data may need to be inferred

6
How to Handle Missing Data?
◼ Ignore the tuple: deleting the entire record (row) if it has missing
data — not effective when the % of missing values per attribute varies
considerably
◼ Fill in the missing value manually: looking at each missing value
and trying to find the correct information. This is accurate but very
time-consuming and often impossible if you have a lot of data.
◼ Fill it in automatically with
◼ a global constant: e.g., “unknown” (which effectively creates a new class?!)
◼ the attribute mean (average)
◼ the attribute mean for all samples belonging to the same class
(smarter)
◼ the most probable value: predict the most likely value from the
other attributes using an inference-based method such as a Bayesian
formula or a decision tree

7
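As a rough illustration (not from the slides), the automatic fill strategies above can be sketched with pandas; the DataFrame and its column names are invented for the example:

import pandas as pd

# Hypothetical table with missing salary values
df = pd.DataFrame({
    "occupation": ["engineer", "engineer", "teacher", "teacher", "teacher"],
    "salary": [5000.0, None, 3000.0, None, 3200.0],
})

# Global constant (a sentinel for numeric attributes, or a label such as "unknown")
df["salary_const"] = df["salary"].fillna(-1)

# Attribute mean
df["salary_mean"] = df["salary"].fillna(df["salary"].mean())

# Attribute mean per class (here, per occupation) -- usually smarter
df["salary_class_mean"] = df["salary"].fillna(
    df.groupby("occupation")["salary"].transform("mean"))

print(df)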
Data Cleaning: Noisy Data

◼ Noise: random error or variance in a measured variable.


It makes the data less clear and harder to analyze

◼ Incorrect attribute values may be due to


◼ faulty data collection instruments

◼ data entry problems

◼ data transmission problems

◼ technology limitation

◼ inconsistency in naming convention

8
How to Handle Noisy Data?

◼ Binning
◼ first sort the data and partition it into (equal-frequency or equal-width) bins

◼ Example: ages: 10, 12, 15, 18, 20, 22, 25, 28, 30. You could
create three bins: (10-19) (20-29) (30+).
◼ Smoothing by bin means: Replace each age in the 10-19 bin
with the average age of that bin (which would be around 14).
Do the same for the other bins.
◼ Smoothing by bin median: Replace each age in the 10-19
bin with the middle age of that bin.
◼ Smoothing by bin boundaries: Replace each age in the 10-
19 bin with the closest bin boundary (either 10 or 19).

9
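A small sketch of the three smoothing variants applied to the ages example above; the bin edges follow the slide (with "30+" capped at 39 just for the sketch):

import numpy as np

ages = np.array([10, 12, 15, 18, 20, 22, 25, 28, 30])
bins = [(10, 19), (20, 29), (30, 39)]

by_mean, by_median, by_boundary = [], [], []
for low, high in bins:
    members = ages[(ages >= low) & (ages <= high)]
    for a in members:
        by_mean.append(round(float(members.mean()), 2))           # smoothing by bin means
        by_median.append(float(np.median(members)))               # smoothing by bin medians
        by_boundary.append(low if a - low <= high - a else high)  # closest bin boundary

print(by_mean)      # [13.75, 13.75, 13.75, 13.75, 23.75, 23.75, 23.75, 23.75, 30.0]
print(by_boundary)  # [10, 10, 19, 19, 20, 20, 29, 29, 30]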
How to Handle Noisy Data?

◼ Regression
◼ smooth by fitting the data into regression functions

◼ Clustering
◼ Groups similar data points together

◼ detect and remove outliers (which often fall far away from the
clusters)
◼ Combined computer and human inspection
◼ detect suspicious values and check by human (e.g., deal

with possible outliers)

10
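As an illustrative sketch of smoothing by regression and of flagging suspicious values for human inspection (the data, the linear fit, and the 2-sigma threshold are all arbitrary choices):

import numpy as np

# Hypothetical noisy measurements y taken at times x; 30.0 looks like an outlier
x = np.arange(1.0, 9.0)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 30.0, 14.1, 15.9])

# Regression: fit a straight line and replace each value with the fitted (smoothed) one
slope, intercept = np.polyfit(x, y, deg=1)
y_smoothed = slope * x + intercept

# Combined computer and human inspection: flag points far from the fit for review
residuals = y - y_smoothed
suspicious = np.abs(residuals) > 2 * residuals.std()
print(np.round(y_smoothed, 1))
print(suspicious)   # only the 30.0 reading is flagged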
Data Cleaning as a Process
◼ It's like detective work: find the errors, then fix them.

◼ Data discrepancy detection: finding data that doesn't make sense


◼ Use metadata (e.g., domain, range, dependency, distribution)
◼ Check field overloading: one field is used to store multiple
pieces of information. (A "contact" field might contain both a phone
number and an email address)
◼ Check uniqueness rule (Each customer ID should be unique),
consecutive rule (numbers should usually be in sequence) and null
rule (Certain fields, like "required fields", should not be empty)
◼ Use commercial tools designed to find data discrepancies
◼ Data scrubbing: use simple domain knowledge (e.g., postal codes,
spell-check) to find and fix errors
◼ Data auditing: analyze the data to discover rules and relationships and
detect violators (e.g., correlation and clustering to find outliers)

11
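A hedged pandas sketch of the uniqueness, null, and domain-range checks described above; the customer table, column names, and the valid age range are invented for illustration:

import pandas as pd

customers = pd.DataFrame({
    "cust_id": [101, 102, 102, 104],
    "email":   ["a@x.com", None, "b@x.com", "c@x.com"],
    "age":     [34, 29, 29, 230],
})

# Uniqueness rule: each cust_id should appear only once
duplicate_ids = customers[customers["cust_id"].duplicated(keep=False)]

# Null rule: required fields (here, email) should not be empty
missing_email = customers[customers["email"].isna()]

# Domain/range check driven by metadata (assumed valid range 0-120)
bad_age = customers[~customers["age"].between(0, 120)]

print(duplicate_ids, missing_email, bad_age, sep="\n\n")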
Data Cleaning as a Process

◼ Data migration and integration: moving and combining data from


different sources.
◼ Data migration tools: transfer data and make changes during
the transformation (A tool might convert dates from one format to
another during migration)
◼ ETL (Extraction/Transformation/Loading) tools: allow users
to specify transformations through a graphical user interface

12
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation

13
Data Integration

◼ Data integration: Combines data from multiple sources into a


coherent store
◼ Schema integration: matching up the structure of the data from
different sources, e.g., A.cust-id ≡ B.cust-#
◼ Entity identification problem: deciding whether records from different
sources refer to the same real-world entity
◼ e.g., is "Bill Clinton" in one database the same person as "William
Clinton" in another?
◼ Detecting and resolving data value conflicts: for the same real-
world entity, attribute values from different sources differ
◼ e.g., one database might say a customer's age is 30, while another
says it is 32


◼ Possible reasons: different representations, different scales, e.g.,

metric vs. British units


14
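A toy sketch of resolving a data value conflict caused by different scales (metric vs. British units); the sources, field names, and the idea of converting everything to kilograms are illustrative only:

def to_kilograms(value: float, unit: str) -> float:
    """Convert a weight to kilograms so both sources use the same scale."""
    factors = {"kg": 1.0, "lb": 0.45359237}
    return value * factors[unit]

source_a = {"cust_id": 17, "weight": 80.0, "unit": "kg"}    # metric source
source_b = {"cust_#": 17, "weight": 176.4, "unit": "lb"}    # British-unit source

w_a = to_kilograms(source_a["weight"], source_a["unit"])
w_b = to_kilograms(source_b["weight"], source_b["unit"])
print(round(w_a, 1), round(w_b, 1))   # both about 80.0 kg once the scales agree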
Handling Redundancy in Data Integration

◼ Redundant data occur often when integration of multiple


databases
◼ Object identification: The same attribute or object may have
different names in different databases
◼ Derivable data: One attribute may be calculated (derived) from
another attribute in a different database, e.g., annual revenue
◼ Redundant attributes may be detected by
correlation analysis (how strongly two things are related) and
covariance analysis (how two things change together)
◼ Careful integration of the data from multiple sources improves
mining speed and quality

15
Handling Redundancy: Correlation Analysis (Nominal Data)

◼ Χ2 (chi-square) test: is a statistical test that helps us understand if


there's a relationship between two categorical variables (variables
that can be divided into categories, like colors, types of fruit, or
survey responses).
χ² = Σ [(Observed − Expected)² / Expected]
◼ The larger the Χ2 value, the more likely the variables are related
(Strong Relationship)
◼ The cells that contribute the most to the Χ2 value are those whose
actual count is very different from the expected count
◼ Correlation does not imply causality (Just because two things are
related (correlated), it doesn't mean one causes the other.)
◼ # of hospitals and # of car thefts in a city are correlated
◼ Both are causally linked to the third variable: population
16
Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction         250 (90)        200 (360)         450
Not like science fiction      50 (210)      1000 (840)        1050
Sum (col.)                   300            1200              1500

◼ Numbers in parentheses are expected counts, calculated
from the marginal distributions of the two categories
χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
◼ It shows that like_science_fiction and play_chess are
correlated in the group

17
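The χ² value on this slide can be checked with SciPy's chi2_contingency; the Yates continuity correction is disabled so that the plain formula above is applied:

import numpy as np
from scipy.stats import chi2_contingency

# Rows: likes science fiction (yes/no); columns: plays chess (yes/no)
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(expected)        # [[ 90. 360.] [210. 840.]] -- the values shown in parentheses
print(round(chi2, 2))  # 507.93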
Handling Redundancy: Correlation Analysis (Numeric Data)

◼ Correlation coefficient (also called Pearson’s product


moment coefficient), is a statistical measure that expresses the
extent to which two variables are linearly related, meaning how much
they change together at a constant rate.

r(A, B) = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / [(n − 1) σA σB] = [Σᵢ (aᵢbᵢ) − n Ā B̄] / [(n − 1) σA σB]

◼ where n is the number of tuples,
◼ Ā and B̄ are the respective means of A and B,
◼ σA and σB are the respective standard deviations of A and B,
◼ Σ(aᵢbᵢ) is the sum of the AB cross-product.

18
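A small numpy sketch showing that the two equivalent forms of the formula give the same value as numpy's built-in corrcoef; the data values are arbitrary:

import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

n = len(A)
A_bar, B_bar = A.mean(), B.mean()
sA, sB = A.std(ddof=1), B.std(ddof=1)          # sample standard deviations

r_definition = np.sum((A - A_bar) * (B - B_bar)) / ((n - 1) * sA * sB)
r_shortcut = (np.sum(A * B) - n * A_bar * B_bar) / ((n - 1) * sA * sB)
r_numpy = np.corrcoef(A, B)[0, 1]

print(round(r_definition, 4), round(r_shortcut, 4), round(r_numpy, 4))   # all three agree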
Handling Redundancy: Correlation Analysis (Numeric Data)

• Positive Correlation (r > 0): As one variable increases, the other


tends to increase.
• Example: Height and weight are often positively correlated. Taller
people tend to be heavier. The closer r is to 1, the stronger the
positive relationship.

• Negative Correlation (r < 0): As one variable increases, the other


tends to decrease.
• Example: Temperature and hot chocolate sales might be
negatively correlated. Higher temperatures tend to mean fewer hot
chocolate sales.

• No Correlation (r = 0): The variables don't seem to have any linear


relationship.
• Example: Shoe size and IQ are likely to have a correlation close to
zero.
19
Visually Evaluating Correlation

◼ Scatter plots
showing the
similarity from –1
to 1.
◼ The closer r is to -
1 or 1, the
stronger the linear
relationship.
◼ The closer r is to
0, the weaker the
linear relationship.

20
Correlation (viewed as linear relationship)
◼ Correlation measures the linear relationship between
objects
◼ To compute correlation, we standardize data objects, A
and B, and then take their dot product

a′ₖ = (aₖ − mean(A)) / std(A)

b′ₖ = (bₖ − mean(B)) / std(B)

correlation(A, B) = A′ · B′

21
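A short numpy sketch of the standardize-then-dot-product view; note (my addition) that when sample standard deviations are used, the dot product is divided by n − 1 to recover Pearson's r. The data values are arbitrary:

import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# Standardize each object: subtract the mean, divide by the standard deviation
A_std = (A - A.mean()) / A.std(ddof=1)
B_std = (B - B.mean()) / B.std(ddof=1)

# Dot product of the standardized objects, scaled by n - 1, gives Pearson's r
r = np.dot(A_std, B_std) / (len(A) - 1)
print(round(r, 4), round(np.corrcoef(A, B)[0, 1], 4))   # the two values match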
Handling Redundancy: Covariance (Numeric Data)

◼ Covariance measures how two variables change together. It is similar to
correlation, but instead of standardizing the data (making the standard
deviation equal to one), it keeps the original scales of the variables.

Covariance: Cov(A, B) = E[(A − Ā)(B − B̄)] = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / n

Correlation coefficient: r(A, B) = Cov(A, B) / (σA σB)

◼ where n is the number of tuples,
◼ Ā and B̄ are the respective means (expected values) of A and B,
◼ σA and σB are the respective standard deviations of A and B.

22
Handling Redundancy: Covariance (Numeric Data)

• Positive covariance: If Cov(A, B) > 0, then when A is greater than its
average, B also tends to be greater than its average, and vice versa. They
move up or down together.
• Example: Height and weight might have positive covariance. Taller
people tend to be heavier.

• Negative covariance: If Cov(A, B) < 0, then when A is greater than its
average, B tends to be smaller than its average. They move in opposite
directions.
• Example: Temperature and hot chocolate sales might have negative
covariance. Higher temperatures tend to mean fewer hot chocolate
sales.

• Independence: If A and B are independent, then Cov(A, B) = 0; there is no
linear relationship, and knowing the value of A tells you nothing about B.
(The converse does not hold: zero covariance only rules out a linear relationship.)
• Example: Shoe size and IQ score likely have zero (or very close to
zero) covariance.
23
Co-Variance: An Example

◼ It can be simplified in computation as Cov(A, B) = E(A·B) − Ā·B̄

◼ Suppose two stocks A and B have the following prices in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).

◼ Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?

◼ E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4

◼ E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6

◼ Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4

◼ Thus, A and B rise together since Cov(A, B) > 0.
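The stock example can be verified numerically; the shortcut Cov(A, B) = E(A·B) − E(A)·E(B) and numpy's population covariance both give 4:

import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)      # stock A prices
B = np.array([5, 8, 10, 11, 14], dtype=float)   # stock B prices

E_A, E_B = A.mean(), B.mean()
cov_shortcut = np.mean(A * B) - E_A * E_B       # E(A*B) - E(A)E(B)
cov_numpy = np.cov(A, B, bias=True)[0, 1]       # population covariance (divide by n)

print(E_A, E_B)                                      # 4.0 9.6
print(round(cov_shortcut, 6), round(cov_numpy, 6))   # 4.0 4.0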


Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation

25
Data Reduction

◼ Data reduction: Obtain a reduced representation of the


data set that is much smaller in volume but yet produces
the same (or almost the same) analytical results

◼ Why data reduction? A database/data warehouse may


store terabytes of data. Complex data analysis may take a
very long time to run on the complete data set.

26
Data Reduction Strategies

◼ Data reduction strategies


◼ Dimensionality reduction, e.g., remove unimportant attributes

◼ Wavelet transforms

◼ Principal Components Analysis (PCA)

◼ Feature subset selection, feature creation

◼ Numerosity reduction (some simply call it: Data Reduction)

◼ Regression and Log-Linear Models

◼ Histograms, clustering, sampling

◼ Data cube aggregation

◼ Data compression

27
Data Reduction: Dimensionality Reduction

◼ Curse of dimensionality
◼ When dimensionality increases, data becomes increasingly sparse
◼ Density and distance between points, which are critical to clustering and outlier
analysis, become less meaningful
◼ The possible combinations of subspaces will grow exponentially
◼ It becomes difficult to find meaningful patterns in the data.
◼ It requires more storage space and processing power.
◼ Dimensionality reduction benefits:
◼ Avoid the curse of dimensionality
◼ Help eliminate irrelevant features and reduce noise
◼ Reduce time and space required in data mining
◼ Allow easier visualization

28
29

Data Reduction: Dimensionality Reduction

◼ Dimensionality reduction techniques:


◼ Wavelet transforms
◼ Principal Component Analysis
◼ Supervised and nonlinear techniques (e.g., feature
selection)
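A minimal numpy sketch of PCA via the SVD, showing how a redundant attribute collapses into a few strong components; the 100×5 data set is randomly generated for the example:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = 2 * X[:, 0] + 0.01 * rng.normal(size=100)   # a nearly redundant attribute

# Center the data, then take the top-k right singular vectors as principal components
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 2
X_reduced = X_centered @ Vt[:k].T       # data expressed in the top-k components
explained = (S ** 2) / np.sum(S ** 2)   # fraction of variance captured by each component

print(X_reduced.shape)        # (100, 2)
print(explained.round(3))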
Mapping Data to a New Space

◼ Wavelet transform

[Figure: two sine waves, the same waves with added noise, and their frequency-domain representation]

30
31

What Is Wavelet Transform?

◼ These are mathematical techniques


that can decompose data (signal)
into different frequency
components, to identify and discard
less important components.

◼ It is powerful for analyzing signals


that have sudden changes or
discontinuities. It breaks down a
signal into "wavelets," which are
localized waves of different scales.
32

What Is Wavelet Transform?

• The Wavelet Transform represents the signal in terms of these wavelets,


specifying both the frequencies present and when (or where) they occur.

• Example:
• Image: Wavelets are excellent for image compression (like JPEG
2000). They can identify sharp changes in an image (like edges) very
efficiently.
• Audio: If you have a recording with a sudden loud noise, wavelets
can pinpoint exactly when that noise occurred.
• Medical: Wavelets are used in analyzing EEG data, where sudden
spikes or changes in the signal can be important.
Wavelet Transformation
[Figure: Haar-2 and Daubechies-4 wavelet functions]

◼ Discrete wavelet transform (DWT) is a tool for analyzing signals


at different levels of detail. It can focus on the most important parts
of the signal and discard the rest, resulting in smaller file sizes without
significantly sacrificing quality. The DWT is used for:
◼ Linear signal processing, multi-resolution analysis: processing signals
(like audio, images, or sensor data) and looking at the signal at different
levels of detail (multi-resolution).
◼ Compressed approximation: storing only a small fraction of the
strongest wavelet coefficients (focusing on the most important
parts of the signal and discarding the less important ones).
◼ Lossy compression: some information is lost during compression, but the
DWT minimizes the impact on the perceived quality (e.g., how a song sounds).

33
34

Wavelet Transformation

◼ How the DWT Works

◼ Padding: The length of the data needs to be a power of 2. If it's


not, add zeros to the end (padding).

◼ Smoothing & Difference Functions: The DWT uses two special


functions: one that "smooths" the data (averaging nearby
values) and one that calculates the "difference" between nearby
values (like highlighting changes).

◼ Recursive Application: The DWT applies these two functions to


pairs of data points, then applies them again, recursively,
to the smoothed data until the result reaches the desired length (the desired
level of detail).
35

Wavelet Transformation
[Figure: Haar-2 and Daubechies-4 wavelet functions]

◼ The Haar wavelet is a specific type of wavelet that is used


in the Discrete Wavelet Transform (DWT).

◼ The Haar Wavelet Transform is a technique used to


decompose a signal (like a sound or image) into different
frequency components.
Wavelet Decomposition
◼ The table shows a simplified example of how the Haar transform works:

• Resolution: This indicates the level of detail. Higher resolution (8)


means more detail, lower resolution (1) means less detail (more
general).
• Averages: These represent the smoothed or averaged values of the
data at each resolution.
• Detail Coefficients: These capture the details or fluctuations that
were lost when smoothing the data.
36
Wavelet Decomposition
◼ Starting Data (Resolution 8): 8 data points, e.g., measurements or
samples of a signal such as a sound wave or temperature readings:
[2, 2, 0, 2, 3, 5, 4, 4].
◼ First Transformation (Resolution 4):
◼ Averages: The DWT applies a "smoothing" function to pairs of data

points and calculates the averages:


• (2+2)/2 = 2 (0+2)/2 = 1
• (3+5)/2 = 4 (4+4)/2 = 4
• So the averages: [2, 1, 4, 4].
◼ Detail Coefficients: The DWT also calculates the "differences" or
"details" lost in the smoothing process:
• (2-2)/2 = 0
• (0-2)/2 = -1
• (3-5)/2 = -1
• (4-4)/2 = 0

◼ the detail coefficients: [0, -1, -1, 0].


◼ Second Transformation (Resolution 2): …..
Wavelet Decomposition
◼ So S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to
S^ = [2¾, −1¼, ½, 0, 0, −1, −1, 0] = [2.75, −1.25, 0.5, 0, 0, −1, −1, 0]
◼ Drawing a tree for the Haar transform can be helpful to visualize the
process of decomposing a signal into different frequency components.
◼ The Structure: Hierarchical Decomposition
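A minimal sketch of the recursive average/difference (Haar) decomposition walked through above; it assumes the input length is already a power of 2 (so the padding step is skipped) and reproduces S^ for the example signal:

def haar_decompose(signal):
    data = list(signal)
    details = []
    while len(data) > 1:
        averages = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
        diffs = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
        details = diffs + details   # prepend, so coarser-level details come first
        data = averages             # recurse on the smoothed (averaged) data
    return data + details           # [overall average, coarse-to-fine detail coefficients]

S = [2, 2, 0, 2, 3, 5, 4, 4]
print(haar_decompose(S))   # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]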
References

◼ Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques
◼ Foster Provost and Tom Fawcett, Data Science for Business

39
