2-Data Fundamentals For BI - Part 1

The document outlines the fundamentals of data in business intelligence, focusing on data types, sources, quality, governance, and compliance. It emphasizes the importance of data preprocessing tasks such as cleaning, integration, reduction, and transformation to ensure data quality and reliability. Key techniques for handling missing and noisy data, as well as methods for data integration and reduction, are discussed to improve data analysis outcomes.

Data Fundamentals for

BI in a Business

1
Agenda

• Understanding data in the context of business


operations
• Data types, sources, and quality
• Data governance and compliance considerations
for business

2
Data Quality: Why Preprocess the Data?

◼ Measures for data quality (checking whether the data are good or bad):
A multidimensional view
◼ Accuracy: correct or wrong, accurate or not
◼ Completeness: not recorded, unavailable, …
◼ Consistency: some values modified but others not, dangling references, …
◼ Timeliness: is the data updated in a timely manner?
◼ Believability: how much the data are trusted to be correct
◼ Interpretability: how easily the data can be understood

3
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation

4
Data Cleaning

◼ Data in the Real World Is Dirty: Lots of potentially incorrect data,


e.g., faulty instruments, human or computer error, transmission errors
◼ incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
◼ e.g., Occupation=“ ” (missing data)
◼ noisy: containing noise, errors, or outliers
◼ e.g., Salary=“−10” (an error)
◼ inconsistent: containing discrepancies in codes or names, e.g.,
◼ Age=“42”, Birthday=“03/07/2010”
◼ Was rating “1, 2, 3”, now rating “A, B, C”
◼ Intentional (e.g., disguised missing data)
◼ Jan. 1 as everyone’s birthday?

5
Data Cleaning: Incomplete (Missing) Data

◼ Missing data may be due to


◼ equipment malfunction
◼ data deleted because it was inconsistent with other recorded data
◼ data not entered due to misunderstanding
◼ certain data may not have been considered important at the
time of entry
◼ history or changes of the data were not recorded
◼ Missing data may need to be inferred

6
How to Handle Missing Data?
◼ Ignore the tuple: deleting the entire record (row) if it has missing
data — not effective when the % of missing values per attribute varies
considerably
◼ Fill in the missing value manually: looking at each missing value
and trying to find the correct information. This is accurate but very
time-consuming and often impossible if you have a lot of data.
◼ Fill it in automatically with
◼ a global constant: e.g., “unknown” (which effectively creates a new class?!)
◼ the attribute mean (average)
◼ the attribute mean for all samples belonging to the same class
(smarter)
◼ the most probable value: predict the most likely value from the
other attributes using an inference-based method such as a Bayesian
formula or a decision tree

7
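As a rough illustration (not from the slides), the automatic fill strategies above can be sketched with pandas; the DataFrame and its column names are invented for the example:

import pandas as pd

# Hypothetical table with missing salary values
df = pd.DataFrame({
    "occupation": ["engineer", "engineer", "teacher", "teacher", "teacher"],
    "salary": [5000.0, None, 3000.0, None, 3200.0],
})

# Global constant (a sentinel for numeric attributes, or a label such as "unknown")
df["salary_const"] = df["salary"].fillna(-1)

# Attribute mean
df["salary_mean"] = df["salary"].fillna(df["salary"].mean())

# Attribute mean per class (here, per occupation) -- usually smarter
df["salary_class_mean"] = df["salary"].fillna(
    df.groupby("occupation")["salary"].transform("mean"))

print(df)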
Data Cleaning: Noisy Data

◼ Noise: random error or variance in a measured variable.


It makes the data less clear and harder to analyze

◼ Incorrect attribute values may be due to


◼ faulty data collection instruments

◼ data entry problems

◼ data transmission problems

◼ technology limitation

◼ inconsistency in naming convention

8
How to Handle Noisy Data?

◼ Binning
◼ first sort the data and partition it into (equal-frequency or equal-width) bins

◼ Example: ages: 10, 12, 15, 18, 20, 22, 25, 28, 30. You could
create three bins: (10-19) (20-29) (30+).
◼ Smoothing by bin means: Replace each age in the 10-19 bin
with the average age of that bin (which would be around 14).
Do the same for the other bins.
◼ Smoothing by bin median: Replace each age in the 10-19
bin with the middle age of that bin.
◼ Smoothing by bin boundaries: Replace each age in the 10-
19 bin with the closest bin boundary (either 10 or 19).

9
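A small sketch of the three smoothing variants applied to the ages example above; the bin edges follow the slide (with "30+" capped at 39 just for the sketch):

import numpy as np

ages = np.array([10, 12, 15, 18, 20, 22, 25, 28, 30])
bins = [(10, 19), (20, 29), (30, 39)]

by_mean, by_median, by_boundary = [], [], []
for low, high in bins:
    members = ages[(ages >= low) & (ages <= high)]
    for a in members:
        by_mean.append(round(float(members.mean()), 2))           # smoothing by bin means
        by_median.append(float(np.median(members)))               # smoothing by bin medians
        by_boundary.append(low if a - low <= high - a else high)  # closest bin boundary

print(by_mean)      # [13.75, 13.75, 13.75, 13.75, 23.75, 23.75, 23.75, 23.75, 30.0]
print(by_boundary)  # [10, 10, 19, 19, 20, 20, 29, 29, 30]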
How to Handle Noisy Data?

◼ Regression
◼ smooth by fitting the data into regression functions

◼ Clustering
◼ Groups similar data points together

◼ detect and remove outliers (which often fall far away from the
clusters)
◼ Combined computer and human inspection
◼ detect suspicious values and check by human (e.g., deal

with possible outliers)

10
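As an illustrative sketch of smoothing by regression and of flagging suspicious values for human inspection (the data, the linear fit, and the 2-sigma threshold are all arbitrary choices):

import numpy as np

# Hypothetical noisy measurements y taken at times x; 30.0 looks like an outlier
x = np.arange(1.0, 9.0)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 30.0, 14.1, 15.9])

# Regression: fit a straight line and replace each value with the fitted (smoothed) one
slope, intercept = np.polyfit(x, y, deg=1)
y_smoothed = slope * x + intercept

# Combined computer and human inspection: flag points far from the fit for review
residuals = y - y_smoothed
suspicious = np.abs(residuals) > 2 * residuals.std()
print(np.round(y_smoothed, 1))
print(suspicious)   # only the 30.0 reading is flagged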
Data Cleaning as a Process
◼ It's like detective work: find the errors, then fix them.

◼ Data discrepancy detection: finding data that doesn't make sense


◼ Use metadata (e.g., domain, range, dependency, distribution)
◼ Check field overloading: one field is used to store multiple
pieces of information. (A "contact" field might contain both a phone
number and an email address)
◼ Check uniqueness rule (Each customer ID should be unique),
consecutive rule (numbers should usually be in sequence) and null
rule (Certain fields, like "required fields", should not be empty)
◼ Use commercial tools designed to find data discrepancies
◼ Data scrubbing: use simple domain knowledge (e.g., postal codes,
spell-check) to find and fix errors
◼ Data auditing: analyze the data to discover rules and relationships and
detect violators (e.g., correlation and clustering to find outliers)

11
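A hedged pandas sketch of the uniqueness, null, and domain-range checks described above; the customer table, column names, and the valid age range are invented for illustration:

import pandas as pd

customers = pd.DataFrame({
    "cust_id": [101, 102, 102, 104],
    "email":   ["a@x.com", None, "b@x.com", "c@x.com"],
    "age":     [34, 29, 29, 230],
})

# Uniqueness rule: each cust_id should appear only once
duplicate_ids = customers[customers["cust_id"].duplicated(keep=False)]

# Null rule: required fields (here, email) should not be empty
missing_email = customers[customers["email"].isna()]

# Domain/range check driven by metadata (assumed valid range 0-120)
bad_age = customers[~customers["age"].between(0, 120)]

print(duplicate_ids, missing_email, bad_age, sep="\n\n")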
Data Cleaning as a Process

◼ Data migration and integration: moving and combining data from


different sources.
◼ Data migration tools: transfer data and make changes during
the transformation (A tool might convert dates from one format to
another during migration)
◼ ETL (Extraction/Transformation/Loading) tools: allow users
to specify transformations through a graphical user interface

12
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation

13
Data Integration

◼ Data integration: Combines data from multiple sources into a


coherent store
◼ Schema integration: matching up the structure of the data from
different sources, e.g., A.cust-id ≡ B.cust-#
◼ Entity identification problem: deciding whether records from different
sources refer to the same real-world entity
◼ e.g., is "Bill Clinton" in one database the same person as "William
Clinton" in another?
◼ Detecting and resolving data value conflicts: for the same real-
world entity, attribute values from different sources differ
◼ e.g., one database might say a customer's age is 30, while another
says it is 32


◼ Possible reasons: different representations, different scales, e.g.,

metric vs. British units


14
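A toy sketch of resolving a data value conflict caused by different scales (metric vs. British units); the sources, field names, and the idea of converting everything to kilograms are illustrative only:

def to_kilograms(value: float, unit: str) -> float:
    """Convert a weight to kilograms so both sources use the same scale."""
    factors = {"kg": 1.0, "lb": 0.45359237}
    return value * factors[unit]

source_a = {"cust_id": 17, "weight": 80.0, "unit": "kg"}    # metric source
source_b = {"cust_#": 17, "weight": 176.4, "unit": "lb"}    # British-unit source

w_a = to_kilograms(source_a["weight"], source_a["unit"])
w_b = to_kilograms(source_b["weight"], source_b["unit"])
print(round(w_a, 1), round(w_b, 1))   # both about 80.0 kg once the scales agree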
Handling Redundancy in Data Integration

◼ Redundant data occur often when integration of multiple


databases
◼ Object identification: The same attribute or object may have
different names in different databases
◼ Derivable data: One attribute may be calculated (derived) from
another attribute in a different database, e.g., annual revenue
◼ Redundant attributes may be detected by
correlation analysis (how strongly two things are related) and
covariance analysis (how two things change together)
◼ Careful integration of the data from multiple sources improves
mining speed and quality

15
Handling Redundancy: Correlation Analysis (Nominal Data)

◼ Χ2 (chi-square) test: is a statistical test that helps us understand if


there's a relationship between two categorical variables (variables
that can be divided into categories, like colors, types of fruit, or
survey responses).
χ² = Σ [(Observed − Expected)² / Expected]
◼ The larger the Χ2 value, the more likely the variables are related
(Strong Relationship)
◼ The cells that contribute the most to the Χ2 value are those whose
actual count is very different from the expected count
◼ Correlation does not imply causality (Just because two things are
related (correlated), it doesn't mean one causes the other.)
◼ # of hospitals and # of car thefts in a city are correlated
◼ Both are causally linked to the third variable: population
16
Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction         250 (90)        200 (360)         450
Not like science fiction      50 (210)      1000 (840)        1050
Sum (col.)                   300            1200              1500

◼ Numbers in parentheses are expected counts, calculated
from the marginal distributions of the two categories
χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
◼ It shows that like_science_fiction and play_chess are
correlated in the group

17
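The χ² value on this slide can be checked with SciPy's chi2_contingency; the Yates continuity correction is disabled so that the plain formula above is applied:

import numpy as np
from scipy.stats import chi2_contingency

# Rows: likes science fiction (yes/no); columns: plays chess (yes/no)
observed = np.array([[250, 200],
                     [50, 1000]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(expected)        # [[ 90. 360.] [210. 840.]] -- the values shown in parentheses
print(round(chi2, 2))  # 507.93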
Handling Redundancy: Correlation Analysis (Numeric Data)

◼ Correlation coefficient (also called Pearson’s product


moment coefficient), is a statistical measure that expresses the
extent to which two variables are linearly related, meaning how much
they change together at a constant rate.

r(A, B) = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / [(n − 1) σA σB] = [Σᵢ (aᵢbᵢ) − n Ā B̄] / [(n − 1) σA σB]

◼ where n is the number of tuples,
◼ Ā and B̄ are the respective means of A and B,
◼ σA and σB are the respective standard deviations of A and B,
◼ Σ(aᵢbᵢ) is the sum of the AB cross-product.

18
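A small numpy sketch showing that the two equivalent forms of the formula give the same value as numpy's built-in corrcoef; the data values are arbitrary:

import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

n = len(A)
A_bar, B_bar = A.mean(), B.mean()
sA, sB = A.std(ddof=1), B.std(ddof=1)          # sample standard deviations

r_definition = np.sum((A - A_bar) * (B - B_bar)) / ((n - 1) * sA * sB)
r_shortcut = (np.sum(A * B) - n * A_bar * B_bar) / ((n - 1) * sA * sB)
r_numpy = np.corrcoef(A, B)[0, 1]

print(round(r_definition, 4), round(r_shortcut, 4), round(r_numpy, 4))   # all three agree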
Handling Redundancy: Correlation Analysis (Numeric Data)

• Positive Correlation (r > 0): As one variable increases, the other


tends to increase.
• Example: Height and weight are often positively correlated. Taller
people tend to be heavier. The closer r is to 1, the stronger the
positive relationship.

• Negative Correlation (r < 0): As one variable increases, the other


tends to decrease.
• Example: Temperature and hot chocolate sales might be
negatively correlated. Higher temperatures tend to mean fewer hot
chocolate sales.

• No Correlation (r = 0): The variables don't seem to have any linear


relationship.
• Example: Shoe size and IQ are likely to have a correlation close to
zero.
19
Visually Evaluating Correlation

◼ Scatter plots
showing the
similarity from –1
to 1.
◼ The closer r is to -
1 or 1, the
stronger the linear
relationship.
◼ The closer r is to
0, the weaker the
linear relationship.

20
Correlation (viewed as linear relationship)
◼ Correlation measures the linear relationship between
objects
◼ To compute correlation, we standardize data objects, A
and B, and then take their dot product

a′ₖ = (aₖ − mean(A)) / std(A)

b′ₖ = (bₖ − mean(B)) / std(B)

correlation(A, B) = A′ · B′

21
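A short numpy sketch of the standardize-then-dot-product view; note (my addition) that when sample standard deviations are used, the dot product is divided by n − 1 to recover Pearson's r. The data values are arbitrary:

import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# Standardize each object: subtract the mean, divide by the standard deviation
A_std = (A - A.mean()) / A.std(ddof=1)
B_std = (B - B.mean()) / B.std(ddof=1)

# Dot product of the standardized objects, scaled by n - 1, gives Pearson's r
r = np.dot(A_std, B_std) / (len(A) - 1)
print(round(r, 4), round(np.corrcoef(A, B)[0, 1], 4))   # the two values match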
Handling Redundancy: Covariance (Numeric Data)

◼ Covariance measures how two variables change together. It is similar to
correlation, but instead of standardizing the data (making the standard
deviation equal to one), it keeps the original scales of the variables.

Covariance: Cov(A, B) = E[(A − Ā)(B − B̄)] = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / n

Correlation coefficient: r(A, B) = Cov(A, B) / (σA σB)

◼ where n is the number of tuples,
◼ Ā and B̄ are the respective means (expected values) of A and B,
◼ σA and σB are the respective standard deviations of A and B.

22
Handling Redundancy: Covariance (Numeric Data)

• Positive covariance: If Cov(A, B) > 0, then when A is greater than its
average, B also tends to be greater than its average, and vice versa. They
move up or down together.
• Example: Height and weight might have positive covariance. Taller
people tend to be heavier.

• Negative covariance: If Cov(A, B) < 0, then when A is greater than its
average, B tends to be smaller than its average. They move in opposite
directions.
• Example: Temperature and hot chocolate sales might have negative
covariance. Higher temperatures tend to mean fewer hot chocolate
sales.

• Independence: If A and B are independent, then Cov(A, B) = 0; there is no
linear relationship, and knowing the value of A tells you nothing about B.
(The converse does not hold: zero covariance only rules out a linear relationship.)
• Example: Shoe size and IQ score likely have zero (or very close to
zero) covariance.
23
Co-Variance: An Example

◼ It can be simplified in computation as Cov(A, B) = E(A·B) − Ā·B̄

◼ Suppose two stocks A and B have the following prices in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).

◼ Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?

◼ E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4

◼ E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6

◼ Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4

◼ Thus, A and B rise together since Cov(A, B) > 0.
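The stock example can be verified numerically; the shortcut Cov(A, B) = E(A·B) − E(A)·E(B) and numpy's population covariance both give 4:

import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)      # stock A prices
B = np.array([5, 8, 10, 11, 14], dtype=float)   # stock B prices

E_A, E_B = A.mean(), B.mean()
cov_shortcut = np.mean(A * B) - E_A * E_B       # E(A*B) - E(A)E(B)
cov_numpy = np.cov(A, B, bias=True)[0, 1]       # population covariance (divide by n)

print(E_A, E_B)                                      # 4.0 9.6
print(round(cov_shortcut, 6), round(cov_numpy, 6))   # 4.0 4.0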


Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation

25
Data Reduction

◼ Data reduction: Obtain a reduced representation of the


data set that is much smaller in volume but yet produces
the same (or almost the same) analytical results

◼ Why data reduction? A database/data warehouse may


store terabytes of data. Complex data analysis may take a
very long time to run on the complete data set.

26
Data Reduction Strategies

◼ Data reduction strategies


◼ Dimensionality reduction, e.g., remove unimportant attributes

◼ Wavelet transforms

◼ Principal Components Analysis (PCA)

◼ Feature subset selection, feature creation

◼ Numerosity reduction (some simply call it: Data Reduction)

◼ Regression and Log-Linear Models

◼ Histograms, clustering, sampling

◼ Data cube aggregation

◼ Data compression

27
Data Reduction: Dimensionality Reduction

◼ Curse of dimensionality
◼ When dimensionality increases, data becomes increasingly sparse
◼ Density and distance between points, which are critical to clustering and outlier
analysis, become less meaningful
◼ The possible combinations of subspaces will grow exponentially
◼ It becomes difficult to find meaningful patterns in the data.
◼ It requires more storage space and processing power.
◼ Dimensionality reduction benefits:
◼ Avoid the curse of dimensionality
◼ Help eliminate irrelevant features and reduce noise
◼ Reduce time and space required in data mining
◼ Allow easier visualization

28
29

Data Reduction: Dimensionality Reduction

◼ Dimensionality reduction techniques:


◼ Wavelet transforms
◼ Principal Component Analysis
◼ Supervised and nonlinear techniques (e.g., feature
selection)
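A minimal numpy sketch of PCA via the SVD, showing how a redundant attribute collapses into a few strong components; the 100×5 data set is randomly generated for the example:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = 2 * X[:, 0] + 0.01 * rng.normal(size=100)   # a nearly redundant attribute

# Center the data, then take the top-k right singular vectors as principal components
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 2
X_reduced = X_centered @ Vt[:k].T       # data expressed in the top-k components
explained = (S ** 2) / np.sum(S ** 2)   # fraction of variance captured by each component

print(X_reduced.shape)        # (100, 2)
print(explained.round(3))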
Mapping Data to a New Space

◼ Wavelet transform

[Figure: two sine waves, the same waves with added noise, and their frequency-domain representation]

30
31

What Is Wavelet Transform?

◼ These are mathematical techniques


that can decompose data (signal)
into different frequency
components, to identify and discard
less important components.

◼ It is powerful for analyzing signals


that have sudden changes or
discontinuities. It breaks down a
signal into "wavelets," which are
localized waves of different scales.
32

What Is Wavelet Transform?

• The Wavelet Transform represents the signal in terms of these wavelets,


specifying both the frequencies present and when (or where) they occur.

• Example:
• Image: Wavelets are excellent for image compression (like JPEG
2000). They can identify sharp changes in an image (like edges) very
efficiently.
• Audio: If you have a recording with a sudden loud noise, wavelets
can pinpoint exactly when that noise occurred.
• Medical: Wavelets are used in analyzing EEG data, where sudden
spikes or changes in the signal can be important.
Wavelet Transformation
[Figure: Haar-2 and Daubechies-4 wavelet functions]

◼ Discrete wavelet transform (DWT) is a tool for analyzing signals


at different levels of detail. It can focus on the most important parts
of the signal and discard the rest, resulting in smaller file sizes without
significantly sacrificing quality. The DWT is used for:
◼ Linear signal processing, multi-resolution analysis: processing signals
(like audio, images, or sensor data) and looking at the signal at different
levels of detail (multi-resolution).
◼ Compressed approximation: storing only a small fraction of the
strongest wavelet coefficients (focusing on the most important
parts of the signal and discarding the less important ones).
◼ Lossy compression: some information is lost during compression, but the
DWT minimizes the impact on the perceived quality (e.g., how a song sounds).

33
34

Wavelet Transformation

◼ How the DWT Works

◼ Padding: The length of the data needs to be a power of 2. If it's


not, add zeros to the end (padding).

◼ Smoothing & Difference Functions: The DWT uses two special


functions: one that "smooths" the data (averaging nearby
values) and one that calculates the "difference" between nearby
values (like highlighting changes).

◼ Recursive Application: The DWT applies these two functions to


pairs of data points, then applies them again, recursively,
to the smoothed data until the result reaches the desired length (the desired
level of detail).
35

Wavelet Transformation
[Figure: Haar-2 and Daubechies-4 wavelet functions]

◼ The Haar wavelet is a specific type of wavelet that is used


in the Discrete Wavelet Transform (DWT).

◼ The Haar Wavelet Transform is a technique used to


decompose a signal (like a sound or image) into different
frequency components.
Wavelet Decomposition
◼ The table shows a simplified example of how the Haar transform works:

• Resolution: This indicates the level of detail. Higher resolution (8)


means more detail, lower resolution (1) means less detail (more
general).
• Averages: These represent the smoothed or averaged values of the
data at each resolution.
• Detail Coefficients: These capture the details or fluctuations that
were lost when smoothing the data.
36
Wavelet Decomposition
◼ Starting Data (Resolution 8): 8 data points, e.g., measurements or
samples of a signal such as a sound wave or temperature readings:
[2, 2, 0, 2, 3, 5, 4, 4].
◼ First Transformation (Resolution 4):
◼ Averages: The DWT applies a "smoothing" function to pairs of data

points and calculates the averages:


• (2+2)/2 = 2 (0+2)/2 = 1
• (3+5)/2 = 4 (4+4)/2 = 4
• So the averages: [2, 1, 4, 4].
◼ Detail Coefficients: The DWT also calculates the "differences" or
"details" lost in the smoothing process:
• (2-2)/2 = 0
• (0-2)/2 = -1
• (3-5)/2 = -1
• (4-4)/2 = 0

◼ the detail coefficients: [0, -1, -1, 0].


◼ Second Transformation (Resolution 2): …..
Wavelet Decomposition
◼ So S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to
S^ = [2¾, −1¼, ½, 0, 0, −1, −1, 0] = [2.75, −1.25, 0.5, 0, 0, −1, −1, 0]
◼ Drawing a tree for the Haar transform can be helpful to visualize the
process of decomposing a signal into different frequency components.
◼ The Structure: Hierarchical Decomposition
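A minimal sketch of the recursive average/difference (Haar) decomposition walked through above; it assumes the input length is already a power of 2 (so the padding step is skipped) and reproduces S^ for the example signal:

def haar_decompose(signal):
    data = list(signal)
    details = []
    while len(data) > 1:
        averages = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
        diffs = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
        details = diffs + details   # prepend, so coarser-level details come first
        data = averages             # recurse on the smoothed (averaged) data
    return data + details           # [overall average, coarse-to-fine detail coefficients]

S = [2, 2, 0, 2, 3, 5, 4, 4]
print(haar_decompose(S))   # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]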
References

◼ Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques
◼ Foster Provost and Tom Fawcett, Data Science for Business

39
