DMDW Full PDF
2 DM, DMDW
Road Map
3 DM, DMDW
Definition ([Liu 11])
4 DM, DMDW
Definition ([Ullman 09, 10])
5 DM, DMDW
Definition ([Wikipedia])
Data mining (the analysis step of the
"Knowledge Discovery in Databases" process,
or KDD), an interdisciplinary subfield
of computer science, is the computational
process of discovering patterns in large data
sets ("big data") involving methods at the
intersection of artificial intelligence, machine
learning, statistics, and database systems.
6 DM, DMDW
Definition ([Wikipedia])
The overall goal of the data mining process is to
extract information from a data set and transform
it into an understandable structure for further
use.
Aside from the raw analysis step, it involves
database and data management aspects, data
preprocessing, model and inference
considerations, interestingness metrics,
complexity considerations, post-processing of
discovered structures, visualization, and online
updating.
Florin Radulescu, Course 1
7 DM, DMDW
Definition ([Kimball, Ross 02])
A class of undirected queries, often against the most
atomic data, that seek to find unexpected patterns in the
data.
The most valuable results from data mining are
clustering, classifying, estimating, predicting, and finding
things that occur together.
There are many kinds of tools that play a role in data
mining, including decision trees, neural networks,
memory- and case-based reasoning tools, visualization
tools, genetic algorithms, fuzzy logic, and classical
statistics.
Generally, data mining is a client of the data warehouse.
Florin Radulescu, Course 1
8 DM, DMDW
Conclusions
The data mining process converts data into valuable
knowledge that can be used for decision support
Data mining is a collection of data analysis
methodologies, techniques and algorithms for
discovering new patterns
Data mining is used for large data sets
Data mining process is automated (no need for human
intervention)
Data mining and Knowledge Discovery in Databases
(KDD) are considered by some authors to be the same
thing. Other authors list data mining as the analysis step
in the KDD process - after data cleaning and
transformation and before results visualization /
evaluation.
Florin Radulescu, Course 1
9 DM, DMDW
Success stories (1)
Some early success stories in using data mining (from
[Ullman 03]):
• Decision trees constructed from bank-loan histories to
produce algorithms to decide whether to grant a loan.
• Patterns of traveler behavior mined to manage the sale
of discounted seats on planes, rooms in hotels, etc.
• “Diapers and beer”: the observation that customers buying
diapers are more likely than average to also buy beer
allowed supermarkets to place beer and diapers nearby,
knowing that many customers would walk between them.
Placing potato chips between them increased sales of all three
items.
Florin Radulescu, Course 1
10 DM, DMDW
Success stories (2)
• Skycat and Sloan Sky Survey: clustering sky objects by
their radiation levels in different bands allowed
astronomers to distinguish between galaxies, nearby
stars, and many other kinds of celestial objects.
• Comparison of the genotype of people with/without a
condition allowed the discovery of a set of genes that
together account for many cases of diabetes. This sort of
mining will become much more important as the human
genome is constructed.
11 DM, DMDW
What is not Data Mining
Find a certain person in an employee database
Compute the minimum, maximum, sum, count or
average values based on table/tables columns
Use a search engine to find your name occurrences on
the web
12 DM, DMDW
DM software (1)
In ([Mikut, Reischl 11]) DM software programs are classified in 9
categories:
Data mining suites (DMS) focus on data mining and include
numerous methods and support feature tables and time series.
Examples:
Commercial: IBM SPSS Modeler, SAS Enterprise Miner, DataEngine,
GhostMiner, Knowledge Studio, NAG Data Mining Components,
STATISTICA
Open source: RapidMiner
Business intelligence packages (BIs) include basic data mining
functionality (statistical methods in business applications); they are often
restricted to feature tables and time series, and large feature tables
are supported. Examples:
Commercial: IBM Cognos 8 BI, Oracle Data Mining, SAP NetWeaver
Business Warehouse, Teradata Database, DB2 Data Warehouse from
IBM
Open source: Pentaho
Florin Radulescu, Course 1
13 DM, DMDW
DM software (2)
Mathematical packages (MATs) provide a large and extendable set
of algorithms and visualization routines. Examples:
Commercial: MATLAB, R-PLUS
Open source: R, Kepler
Integration packages (INTs) are extendable bundles of many
different open-source algorithms
Stand-alone software (KNIME, the GUI-version of WEKA, KEEL, and
TANAGRA)
Larger extension package for tools from the MAT type
Extensions (EXT) are smaller add-ons for other tools such as
Excel, Matlab, R, with limited but quite useful functionality.
Examples:
Artificial neural networks for Excel (Forecaster XL and XLMiner)
MATLAB (Matlab Neural Networks Toolbox).
Florin Radulescu, Course 1
14 DM, DMDW
DM software (3)
Data mining libraries (LIBs) implement data mining methods as a
bundle of functions and can be embedded in other software tools
using an Application Programming Interface. Examples: Neurofusion
for C++, WEKA, MLC++, JAVA Data Mining Package, LibSVM
Specialties (SPECs) are similar to DMS tools, but implement only
one special family of methods such as artificial neural networks.
Examples: CART, Bayesia Lab, C5.0, WizRule, Rule Discovery
System, MagnumOpus, JavaNNS, Neuroshell, NeuralWorks Predict,
RapAnalyst.
Research (RES) are usually the first implementations of new
algorithms, with restricted graphical support and without automation
support. RES tools are mostly open source. WEKA and RapidMiner
started in this category.
Solutions (SOLs) describe a group of tools that are customized to
narrow application fields. Examples: for text mining: GATE, image
processing: ITK, ImageJ, drug discovery: Molegro Data Modeler
15 DM, DMDW
Communities involved
[Diagram: data mining at the intersection of several research communities:
database systems, AI, clustering, visualization]
16 DM, DMDW
Road Map
17 DM, DMDW
Data mining steps (1)
1. Data collection: Data gathering from existing
databases or (for Internet documents) from Web
crawling.
2. Data preprocessing, including:
– Data cleaning: replace (or remove) missing values, smooth
noisy data, remove or just identify outliers, remove
inconsistencies.
– Data integration: integration of data from multiple sources, with
possible different data types and structures and also handling
of duplicate or inconsistent data.
– Data transformation: data normalization (or standardization),
summarizations, generalization, new attributes construction,
etc.
18 DM, DMDW
Data mining steps (2)
2. Data preprocessing (cont):
– Data reduction (also called feature extraction): not all the
attributes are necessary for the particular Data Mining process
we want to perform. Only relevant attributes are selected for
further processing, reducing the total size of the dataset (and
the time needed for running the algorithm).
– Discretization: some algorithms work only on discrete data.
For that reason the values for continuous attributes must be
replaced with discrete ones from a limited set. One example is
replacing age (number) with an attribute having only three
values: Young, Middle-age and Old.
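A minimal Python sketch of this kind of discretization (not from the course; the column name 'age', the cut points 30 and 60 and the sample values are assumptions chosen only for illustration):

import pandas as pd

# made-up ages; the bins (0,30], (30,60], (60,120] are arbitrary illustration choices
ages = pd.DataFrame({"age": [12, 25, 33, 47, 58, 63, 71, 80]})
ages["age_group"] = pd.cut(ages["age"],
                           bins=[0, 30, 60, 120],
                           labels=["Young", "Middle-age", "Old"])
print(ages)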
19 DM, DMDW
Data mining steps (3)
3. Pattern extraction and discovery. This is the stage
where the data mining algorithm is used to obtain the
result. Some authors consider that Data Mining is
restricted to this step only, the whole process being
called KDD.
4. Visualization: because data mining extracts hidden
properties/information from data, it is necessary to
visualize the results for a better understanding and
evaluation. Visualization is also needed for the input data.
5. Evaluation of results: not everything output by
a data mining algorithm is a valuable fact or piece of
information. Some results are statistical truths and
others are not interesting/useful for our activity. Expert
judgment is necessary in evaluating the results.
Florin Radulescu, Course 1
20 DM, DMDW
Bonferroni principle (1)
Information discovered by a ‘data mining’ process
may be merely a statistical truth. Example (from [Ullman 03]):
In the 1950s, David Rhine, a parapsychologist, tested
students in order to find out whether or not they had
extrasensory perception (ESP).
He asked them to guess the color of 10 successive
cards – red or black. The result was that 1/1000 of them
guessed all 10 cards (and he declared that they had ESP).
Re-testing only these students, he found that they had
lost their ESP after learning that they had this ability.
David Rhine did not realize that the probability of
guessing 10 successive cards is 1/1024 = 1/2^10,
because the probability of guessing each of these 10 cards is 1/2
(red or black).
Florin Radulescu, Course 1
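A quick check of the arithmetic behind this example (the cohort size of 1000 students is only an assumed round number):

# probability that one student guesses all 10 red/black cards by pure chance
p_all_correct = 0.5 ** 10                 # 1/1024, roughly 1/1000
students = 1000
print(p_all_correct)                      # 0.0009765625
print(students * p_all_correct)           # about 1 student expected to 'show ESP' by chance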
21 DM, DMDW
Bonferroni principle (2)
Results of this kind may be included in the output of a
data mining algorithm but must be recognized as
statistical truths and not as real data mining output.
This is exactly what the Bonferroni principle addresses.
It can be summarized as follows: estimate the number of
occurrences of the pattern you are looking for under the
assumption that the data is purely random; if this number is
much larger than the number of real instances you expect to
find, then most of the ‘discovered’ patterns are statistical
artifacts.
22 DM, DMDW
Road Map
23 DM, DMDW
Method types
Prediction methods. These methods use some
variables to predict the values of other variables. A
good example for that category is classification. Based
on known, labeled data, classification algorithms build
models that can be used for classifying new, unseen
data.
Description methods. Algorithms in this category find
patterns that can describe the inner structure of the
dataset. For example clustering algorithms find groups
of similar objects in a dataset (called clusters) and
possible isolated objects, far away from any cluster,
called outliers.
24 DM, DMDW
Algorithms
25 DM, DMDW
Classification
Input:
• A set of k classes C = {c1, c2, …, ck}
• A set of n labeled items D = {(d1, ci1), (d2, ci2), …, (dn,
cin)}. The items are d1, …, dn, each item dj being labeled
with a class cij ∈ C. D is called the training set.
• For the calibration of some algorithms a validation set is
required. This validation set also contains labeled items
not included in the training set.
Output:
• A model or method for classifying new items (a
classifier). The set of new items that will be classified
using the model/method is called the test set
Florin Radulescu, Course 1
26 DM, DMDW
Example
Let us consider a medical set of items where each item
is a patient of a hospital emergency unit (RO: UPU).
There are 5 classes, representing maximum waiting time
categories: C0, C10, C30, C60 and C120, Ck meaning
the patient waits maximum k minutes.
We may represent these data in tabular format
The output of a classification algorithm using this training
set may be for example a decision tree or a set of
ordered rules.
The model may be used to classify future patients and
assign a waiting time label to them
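A minimal sketch of such a classifier in Python with scikit-learn; the feature encoding, the rows and the class labels below are invented for illustration and are not the course's emergency-unit dataset:

from sklearn.tree import DecisionTreeClassifier, export_text

# features: [vital_risk, danger_if_waits, resources_needed] -- hypothetical encoding
X = [[1, 1, 2],
     [0, 1, 1],
     [0, 0, 1],
     [0, 0, 0],
     [1, 1, 3]]
y = ["C0", "C10", "C30", "C120", "C0"]    # waiting-time class labels (training set)

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(clf, feature_names=["vital_risk", "danger_if_waits", "resources_needed"]))
print(clf.predict([[0, 1, 0]]))           # classify a new, unseen patient (test set)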
27 DM, DMDW
Emergency unit training set
Columns: Name (or ID) | Vital risk? | Danger if waits? | 0 resources
needed | 1 resource needed | >1 resources needed | >1 resources
needed and vital functions affected | Waiting time (class label)
28 DM, DMDW
Result: decision tree
29 DM, DMDW
Regression (1)
Regression is related to statistics.
Meaning: predicting a value of a given
continuous valued variable based on the values
of other variables, assuming a linear or
nonlinear model of dependency ([Tan,
Steinbach, Kumar 06]).
Used in prediction and forecasting - its use
overlaps machine learning.
Regression analysis is also used to understand the
relationship between the independent variables and
the dependent variable, and can be used to
infer causal relationships between them.
Florin Radulescu, Course 1
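A minimal sketch of linear regression in Python (a least-squares fit with NumPy); the data points are made up for illustration:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly y = 2x

a, b = np.polyfit(x, y, deg=1)            # slope and intercept of the fitted line
print(a, b)
print(a * 6.0 + b)                        # predicted value for a new x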
30 DM, DMDW
Regression (2)
31 DM, DMDW
Example
32 DM, DMDW
Deviation detection
Deviation detection or anomaly detection means discovering
significant deviation from the normal behavior. Outliers are a
significant category of abnormal data.
Deviation detection can be used in many circumstances:
Data mining algorithm running stage: often such information may
be important for business decisions and scientific discovery.
Auditing: such information can reveal problems or mal-practices.
Fraud detection in a credit card system: fraudulent claims often
carry inconsistent information that can reveal fraud cases.
Intrusion detection in a computer network may rely on abnormal
data.
Data cleaning (part of data preprocessing): such information can
be detected and possible mistakes may be corrected in this
stage.
33 DM, DMDW
Deviation detection techniques
Distance based techniques (example: k-nearest
neighbor).
One Class Support Vector Machines.
Predictive methods (decision trees, neural
networks).
Cluster analysis based outlier detection.
Pointing at records that deviate from association
rules
Hotspot analysis
34 DM, DMDW
Algorithms
35 DM, DMDW
Clustering
Input:
A set of n objects D = {d1, d2, …, dn} (called usually points).
The objects are not labeled and there is no set of class labels
defined.
A distance function (dissimilarity measure) that can be used to
compute the distance between any two points. Low valued
distance means ‘near’, high valued distance means ‘far’.
Some algorithms also need a predefined value for the number
of clusters in the produced result.
Output:
A set of object (point) groups called clusters, where points in
the same cluster are near one another and points from
different clusters are far from one another, considering the
distance function.
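A minimal clustering sketch with k-means from scikit-learn (one of many possible algorithms); the 2-D points and the choice of 2 clusters are assumptions for illustration:

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # one natural group
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])  # another group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)            # cluster id assigned to each point
print(km.cluster_centers_)   # the two cluster centers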
36 DM, DMDW
Example
Having a set of points in a 2 dimensional space, find the
natural clusters formed by these points.
[Figure: the same set of 2-D points before clustering (initial) and after clustering]
37 DM, DMDW
Association Rule Discovery
Let us consider:
A set of m items I = {i1, i2, …, im}.
A set of n transactions T = {t1, t2, …, tn},
each transaction containing a subset of I,
so if tk ∈ T then tk = {ik1, ik2, …, ikj} ⊆ I, where j
depends on k.
Then:
A rule is a construction X → Y, where X and
Y are itemsets.
Florin Radulescu, Course 1
38 DM, DMDW
Association Rule Discovery
The support of a rule is the number/proportion of
transactions containing the union of the left and
the right part of the rule (and is equal to the support of
this union as an itemset):
support(X → Y) = support(X ∪ Y)
The confidence of a rule is the proportion of
transactions containing Y among the
transactions containing X:
confidence(X → Y) = support(X ∪ Y) / support(X).
We accept a rule as valid if the support and the
confidence of the rule are at least equal to some given
thresholds.
Florin Radulescu, Course 1
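A minimal Python sketch computing these two measures directly from the definitions; the item names and transactions are made up for illustration:

def support(itemset, transactions):
    # number of transactions containing the whole itemset
    return sum(1 for t in transactions if itemset <= t)

def confidence(x, y, transactions):
    return support(x | y, transactions) / support(x, transactions)

T = [{"bread", "water", "milk"}, {"bread", "water"}, {"bread", "butter"}, {"water"}]
X, Y = {"bread"}, {"water"}
print(support(X | Y, T))      # 2 transactions contain both bread and water
print(confidence(X, Y, T))    # 2/3 of the transactions with bread also contain water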
39 DM, DMDW
Association Rule Discovery
Input:
A set of m items I = {i1, i2, …, im}.
A set of n transactions T = {t1, t2, …, tn}, each transaction
containing a subset of I, so if tk ∈ T then tk = {ik1, ik2, …, ikj} ⊆ I,
where j depends on k.
A threshold s for the support, given either as a percent or as an
absolute value. If an itemset X ⊆ I is part of w transactions,
then w is the support of X. If w >= s then X is called a frequent
itemset.
A second threshold c for rule confidence.
Output:
The set of frequent itemsets in T, having support >= s
The set of rules derived from T, having support >= s and
confidence >= c
Florin Radulescu, Course 1
40 DM, DMDW
Example
Consider the following set of transactions:
[Table with columns: Transaction ID, Items]
41 DM, DMDW
Sequences
The model:
Itemset: a set of n distinct items
I = {i1, i2, …, in }
Event: a non-empty collection of items; we can
assume that items are in a given order (e.g.
lexicographic): (i1,i2 … ik)
Sequence : an ordered list of events:
< e1 e2 … em >
42 DM, DMDW
Sequential Pattern Discovery
Input:
A set of sequences S (or a sequence database).
A Boolean function that can test whether a sequence S1 is included
in (is a subsequence of) a sequence S2. In that case S2 is
called a supersequence of S1.
A threshold s (percent or absolute value) needed for finding
frequent sequences.
Output:
The set of frequent sequences, i.e. the set of sequences that
are included in at least s sequences from S.
Sometimes a set of rules can be derived from the set of
frequent sequences, each rule being of the form S1 → S2,
where S1 and S2 are sequences.
43 DM, DMDW
Examples
In a bookstore we can find frequent sequences like:
{(Book_on_C, Book_on_C++), (Book_on_Perl)}
44 DM, DMDW
Summary
This first course presented:
A list of alternative definitions of Data Mining and some examples of
what is and what is not Data Mining
A discussion about the research communities involved in Data
Mining and about the fact that Data Mining is a cluster of
subdomains
The steps of the Data Mining process from collecting data located in
existing repositories (data warehouses, archives or operational
systems) to the final evaluation step.
A brief description of the main subdomains of Data Mining with some
examples for each of them.
45 DM, DMDW
References
[Liu 11] Bing Liu, 2011. Web Data Mining, Exploring Hyperlinks,
Contents, and Usage Data, Second Edition, Springer, 1-13.
[Tan, Steinbach, Kumar 06] Pang-Ning Tan, Michael Steinbach,
Vipin Kumar, 2006. Introduction to Data Mining, Addison-Wesley, 1-16.
[Kimball, Ross 02] Ralph Kimball, Margy Ross, 2002. The Data
Warehouse Toolkit, Second Edition, John Wiley and Sons, 1-16, 396.
[Mikut, Reischl 11] Ralf Mikut and Markus Reischl, Data mining
tools, 2011, Wiley Interdisciplinary Reviews: Data Mining and
Knowledge Discovery, Volume 1, Issue 5,
http://onlinelibrary.wiley.com/doi/10.1002/widm.24/pdf
[Ullman 03-09] Jeffrey Ullman, Data Mining Lecture Notes, 2003-2009,
web page: http://infolab.stanford.edu/~ullman/mining/mining.html
[Wikipedia] Wikipedia, the free encyclopedia, en.wikipedia.org
46 DM, DMDW
2. Data preprocessing
Data types
Measuring data
Data cleaning
Data integration
Data transformation
Data reduction
Data discretization
Summary
Florin Radulescu, Note de curs
2 DMDW-2
Data types
3 DMDW-2
Categorical vs. Numerical
Categorical data, consisting in names representing
some categories, meaning that they belong to a
definable category. Example: color (with categories
red, green, blue and white) or gender (male, female).
The values of this type are not ordered, the usual
operations that may be performed being equality and
set inclusion.
Numerical data, consisting in numbers from a
continuous or discrete set of values.
Values are ordered, so testing this order is possible (<,
>, etc).
Sometimes we must or may convert categorical data in
numerical data by assigning a numeric value (or code)
for each label.
Florin Radulescu, Note de curs
4 DMDW-2
Scale types
5 DMDW-2
Scale types
6 DMDW-2
Nominal
Values belonging to a nominal scale are
characterized by labels.
Values are unordered and equally weighted.
We cannot compute the mean or the median
from a set of such values
Instead, we can determine the mode, meaning
the value that occurs most frequently.
Nominal data are categorical but may be
treated sometimes as numerical by assigning
numbers to labels.
Florin Radulescu, Note de curs
7 DMDW-2
Ordinal
Values of this type are ordered but the difference or
distance between two values cannot be determined.
The values only determine the rank order /position in the
set.
Examples: the military rank set or the order of
marathoners at the Olympic Games (without the times)
For these values we can compute the mode or the
median (the value placed in the middle of the ordered
set) but not the mean.
These values are categorical in essence but can be
treated as numerical because of the assignment of
numbers (position in set) to the values
Florin Radulescu, Note de curs
8 DMDW-2
Interval
These are numerical values.
For interval scaled attributes the difference between two
values is meaningful.
Example: the temperature using Celsius scale is an
interval scaled attribute because the difference between
10 and 20 degrees is the same as the difference
between 40 and 50 degrees.
Zero does not mean ‘nothing’ but is somehow arbitrarily
fixed. For that reason negative values are also allowed.
We can compute the mean, the standard deviation or we
can use regression to predict new values.
9 DMDW-2
Ratio
Ratio scaled attributes are like interval scaled
attributes but zero means ‘nothing’.
Negative values are not allowed.
The ratio between two values is meaningful.
Example: age - a 10-year-old child is twice as
old as a 5-year-old child.
Other examples: temperature in Kelvin, mass in
kilograms, length in meters, etc.
All mathematical operations can be performed,
for example logarithms, geometric and harmonic
means, coefficient of variation
Florin Radulescu, Note de curs
10 DMDW-2
Binary data
Sometimes an attribute may have only two values, as the
gender in a previous example. In that case the attribute is
called binary.
Symmetric binary: when the two values are of the same weight
and have equal importance (as in the gender case)
Asymmetric binary: one of the values is more important than
the other. Example: a medical bulletin containing blood tests for
identifying the presence of some substances, evaluated by
‘Present’ or ‘Absent’ for each substance. In that case ‘Present’ is
more important than ‘Absent’.
Binary attributes can be treated as interval or ratio scaled but
in most of the cases these attributes must be treated as
nominal (binary symmetric) or ordinal (binary asymmetric)
There is a set of similarity and dissimilarity (distance)
functions specific to binary attributes.
Florin Radulescu, Note de curs
11 DMDW-2
Road Map
Data types
Measuring data
Data cleaning
Data integration
Data transformation
Data reduction
Data discretization
Summary
Florin Radulescu, Note de curs
12 DMDW-2
Measuring data
13 DMDW-2
Central tendency - Mean
Consider a set of n values of an attribute: x1, x2, …, xn.
Mean: The arithmetic mean or average value is:
μ = (x1 + x2 + …+ xn) / n
If the values x have different weights, w1, …, wn , then
the weighted arithmetic mean or weighted average is:
μ = (w1x1 + w2x2 + …+ wnxn) / (w1 + w2 + …+ wn)
If the extreme values are eliminated from the set
(smallest 1% and biggest 1%) a trimmed mean is
obtained.
14 DMDW-2
Central tendency - Median
Median: The median value of an ordered set is
the middle value in the set.
Example: Median for {1, 3, 5, 7, 1001, 2002,
9999} is 7.
If n is even the median is the mean of the middle
values:
the median of {1, 3, 5, 7, 1001, 2002} is 6
(arithmetic mean of 5 and 7).
15 DMDW-2
Central tendency - Mode
Mode (RO: valoarea modala): The mode of a
dataset is the most frequent value.
A dataset may have more than a single mode.
For 1, 2 and 3 modes the dataset is called
unimodal, bimodal and trimodal.
When each value is present only once there is
no mode in the dataset.
For a unimodal dataset the mode is a measure
of the central tendency of data. For these
datasets we have the empirical relation:
mean – mode = 3 x (mean – median)
Florin Radulescu, Note de curs
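A minimal sketch of these three measures using Python's standard statistics module, on a small made-up sample:

import statistics as st

values = [1, 2, 2, 3, 5, 7, 9]
print(st.mean(values))      # arithmetic mean
print(st.median(values))    # middle value of the ordered set
print(st.mode(values))      # most frequent value (2)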
16 DMDW-2
Central tendency - Midrange
Midrange: the midpoint of the range, i.e. the arithmetic mean of the
smallest and largest values in the set: midrange = (min + max) / 2.
17 DMDW-2
Dispersion (1)
Range. The range is the difference between the largest
and smallest values.
Example: for {1, 3, 5, 7, 1001, 2002, 9999} range is 9999
– 1 = 9998.
kth percentile. The kth percentile is a value xj having the
property that k percent of the values are less than or equal
to xj.
Example: the median is the 50th percentile.
The most used percentiles are the median and the 25th and
75th percentiles, also called quartiles (ro: cuartile).
Notation: Q1 for 25% and Q3 for 75%.
18 DMDW-2
Dispersion (2)
Computing method: There is more than one method
for computing Q1, Q2 and Q3. The most
obvious method is the following:
Put the values of the data set in ascending order
Compute the median using its definition. It divides the
ordered dataset into two halves (lower and upper),
neither one including the median.
The median value is Q2
The median of the lower half is Q1 (or the lower
quartile)
The median of the upper half is Q3 (or the upper
quartile)
Florin Radulescu, Note de curs
19 DMDW-2
Dispersion (3)
Interquartile range (IQR) is the difference between Q3
and Q1 (ro: interval intercuartilic):
IQR = Q3 – Q1
Potential outliers are values more than 1.5 x IQR below
Q1 or above Q3.
Five-number summary. Sometimes the median and the
quartiles are not enough for representing the spread of
the values
The smallest and biggest values must be considered
also.
(Min, Q1, Median, Q3, Max) is called the five-number
summary.
20 DMDW-2
Dispersion (4)
Examples:
For {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
Range = 10; Midrange = 6;
Q1 = 3; Q2 = 6; Q3 = 9; IQR = 9 - 3 = 6
For {1, 3, 3, 4, 5, 6, 6, 7, 8, 8}
Range = 7; Midrange = 4.5;
Q1 = 3; Q2 = 5.5 [=(5+6)/2]; Q3 = 7; IQR = 7 - 3 = 4
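A minimal Python sketch of the quartile computation using the halves-based method described above (note that numpy and pandas use interpolation by default and may return slightly different quartiles):

def median(v):
    n = len(v)
    mid = n // 2
    return v[mid] if n % 2 else (v[mid - 1] + v[mid]) / 2

def quartiles(values):
    v = sorted(values)
    n = len(v)
    lower, upper = v[: n // 2], v[(n + 1) // 2:]   # halves excluding the median
    return median(lower), median(v), median(upper)

q1, q2, q3 = quartiles([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
print(q1, q2, q3, q3 - q1)   # 3 6 9 6, as in the first example above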
21 DMDW-2
Dispersion (5)
Standard deviation. The standard deviation σ of n values
(observations) x1, x2, …, xn with mean μ is:
σ = sqrt( ((x1 – μ)^2 + (x2 – μ)^2 + … + (xn – μ)^2) / n )
The variance is σ^2.
22 DMDW-2
Road Map
Data types
Measuring data
Data cleaning
Data integration
Data transformation
Data reduction
Data discretization
Summary
Florin Radulescu, Note de curs
23 DMDW-2
Objectives
The main objectives of data cleaning are:
Replace (or remove) missing values,
Smooth noisy data,
Remove or just identify outliers
24 DMDW-2
NULL values
When a NULL value is present in data it may be:
1. Legal NULL value: Some attributes are
allowed to contain a NULL value. In such a
case the value must be replaced by something
like ‘Not applicable’ and not a NULL value.
2. Missing value: The value existed at
measurement time but was not collected.
25 DMDW-2
Missing values (1)
May appear from various reasons:
human/hardware/software problems,
data not collected (considered unimportant at
collection time),
deleted data due to inconsistencies, etc.
There are two solutions in handling missing
data:
1. Ignore the data point / example with missing
attribute values. If the number of errors is
limited and these errors are not for sensitive
data, removing them may be a solution.
Florin Radulescu, Note de curs
26 DMDW-2
Missing values (2)
2. Fill in the missing value. This may be done in
several ways:
Fill in manually. This option is not feasible in
most of the cases due to the huge volume of
the datasets that must be cleaned.
Fill in with a (distinct from others) value ‘not
available’ or ‘unknown’.
Fill in with a value measuring the central
tendency, for example attribute mean,
median or mode.
Florin Radulescu, Note de curs
27 DMDW-2
Missing values (3)
2. Fill in the missing value - cont.
Fill in with a value measuring the central
tendency but only on a subset (for example,
for labeled datasets, only for examples
belonging to the same class).
The most probable value, if that value may
be determined, for example by decision
trees, expectation maximization (EM),
Bayes, etc.
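A minimal pandas sketch of some of these strategies; the column names and values are invented for illustration (the numeric column is filled with the mean, the categorical one with the mode):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [1000, 1200, np.nan, 900, np.nan],
    "city":   ["Bucharest", None, "Cluj", "Bucharest", None],
})

df["income"] = df["income"].fillna(df["income"].mean())   # fill with a central tendency value
df["city"] = df["city"].fillna(df["city"].mode()[0])      # fill with the most frequent value
print(df)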
28 DMDW-2
Smooth noisy data
29 DMDW-2
Binning
30 DMDW-2
Example
Consider the following ordered data for some attribute:
1, 2, 4, 6, 9, 12, 16, 17, 18, 23, 34, 56, 78, 79, 81
Initial bins:              1, 2, 4, 6, 9  |  12, 16, 17, 18, 23  |  34, 56, 78, 79, 81
Smoothing by bin means:    4, 4, 4, 4, 4  |  17, 17, 17, 17, 17  |  66, 66, 66, 66, 66
Smoothing by bin medians:  4, 4, 4, 4, 4  |  17, 17, 17, 17, 17  |  78, 78, 78, 78, 78
Smoothing by boundaries:   1, 1, 1, 9, 9  |  12, 12, 12, 23, 23  |  34, 34, 81, 81, 81
31 DMDW-2
Result
So the smoothing result is:
Initial: 1, 2, 4, 6, 9, 12, 16, 17, 18, 23, 34, 56, 78, 79, 81
Using the mean: 4, 4, 4, 4, 4, 17, 17, 17, 17, 17, 66, 66,
66, 66, 66
Using the median: 4, 4, 4, 4, 4, 17, 17, 17, 17, 17, 78,
78, 78, 78, 78
Using the bin boundaries: 1, 1, 1, 9, 9, 12, 12, 12, 23, 23,
34, 34, 81, 81, 81
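A minimal Python sketch reproducing the smoothing above for equal-frequency bins of size 5; the rounding of the bin mean and the tie-breaking toward the lower boundary are assumptions that happen to match the example:

def smooth(values, size=5, how="mean"):
    out = []
    for i in range(0, len(values), size):
        b = values[i:i + size]                      # one (already sorted) bin
        if how == "mean":
            out += [round(sum(b) / len(b))] * len(b)
        elif how == "median":
            out += [b[len(b) // 2]] * len(b)        # works for odd-sized bins
        else:                                       # replace by the closest bin boundary
            out += [b[0] if x - b[0] <= b[-1] - x else b[-1] for x in b]
    return out

data = [1, 2, 4, 6, 9, 12, 16, 17, 18, 23, 34, 56, 78, 79, 81]
print(smooth(data, how="mean"))          # 4 ... 17 ... 66
print(smooth(data, how="median"))        # 4 ... 17 ... 78
print(smooth(data, how="boundaries"))    # 1,1,1,9,9, 12,12,12,23,23, 34,34,81,81,81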
32 DMDW-2
Outliers
An outlier (ro: valoare aberanta / punct izolat) is an
attribute value numerically distant from the rest of
the data.
Outliers may sometimes be correct values: for example,
the salary of the CEO of a company may be much bigger
than all other salaries. But in most cases outliers
are noise and must be handled as such.
Outliers must be identified and then removed (or
replaced, as any other noisy value) because many data
mining algorithms are sensitive to outliers.
For example any algorithm using the arithmetic mean
(one of them is k-means) may produce erroneous results
because the mean is very sensitive to outliers.
Florin Radulescu, Note de curs
33 DMDW-2
Identifying outliers
Use of IQR: values more than 1.5 x IQR below
Q1 or above Q3 are potential outliers. Boxplots
may be used to identify these outliers (boxplots
are a method for graphical representation of
data dispersion).
Use of standard deviation: values that are
more than two standard deviations away from
the mean for a given attribute are also
potential outliers.
Clustering. After clustering a certain dataset,
some points lie outside any cluster (or far
away from any cluster center).
Florin Radulescu, Note de curs
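A minimal sketch of the 1.5 x IQR rule with NumPy; the sample values are invented and numpy's default percentile interpolation is used:

import numpy as np

values = np.array([3, 4, 5, 5, 6, 6, 7, 8, 9, 95])   # 95 is an obvious outlier
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(values[(values < low) | (values > high)])       # -> [95]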
34 DMDW-2
Road Map
Data types
Measuring data
Data cleaning
Data integration
Data transformation
Data reduction
Data discretization
Summary
Florin Radulescu, Note de curs
35 DMDW-2
Objectives
36 DMDW-2
Schema integration
Must identify the translation of every source
scheme to the final scheme (entity identification
problem)
Subproblems:
The same thing is called differently in every data
source. Example: the customer id may be called
Cust-ID, Cust#, CustID or CID in different sources.
Different things are called by the same name in
different sources. Example: for employee data, the
attribute ‘City’ means the city of residence in one source
and the city of birth in another source.
Florin Radulescu, Note de curs
37 DMDW-2
Duplicates
38 DMDW-2
Redundancy
Redundancy: Some information may be
deduced / computed.
For example age may be deduced from
birthdate, annual salary may be computed from
monthly salary and other bonuses recorded for
each employee.
Redundancy must be removed from the dataset
before running the data mining algorithm
Note that in existing data warehouses some
redundancy is allowed.
Florin Radulescu, Note de curs
39 DMDW-2
Inconsistencies
Inconsistencies are conflicting values for a set of
attributes.
Example Birthdate = January 1, 1980, Age = 12
represents an obvious inconsistency but we may
find other inconsistencies that are not so
obvious.
For detecting inconsistencies extra knowledge
about data is necessary: for example, the
functional dependencies attached to a table
scheme can be used.
Available metadata describing the content of the
dataset may help in removing inconsistencies.
Florin Radulescu, Note de curs
40 DMDW-2
Road Map
Data types
Measuring data
Data cleaning
Data integration
Data transformation
Data reduction
Data discretization
Summary
Florin Radulescu, Note de curs
41 DMDW-2
Objectives
42 DMDW-2
Normalization
All attributes are scaled to fit a specified range:
0 to 1,
-1 to 1 or generally
|v| <= r where r is a given positive value.
Needed when the importance of some attributes
appears bigger only because the range of the values of
those attributes is bigger.
Example: the Euclidean distance between A(0.5,
101) and B(0.01, 2111) is ≈ 2010, determined
almost exclusively by the second dimension.
43 DMDW-2
Normalization
We can achieve normalization using:
Min-max normalization:
vnew = (v – vmin) / (vmax – vmin)
For positive values the formula is:
vnew = v / vmax
z-score normalization (σ is the standard deviation):
vnew = (v – vmean) / σ
Decimal scaling: vnew = v / 10^n
where n is the smallest integer such that all new values become
(in absolute value) less than the given range r (for r = 1, all
new values of v are <= 1).
Florin Radulescu, Note de curs
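A minimal sketch of the three normalizations above applied to a made-up list of values (population standard deviation, as in the course; the way n is computed for decimal scaling assumes the maximum absolute value is not an exact power of 10):

import math

v = [200.0, 300.0, 400.0, 950.0]
vmin, vmax = min(v), max(v)
mean = sum(v) / len(v)
std = math.sqrt(sum((x - mean) ** 2 for x in v) / len(v))

min_max = [(x - vmin) / (vmax - vmin) for x in v]    # values scaled to [0, 1]
z_score = [(x - mean) / std for x in v]              # mean 0, standard deviation 1
n = math.ceil(math.log10(max(abs(x) for x in v)))    # smallest n with |x| / 10^n < 1 here
decimal = [x / 10 ** n for x in v]

print(min_max)
print(z_score)
print(decimal)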
44 DMDW-2
Feature construction
New attribute construction is called also feature
construction.
It means: building new attributes based on the values of
existing ones.
Example: if the dataset contains an attribute ‘Color’ with
only three distinct values {Red, Green, Blue} then three
attributes may be constructed: ‘Red’, ‘Green’ and ‘Blue’
where only one of them equals 1 (based on the value of
‘Color’) and the other two 0.
Another example: use a set of rules, decision trees or
other tools to build new attribute values from existing
ones. New attributes will contain the class labels
attached by the rules / decision tree used / labeling tool.
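A minimal pandas sketch of the 'Color' example above (one-hot encoding); the sample rows are invented:

import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})
one_hot = pd.get_dummies(df["Color"], dtype=int)   # new binary attributes Blue, Green, Red
print(pd.concat([df, one_hot], axis=1))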
45 DMDW-2
Summarization
At this step aggregate functions may be used to
add summaries to the data.
Examples: adding sums for daily, monthly and
annual sales, counts and averages for a number
of customers or transactions, and so on.
All these summaries are used for the ‘slice and
dice’ process when data is stored in a data
warehouse.
The result is a data cube and each summary
information is attached to a level of granularity.
Florin Radulescu, Note de curs
46 DMDW-2
Road Map
Data types
Measuring data
Data cleaning
Data integration
Data transformation
Data reduction
Data discretization
Summary
Florin Radulescu, Note de curs
47 DMDW-2
Objectives
48 DMDW-2
Reduction methods (1)
Methods that may be used for data reduction (see [Han,
Kamber 06]) :
Data cube aggregation, already discussed.
Attribute selection: keep only relevant attributes. This
can be made by:
stepwise forward selection (start with an empty set and add
attributes),
stepwise backward elimination (start with all attributes and
remove some of them one by one)
a combination of forward selection and backward elimination.
decision tree induction: after building the decision tree, only
attributes used for decision nodes are kept.
49 DMDW-2
Reduction methods (2)
Dimensionality reduction: encoding mechanisms
are used to reduce the data set size or compress
data.
A popular method is Principal Component Analysis
(PCA): given N data vectors having n dimensions,
find k <= n orthogonal vectors (called principal
components) that can best be used to represent the data.
A PCA example is presented on the following slide,
for a multivariate Gaussian distribution (source:
wikipedia).
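A minimal PCA sketch with scikit-learn; the data is randomly generated with one nearly redundant dimension, purely for illustration:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)   # third dimension almost determined by the first

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)                     # 100 x 2 representation of the data
print(pca.explained_variance_ratio_)                 # most of the variance is kept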
50 DMDW-2
PCA example
PCA for a multivariate Gaussian distribution (source:
http://2011.igem.org/Team:USTC-Software/parameter )
51 DMDW-2
Reduction methods (3)
Numerosity reduction: the data are replaced
by smaller data representations such as
parametric models (only the model parameters
are stored in this case) or nonparametric
methods: clustering, sampling, histograms.
Discretization and concept hierarchy
generation, discussed in the following
paragraph.
52 DMDW-2
Road Map
Data types
Measuring data
Data cleaning
Data integration
Data transformation
Data reduction
Data discretization
Summary
Florin Radulescu, Note de curs
53 DMDW-2
Objectives
There are many data mining algorithms that
cannot use continuous attributes. Replacing
these continuous values with discrete ones is
called discretization.
Even for discrete attributes, it is better to have a
reduced number of values, leading to a reduced
representation of the data. This may be achieved
using concept hierarchies.
54 DMDW-2
Discretization (1)
55 DMDW-2
Discretization (2)
Popular methods to perform discretization (cont.):
2. Histograms: like binning, histograms partition the values of an
attribute into buckets. Each bucket has a different label and labels
replace values.
3. Entropy based intervals: each attribute value is considered a
potential split point (between two intervals) and an information
gain is computed for it (reduction of entropy by splitting at that
point). Then the value with the greatest information gain is
picked. In this way intervals may be constructed in a top-down
manner.
4. Cluster analysis: after clustering, all values in the same cluster
are replaced with the same label (the cluster-id for example)
56 DMDW-2
Concept hierarchies
57 DMDW-2
Concept hierarchies
For categorical data the goal is to replace a bigger set of
values with a smaller one (categorical data are discrete
by definition):
Manually define a partial order for a set of attributes. For
example the set {Street, City, Department, Country} is partially
ordered: Street < City < Department < Country. In that case we
can construct an attribute ‘Localization’ at any level of this
hierarchy, by using the n rightmost attributes (n = 1 .. 4).
Specify (manually) high-level concepts for the sets of low-level
attribute values associated with them. For example {Muntenia, Oltenia,
Dobrogea} → Tara_Romaneasca.
Automatically identify a partial order between attributes, based
on the fact that high level concepts are represented by attributes
containing a smaller number of values compared with low level
ones.
58 DMDW-2
Summary
This second course presented:
Data types: categorical vs. numerical, the four scales (nominal,
ordinal, interval and ratio) and binary data.
A short presentation of data preprocessing steps and some ways to
extract important characteristics of data: central tendency (mean,
mode, median, etc) and dispersion (range, IQR, five-number
summary, standard deviation and variance).
A description of every preprocessing step:
cleaning,
integration,
transformation,
reduction and
discretization
59 DMDW-2
References
[Han, Kamber 06] Jiawei Han, Micheline Kamber, Data Mining:
Concepts and Techniques, Second Edition, Morgan Kaufmann
Publishers, 2006, 47-101
[Stevens 46] Stevens, S.S, On the Theory of Scales of
Measurement. Science June 1946, 103 (2684): 677–680.
[Wikipedia] Wikipedia, the free encyclopedia, en.wikipedia.org
[Liu 11] Bing Liu, 2011. CS 583 Data mining and text mining course
notes, http://www.cs.uic.edu/~liub/teach/cs583-fall-11/cs583.html
60 DMDW-2
Association Rules and
Sequential Patterns
Prof.dr.ing. Florin Radulescu
Universitatea Politehnica din Bucureşti
Road Map
2 DMDW-3
Objectives
Association rule learning was introduced in the article
“Mining Association Rules Between Sets of Items in
Large Databases” by Agrawal, Imielinski and Swami (see
references), article presented at the 1993 SIGMOD
Conference (ACM SIGMOD means ACM Special
Interest Group on Management of Data).
One of the best known applications is finding
relationships (rules) between products as recorded by
POS systems in supermarkets.
For example, the statement that 85% of baskets that
contain bread also contain mineral water is a rule with
bread as antecedent and mineral water as consequent.
Florin Radulescu, Note de curs
3 DMDW-3
Objectives
The original article (Agrawal et al.) lists some
examples of the expected results:
Find all rules that have “Diet Coke” as
consequent,
Find all rules that have “bagels” in the
antecedent,
Find all rules that have “sausage” in the
antecedent and “mustard” in the consequent,
Find all the rules relating items located on
shelves A and B in the store,
Find the “best” k rules (considering rule support)
that have “bagels” in the consequent
Florin Radulescu, Note de curs
4 DMDW-3
Frequent itemsets and rules
5 DMDW-3
Items and transactions
Let I = {i1, i2, …, in} be a set of items. For example, items may
be all products sold in a supermarket or all words contained in
some documents.
A transaction t is a set of items, with t ⊆ I. Examples of
transactions are market baskets containing products or
documents containing words.
A transaction dataset (or database) T is a set of
transactions, T = {t1, t2, …, tm}. Each transaction may contain
a different number of items and the dataset may be stored in
a DBMS-managed database or in a text file.
An itemset S is a subset of I. If v = |S| is the number of items
in S (or the cardinal of S), then S is called a v-itemset.
6 DMDW-3
Items and transactions
The support of an itemset X, sup(X), is equal to the
number (or proportion) of transactions in T containing
X. The support may be given as an absolute value
(number of transactions) or proportion or percent
(proportion of transactions).
In many cases we want to find itemsets with a support
greater than or equal to a given value (percent) s. Such
an itemset is called a frequent itemset and, in the
market basket example, it contains items that can be
found together in many baskets (where the measure
of ‘many’ is s).
These frequent itemsets are the source of all ‘powerful’
association rules.
Florin Radulescu, Note de curs
7 DMDW-3
Example 1
Let us consider a dataset containing market basket
transactions:
I = {laptop, mouse, tablet, hard-drive, monitor,
keyboard, DVD-drive, CD-drive, flash-memory, . . .}
T = {t1, t2, …, tm}
t1 = {laptop, mouse, tablet}
t2 = {hard-drive, monitor, laptop, keyboard, DVD-drive}
. . .
tm = {keyboard, mouse, tablet)
8 DMDW-3
Example 2
If items are words and transactions are documents,
where each document is considered a bag of words,
then we can have:
T = {Doc1, Doc2, …, Doc6}
Doc1 = {rule, tree, classification}
Doc2 = {relation, tuple, join, algebra, recommendation}
Doc3 = {variable, loop, procedure, rule}
Doc4 = {clustering, rule, tree, recommendation}
Doc5 = {join, relation, selection, projection,
classification}
Doc6 = {rule, tree, recommendation}
9 DMDW-3
Example 2
In that case:
sup({rule, tree}) = 3 or 50% or 0.5
sup({relation, join}) = 2 or 33.33% or 1/3
If the threshold is s = 50% (or 0.5) then {rule, tree} is
frequent and {relation, join} is not.
10 DMDW-3
Frequent itemsets and rules
11 DMDW-3
Association rules
12 DMDW-3
Association rules
If m is the number of transactions in T then:
sup(X → Y) = sup(X ∪ Y) - as absolute value, or
sup(X → Y) = sup(X ∪ Y) / m - as proportion
conf(X → Y) = sup(X ∪ Y) / sup(X)
where the support of an itemset is given as an absolute value
(number of transactions).
[Diagram: the transactions containing X ∪ Y are a subset of the
transactions containing X, which are a subset of T, with |T| = m]
Florin Radulescu, Note de curs
13 DMDW-3
Association rules
14 DMDW-3
Association rules
15 DMDW-3
Finding association rules
16 DMDW-3
Finding association rules
That means that the process of finding all the rules
given the minimum support and the minimum
confidence has three steps:
Step 1. Find all frequent itemsets containing at least
two items, considering the given minimum support
minsup.
Step 2. For each frequent itemset U found in step 1,
list all splits (X, Y) with X ∩ Y = ∅ and X ∪ Y = U. Each
split generates a rule X → Y.
Step 3. Compute the confidence of each rule. Keep
only the rules with confidence at least minconf.
Florin Radulescu, Note de curs
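A minimal Python sketch of steps 2 and 3, splitting a frequent itemset U into all (X, Y) pairs and keeping the rules that reach minconf; the document transactions below are a simplified, made-up version of Example 2:

from itertools import combinations

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def rules_from(U, transactions, minconf):
    rules = []
    for r in range(1, len(U)):
        for left in combinations(U, r):
            X = frozenset(left)
            Y = U - X
            conf = support(U, transactions) / support(X, transactions)
            if conf >= minconf:
                rules.append((set(X), set(Y), conf))
    return rules

T = [{"rule", "tree", "classification"}, {"rule", "tree", "recommendation"},
     {"rule", "tree"}, {"rule"}]
print(rules_from(frozenset({"rule", "tree"}), T, minconf=0.8))   # keeps only tree -> rule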
17 DMDW-3
Example 3
Consider the set of six transactions in Example 2:
Doc1 = {rule, tree, classification}
Doc2 = {relation, tuple, join, algebra,
recommendation}
Doc3 = {variable, loop, procedure, rule}
Doc4 = {clustering, rule, tree, recommendation}
Doc5 = {join, relation, selection, projection,
classification}
Doc6 = {rule, tree, recommendation}
18 DMDW-3
Example 3
With a minimum support of 50% we find that
{rule, tree} is a frequent itemset. The two rules
derived from this itemset have the same
support:
rule → tree
with sup = 50% and conf = 3 / 4 = 75%, and
tree → rule
with sup = 50% and conf = 3 / 3 = 100%.
If the minimum confidence required is 80% then
only the second rule is kept, the first being
considered not powerful enough.
Florin Radulescu, Note de curs
19 DMDW-3
Frequent itemsets and rules
20 DMDW-3
Goals for mining transactions
Goal 1: Find frequent itemsets. Frequent
itemsets can be used not only to find rules but also
for marketing purposes.
As an example, in a supermarket, frequent
itemsets help marketers place items in an
effort to control the way customers walk through
the store:
Items that are sold together are placed for
example in distant corners of the store such that
customers must go from one product to another
possibly putting other products in the basket on
the way.
Florin Radulescu, Note de curs
21 DMDW-3
Goal 2
22 DMDW-3
Diapers → Beer
In [Whitehorn 06] this example is described as follows:
“Some time ago, Wal-Mart decided to combine the
data from its loyalty card system with that from its
point of sale systems.
The former provided Wal-Mart with demographic data
about its customers, the latter told it where, when and
what those customers bought.
Once combined, the data was mined extensively and
many correlations appeared.
Some of these were obvious; people who buy gin are
also likely to buy tonic. They often also buy lemons.
However, one correlation stood out like a sore thumb
because it was so unexpected.
Florin Radulescu, Note de curs
23 DMDW-3
Diapers → Beer
On Friday afternoons, young American males who
buy diapers (nappies) also have a predisposition
to buy beer.
No one had predicted that result, so no one would
ever have even asked the question in the first place.
Hence, this is an excellent example of the difference
between data mining and querying.”
24 DMDW-3
Goal 3
In [Ullman 03-09] is listed also a third goal for
mining transactions:
Goal 3: Find causalities. In the case of the rule
Diapers → Beer a natural question is if the left
part of the rule (buying diapers) causes the right
part (buy also beer).
Causal rules can be used in marketing: a low
price of diapers will attract diaper buyers and an
increase of the beer price will grow the overall
sales numbers.
Florin Radulescu, Note de curs
25 DMDW-3
Algorithms
There are many algorithms for finding frequent
itemsets and consequently the association rules
in a dataset.
All these algorithms are developed for huge
volumes of data, meaning that the dataset is too
large to be loaded and processed in main
memory.
For that reason, minimizing the number of times
the data are read from disk becomes a key
feature of each algorithm.
Florin Radulescu, Note de curs
26 DMDW-3
Road Map
27 DMDW-3
Apriori algorithm
28 DMDW-3
Apriori principle
The Apriori principle states that any subset of a
frequent itemset is also a frequent itemset.
Example 4: If {1, 2, 3, 4} is a frequent itemset then all
its four subsets with 3 values are also frequent: {1,
2, 3}, {1, 2, 4}, {1, 3, 4} and {2, 3, 4}.
Consequently, each frequent v-itemset is the union
of v frequent (v-1)-itemsets.
That means we can determine the frequent itemsets
of dimension v by examining only the set of all
frequent itemsets of dimension (v-1).
Florin Radulescu, Note de curs
29 DMDW-3
Apriori principle
30 DMDW-3
Apriori principle
It is a level-wise approach where each step
requires a full scan of the dataset (residing on
disk).
A diagram is presented in the next slide where Ci
is the set of candidates for frequent i-itemsets
and Li is the actual set of frequent i-itemsets.
C1 is the set of all items found in transactions
(a subset of I) and may be obtained either as the
union of all transactions in T or by considering
C1 = I (in that case some items may have zero
support)
Florin Radulescu, Note de curs
31 DMDW-3
Using Apriori Principle
C1 → L1 → C2 → L2 → C3 → L3 → … → Ck → Lk
32 DMDW-3
Apriori Algorithm
The algorithm is described in [Agrawal, Srikant 94] and
uses the level-wise approach described before:
A first scan of the dataset leads to the L1 (the set of
frequent items). For each transaction t in T and for
each item a in t the count of a is increased
(a.count++). At the end of the scan L1 will contain all
items with a count at least minsup (given as absolute
value).
For k=2, 3, … the process continues by generating the
set of candidates Ck and then counting the support of
each candidate by a full scan of the dataset.
Process ends when Lk is empty.
33 DMDW-3
Apriori Algorithm
Algorithm Apriori(T)
  L1 = scan(T);
  for (k = 2; Lk-1 ≠ ∅; k++) do
    Ck ← apriori-gen(Lk-1);
    for each transaction t ∈ T do
      for each candidate c ∈ Ck do
        if c is contained in t then
          c.count++;
      end
    end
    Lk ← {c ∈ Ck | c.count ≥ minsup}
  end
  return L = ∪k Lk;
Florin Radulescu, Note de curs
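A compact in-memory Python sketch of this level-wise search (join, prune, count); it ignores the disk-resident aspect and is not the original implementation:

from itertools import combinations

def apriori(transactions, minsup):
    transactions = [frozenset(t) for t in transactions]

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    L = [{frozenset([i]) for i in items if count(frozenset([i])) >= minsup}]
    k = 2
    while L[-1]:
        # join: unite frequent (k-1)-itemsets that differ in a single item
        candidates = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k}
        # prune: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in L[-1] for s in combinations(c, k - 1))}
        # count candidate supports with one pass over the data
        L.append({c for c in candidates if count(c) >= minsup})
        k += 1
    return [set(s) for level in L for s in level]

# the transaction dataset of Example 6, with minsup = 2
print(apriori([{1, 2, 3, 5}, {2, 3, 4}, {3, 4, 5}], minsup=2))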
34 DMDW-3
Candidate generation
Candidate generation is also described in the
original algorithm as having two steps: the join
step and the prune step. The first builds a larger
set of candidates and the last removes some of
them proved impossible to be frequent.
In the join step each candidate is obtained from
two different frequent (k-1)-itemsets containing
(k-2) identical items:
Ck = { {i1, …, ik-2, ik-1, i’k-1} | p = {i1, …, ik-2, ik-1} ∈ Lk-1,
q = {i1, …, ik-2, i’k-1} ∈ Lk-1, ik-1 < i’k-1 }
Florin Radulescu, Note de curs
35 DMDW-3
Candidate generation
36 DMDW-3
Join
37 DMDW-3
Join and prune
38 DMDW-3
Example 5
Consider again the set of six transactions in
Example 2:
Doc1 = {rule, tree, classification}
Doc2 = {relation, tuple, join, algebra,
recommendation}
Doc3 = {variable, loop, procedure, rule}
Doc4 = {clustering, rule, tree, recommendation}
Doc5 = {join, relation, selection, projection,
classification}
Doc6 = {rule, tree, recommendation}
and a minimum support of 50% (minsup=3).
Florin Radulescu, Note de curs
39 DMDW-3
Step 1
40 DMDW-3
Step 2
Considering:
rule < tree < recommendation
From the join C2 = { {rule, tree}, {rule,
recommendation}, {tree, recommendation} }.
The prune step does not modify C2.
The second scan of the transaction dataset
leads to the following pair support values:
41 DMDW-3
Step 2
{rule, tree} 3
{rule, recommendation} 2
{tree, recommendation} 2
42 DMDW-3
Example 6
Consider the transaction dataset {(1, 2, 3, 5), (2, 3,
4), (3, 4, 5)} and the minsup s = 50% (i.e. s = 3/2 =
1.5; because s must be an integer, s = 2)
C1 = {1, 2, 3, 4, 5}
L1 = {2, 3, 4, 5}
C2 = {(2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5) }
L2 = {(2, 3), (3, 4), (3, 5)}
After the join step C3 = {(3, 4, 5)} - obtained by
joining (3, 4) and (3, 5).
In the prune step (3, 4, 5) is removed because
its subset (4, 5) is not in L2
Florin Radulescu, Note de curs
43 DMDW-3
Example 6
After the prune step C3 = ∅, so L3 = ∅, and the
process stops. L = L1 ∪ L2 = {(2), (3), (4), (5), (2,
3), (3, 4), (3, 5)} or, as maximal itemsets, L = L2.
The rules generated from itemsets are:
2 → 3, 3 → 2, 3 → 4, 4 → 3, 3 → 5, 5 → 3.
The support of these rules is at least 50%.
Considering a minconf equal to 80%, only 3 rules
have a confidence greater than or equal to minconf:
2 → 3, 4 → 3, 5 → 3 (with conf = 2/2 = 100%).
The rules having 3 in the antecedent have a
confidence of 67% (because sup(3) = 3 and sup(2,
3) = sup(3, 4) = sup(3, 5) = 2).
Florin Radulescu, Note de curs
44 DMDW-3
Apriori summary
45 DMDW-3
Road Map
46 DMDW-3
FP-Growth
47 DMDW-3
Build the FP-tree
48 DMDW-3
Build the FP-tree
49 DMDW-3
Build the FP-tree
Pass 2 - cont.:
Each node has a counter.
If two transactions have the same prefix, the two
branches overlap on the nodes of the common
prefix and the counters of those nodes are
incremented.
Also, nodes holding the same item are linked by
paths orthogonal to the tree branches (node-links).
50 DMDW-3
Example 7
TID Items
1 a, b, c
2 a, b, c, d
3 a, b, f
4 a, b, d
5 c, d, e
51 DMDW-3
Example 7
Item Support
a 4
b 4
c 3
d 3
e 1
f 1
52 DMDW-3
Example 7
53 DMDW-3
Example 7
[Figure: FP-tree built from the transactions after removing the
infrequent items e and f (the transactions become 1: a, b, c;
2: a, b, c, d; 3: a, b; 4: a, b, d; 5: c, d).
Header table (item: support): a: 4, b: 4, c: 3, d: 3.
Tree (node:count): null → a:4 → b:4, with children c:2 (→ d:1) and d:1,
plus a separate branch null → c:1 → d:1.]
54 DMDW-3
Extract frequent itemsets
After building the FP-tree the algorithm
starts to build partial trees (called
conditional FP-trees) ending with a given
item (a suffix).
The item is not present in the tree but all
frequent itemsets generated from that
conditional tree will contain that item.
In building the conditional FP-tree, non-
frequent items are skipped (but the branch
remains if there are still nodes on it).
Florin Radulescu, Note de curs
55 DMDW-3
Extract frequent itemsets
56 DMDW-3
d conditional FP-tree
[Figure: d-conditional FP-tree. The prefix paths ending in d are
(a, b, c): 1, (a, b): 1 and (c): 1, giving conditional supports
a: 2, b: 2, c: 2. Conditional tree: null → a:2 → b:2 → c:1,
plus null → c:1.]
57 DMDW-3
c conditional FP-tree
Because all items have a support below minsup, no
itemset containing d is frequent.
The same situation is for the c conditional FP-tree:
[Figure: c-conditional FP-tree. The prefix paths ending in c are
(a, b): 2 and the empty prefix: 1, giving conditional supports
a: 2, b: 2. Conditional tree: null → a:2 → b:2.]
58 DMDW-3
b conditional FP-tree
b conditional FP-tree:
[Figure: b-conditional FP-tree. The only prefix path ending in b is
(a): 4, giving conditional support a: 4. Conditional tree: null → a:4.]
59 DMDW-3
Results
If there are more than one item with support above or
equal minsup in a conditional FP-tree then the algorithm
is run again against the conditional FP-tree to find
itemsets with more than two items.
For example, if the minsup=2 then from the c conditional
FP-tree the algorithm will produce {a, c} and {b, c}. Then
the same procedure may be run against this tree for
suffix bc, obtaining {a, b, c}. Also from the d conditional
FP-tree first {c, d}, {b, d} and {a, d} are obtained, and
then, for the suffix bd, {a, b, d} is obtained.
Suffix cd leads to infrequent items in the conditional FP-tree and
suffix ad produces {a, d}, already obtained.
60 DMDW-3
Road Map
61 DMDW-3
Data formats
Table format
In this case a dataset is stored in a two columns
table:
Transactions(Transaction-ID, Item) or
T(TID, Item)
where all the lines of a transaction have the same
TID and the primary key contains both columns
(so T does not contain duplicate rows).
62 DMDW-3
Data formats
Text file format
In that case the dataset is a textfile containing a
transaction per line. Each line may contain a
transaction ID (TID) as the first element or this TID
may be missing, the line number being a virtual TID.
Example 8:
10 12 34 67 78 45 89 23 67 90 line 1
789 12 45 678 34 56 32 line 2
........
Also in this case any software package must either
have a native textfile input option or must contain a
conversion module from text to the needed format
Florin Radulescu, Note de curs
63 DMDW-3
Data formats
Custom format
Many data mining packages use a custom format for the input
data.
An example is the ARFF format used by Weka, presented
below. Weka (Waikato Environment for Knowledge
Analysis) is a popular open source suite of machine
learning software developed at the University of Waikato, New
Zealand.
ARFF stands for Attribute-Relation File Format. An .arff file is
an ASCII file containing a table (also called a relation). The file
has two parts:
A Header part containing the relation name, the list of
attributes and their types.
A Data part containing the row values of the relation,
comma separated.
Florin Radulescu, Note de curs
64 DMDW-3
ARFF example
% 1. Title: Iris Plants Database
%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
% (c) Date: July, 1988
%
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
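A minimal sketch of loading such a file in Python with SciPy; the file name 'iris.arff' is an assumption:

import pandas as pd
from scipy.io import arff

data, meta = arff.loadarff("iris.arff")          # structured array + attribute metadata
df = pd.DataFrame(data)
df["class"] = df["class"].str.decode("utf-8")    # nominal values are read as byte strings
print(meta)
print(df.head())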
65 DMDW-3
Road Map
66 DMDW-3
Class association rules (CARs)
As for classical association rules, the model for CARs
considers a set of items I = {i1, i2, …, in} and a set of
transactions T = {t1, t2, …, tm}. The difference is that
each transaction is labeled with a class label c, where
c ∈ C, with C containing all class labels and C ∩ I = ∅.
A class association rule is a construction with the
following syntax:
X → y
where X ⊆ I and y ∈ C.
The definition of the support and confidence of a
class association rule is the same as for
association rules.
67 DMDW-3
Example 10
Consider the set of six transactions in Example 2, now
labeled with class labels from C = {database,
datamining, programming}:
Doc1 {rule, tree, classification} datamining
Doc2 {relation, tuple, join, algebra, recommendation} database
Doc3 {variable, loop, procedure, rule} programming
Doc4 {clustering, rule, tree, recommendation} datamining
Doc5 {join, relation, selection, projection, classification} database
Doc6 {rule, tree, recommendation} datamining
68 DMDW-3
Example 10
Then the CARs:
rule → datamining;
recommendation → database
has:
sup(rule → datamining) = 3/6 = 50%,
conf(rule → datamining) = 3/4 = 75%.
sup(recommendation → database) = 1/6 ≈ 17%,
conf(recommendation → database) = 1/3 ≈ 33%
For a minsup=50% and a minconf=50% the first rule
stands and the second is rejected.
Florin Radulescu, Note de curs
69 DMDW-3
Mining CARs
Algorithm for mining CARs using a modified Apriori
algorithm (see [Liu 11] ):
At the first pass over the data the algorithm computes F1,
where F1 = {the set of CARs with a single item on the left side
satisfying the given minsup and minconf}.
At step k, Ck is built from Fk-1 and then, passing
through the data and counting, for each member of Ck,
the support and the confidence, Fk is determined.
Candidate generation is almost the same as for
association rules with the only difference that in the
join step only CARs with the same class in the right
side are joined.
70 DMDW-3
Candidates generation
71 DMDW-3
Road Map
72 DMDW-3
Sequential patterns model
Itemset: a set of n distinct items
I = {i1, i2, …, in }
Event: a non-empty collection of items; we can
assume that items are in a given (e.g.
lexicographic) order: (i1,i2 … ik)
Sequence : an ordered list of events: < e1 e2 …
em >
Length of a sequence: the number of items in
the sequence
Example: <AM, CDE, AE> has length 7
Florin Radulescu, Note de curs
73 DMDW-3
Sequential patterns model
Size of a sequence: the number of itemsets in the
sequence
Example: <AM, CDE, AE> has size 3
K-sequence : sequence with k items, or with
length k
Example: <B, AC> is a 3-sequence
Subsequence and supersequence: <e1 e2 … eu>
is a subsequence of (or included in) <f1 f2 … fv> (and
the latter is a supersequence of the former, or contains
that sequence) if there are some integers
1 ≤ j1 < j2 < … < ju-1 < ju ≤ v such that e1 ⊆ fj1
and e2 ⊆ fj2 and … and eu ⊆ fju.
Florin Radulescu, Note de curs
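A minimal Python sketch of this inclusion test; events are represented as sets of items and a sequence as a list of events (greedy left-to-right matching):

def is_subsequence(s1, s2):
    j = 0
    for e in s1:
        # find the next event of s2 (preserving order) that contains event e
        while j < len(s2) and not set(e) <= set(s2[j]):
            j += 1
        if j == len(s2):
            return False
        j += 1
    return True

print(is_subsequence([{"A"}, {"B", "C"}], [{"A", "B"}, {"E"}, {"A", "B", "C", "D"}]))  # True
print(is_subsequence([{"A", "B"}, {"C"}], [{"A", "B", "C"}]))                           # False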
74 DMDW-3
Sequential patterns model
Sequence database X: a set of sequences
Frequent sequence (or sequential pattern):
a sequence included in at least s
members of the sequence database X;
s is the user-specified minimum support.
The number of sequences from X containing
a given sequence is called the support of
that sequence.
So, a frequent sequence is a sequence with
a support at least s where s is the minsup
specified by the user.
Florin Radulescu, Note de curs
75 DMDW-3
Example 11
<A, BC> is a subsequence of <AB, E, ABCD>
<AB, C> is not a subsequence of <ABC>
Consider a minsup=50% and the following sequence
database:
Sequence ID Sequence
1 <A, B, C>
2 <AB, C, AD>
3 <ABC, BCE>
5 <B, E>
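A minimal Python sketch of containment and support for this model, representing each event as a set (the function names are illustrative):

# Minimal sketch: checking sequence containment and computing the support of a
# candidate sequence in the sequence database of Example 11 (events as sets).
def is_subsequence(s, t):
    """True if sequence s = [e1, e2, ...] is contained in sequence t = [f1, f2, ...]:
    there must exist j1 < j2 < ... with e1 included in f_j1, e2 in f_j2, ..."""
    j = 0
    for event in s:
        while j < len(t) and not event <= t[j]:
            j += 1
        if j == len(t):
            return False
        j += 1
    return True

db = [
    [{"A"}, {"B"}, {"C"}],
    [{"A", "B"}, {"C"}, {"A", "D"}],
    [{"A", "B", "C"}, {"B", "C", "E"}],
    [{"B"}, {"E"}],
]

def support(candidate, database):
    return sum(is_subsequence(candidate, seq) for seq in database) / len(database)

print(is_subsequence([{"A"}, {"B", "C"}], [{"A", "B"}, {"E"}, {"A", "B", "C", "D"}]))  # True
print(is_subsequence([{"A", "B"}, {"C"}], [{"A", "B", "C"}]))                          # False
print(support([{"B"}, {"C"}], db))   # <B, C> is contained in sequences 1, 2 and 3 -> 0.75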
76 DMDW-3
Example 11
77 DMDW-3
Algorithms
Apriori
GSP (Generalized Sequential Pattern)
FreeSpan (Frequent pattern-projected
Sequential pattern mining)
PrefixSpan (Prefix-projected Sequential
pattern mining)
SPADE (Sequential PAttern Discovery
using Equivalence classes)
Florin Radulescu, Note de curs
78 DMDW-3
GSP Algorithm
79 DMDW-3
GSP Algorithm
80 DMDW-3
Example 12
81 DMDW-3
Summary
This third course presented:
What are frequent itemsets and rules and their
relationship
Apriori and FP-growth algorithms for discovering
frequent itemsets.
Data formats for discovering frequent itemsets
What are class association rules and how they can be
mined
An introduction to sequential patterns and the GSP
algorithm
Next week: Supervised learning – part 1.
Florin Radulescu, Note de curs
82 DMDW-3
References
[Agrawal, Imielinski, Swami 93] R. Agrawal; T. Imielinski; A. Swami:
Mining Association Rules Between Sets of Items in Large Databases",
SIGMOD Conference 1993: 207-216, (http://rakesh.agrawal-
family.com/papers/sigmod93assoc.pdf)
[Agrawal, Srikant 94] Rakesh Agrawal and Ramakrishnan Srikant. Fast
algorithms for mining association rules in large databases. Proceedings
of the 20th International Conference on Very Large Data Bases, VLDB,
pages 487-499, Santiago, Chile, September 1994
(http://rakesh.agrawal-family.com/papers/vldb94apriori.pdf)
[Srikant, Agrawal 96] R. Srikant, R. Agrawal: "Mining Sequential
Patterns: Generalizations and Performance Improvements", to appear
in Proc. of the Fifth Int'l Conference on Extending Database Technology
(EDBT), Avignon, France, March 1996, (http://rakesh.agrawal-
family.com/papers/edbt96seq.pdf)
[Liu 11] Bing Liu, 2011. Web Data Mining, Exploring Hyperlinks,
Contents, and Usage Data, Second Edition, Springer, chapter 2.
[Ullman 03-09] Jeffrey Ullman, Data Mining Lecture Notes, 2003-2009,
web page: http://infolab.stanford.edu/~ullman/mining/mining.html
Florin Radulescu, Note de curs
83 DMDW-3
References
[Wikipedia] Wikipedia, the free encyclopedia, en.wikipedia.org
[Whitehorn 06] Mark Whitehorn, The parable of the beer and diapers,
web page: http://www.theregister.co.uk/2006/08/15/beer_diapers/
[Silverstein et al. 00] Silverstein, C., Brin, S., Motwani, R., Ullman, J. D.
2000. Scalable techniques for mining causal structures. Data Mining
Knowl. Discov. 4, 2–3, 163–192., www.vldb.org/conf/1998/p594.pdf
[Verhein 08] Florian Verhein, Frequent Pattern Growth (FP-Growth)
Algorithm, An Introduction, 2008,
http://www.florian.verhein.com/teaching/2008-01-09/fp-growth-
presentation_v1%20(handout).pdf
[Pietracaprina, Zandolin 03] Andrea Pietracaprina and Dario Zandolin:
Mining Frequent Itemsets using Patricia Tries,
[Zhao, Bhowmick 03] Qiankun Zhao, Sourav S. Bhowmick, Sequential
Pattern Mining: A Survey, Technical Report, CAIS, Nanyang
Technological University, Singapore, No. 2003118 , 2003,
(http://cs.nju.edu.cn/zhouzh/zhouzh.files/course/dm/reading/reading04/
zhao_techrep03.pdf)
Florin Radulescu, Note de curs
84 DMDW-3
Supervised Learning
- Part 1 -
Road Map
2 DMDW-4
Objectives
3 DMDW-4
Definitions
4 DMDW-4
Regression
Regression comes from statistics.
Meaning: predicting a value of a given
continuous variable based on the values of other
variables, assuming a linear or nonlinear model
of dependency ([Tan, Steinbach, Kumar 06]).
Used in prediction and forecasting - its use
overlaps machine learning.
Regression analysis is also used to understand
the relationships between independent variables
and dependent variables and can be used to
infer causal relationships between them.
Florin Radulescu, Note de curs
5 DMDW-4
Example
6 DMDW-4
Classification
Input:
A set of k classes C = {c1, c2, …, ck}
A set of n labeled items D = {(d1, ci1), (d2, ci2), …, (dn,
cin)}. The items are d1, …, dn, each item dj being labeled
with a class cij ∈ C. D is called the training set.
For calibrating some algorithms, a validation set is
also required. This validation set also contains labeled
items, not included in the training set.
Output:
A model or method for classifying new items.
The set of new items that will be classified using this
model/method is called the test set
Florin Radulescu, Note de curs
7 DMDW-4
Example. Model: decision tree
8 DMDW-4
Input data format
9 DMDW-4
Play tennis dataset
Day Outlook Temperature Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
10 DMDW-4
Approaches
The interest in supervised learning is shared
between:
statistics,
data mining and
artificial intelligence
There is a wide range of problems solved
by supervised learning techniques, so the
number of algorithms and methods in this
category is very large.
There are many approaches and in this
course (and the next) only some of them are
covered, as follows:
Florin Radulescu, Note de curs
11 DMDW-4
Decision trees
Decision trees: an example is the UPU
decision tree as described in the previous
example.
In a decision tree non-leaf nodes contain
decisions based on the attributes of the
examples (attributes of argument di) and
each leaf ci is a class from C.
ID3 and C4.5 - two well-known algorithms
for building decision trees - are presented
in this lesson.
Florin Radulescu, Note de curs
12 DMDW-4
Rule induction systems
13 DMDW-4
Rule induction systems
[Figure: example decision tree fragment with nodes Residence (branches Bucharest / Other) and Fails (branches >3 / <=3); one leaf is Class = No]
14 DMDW-4
Rule induction systems
15 DMDW-4
Classification using association rules
16 DMDW-4
Naïve Bayesian classification
17 DMDW-4
Support vector machines
18 DMDW-4
KNN
19 DMDW-4
Ensemble methods
20 DMDW-4
Road Map
21 DMDW-4
Accuracy and error rate
For estimating the efficiency of a classifier several
measures may be used:
Accuracy (or predictive accuracy) is the proportion of
correctly classified test examples :
22 DMDW-4
Other measures
In some cases, where examples are classified in only
two classes (called Positive and Negative), other
measures can also be defined.
Consider the confusion matrix containing the number of
correctly and incorrectly classified examples (Positive
examples as well as Negative examples):
23 DMDW-4
Other measures
TP = the number of correct classifications for Positive examples.
TN = the number of correct classifications for Negative
examples.
FP = the number of incorrect classifications for Negative
examples.
FN = the number of incorrect classifications for Positive
examples.
Precision is the proportion of the correctly classified
Positive examples in the set of examples classified as
Positive:
Precision = TP / (TP + FP)
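A minimal Python sketch computing accuracy and precision from the four counts, together with recall and the F-score that are referenced later in the course (the counts used below are made up, for illustration only):

# Minimal sketch: evaluation measures from the confusion matrix counts TP, TN, FP, FN.
def evaluate(TP, TN, FP, FN):
    accuracy  = (TP + TN) / (TP + TN + FP + FN)   # proportion of correct classifications
    precision = TP / (TP + FP)                    # correct Positives among predicted Positives
    recall    = TP / (TP + FN)                    # correct Positives among actual Positives
    f_score   = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

print(evaluate(TP=40, TN=45, FP=5, FN=10))   # (0.85, ~0.889, 0.8, ~0.842)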
24 DMDW-4
Other measures
25 DMDW-4
Other measures
26 DMDW-4
Other measures
27 DMDW-4
Other measures
28 DMDW-4
Evaluation methods
29 DMDW-4
The holdout method
In this case the data set D is split in two: a
training set and a test set.
The test set is also called the holdout set (hence
the name of the method).
The classifier obtained using the training set
is used for classifying the examples from the
test set.
Because these examples are also labeled,
accuracy, precision, recall and other
measures can then be computed and, based
on them, the classifier is evaluated.
Florin Radulescu, Note de curs
30 DMDW-4
Cross validation method
There are several versions of cross validation:
1. k-fold cross validation. The data set D is split in
k disjoint subsets of the same size. For each
subset, a classifier is built and run using that
subset as test set and the union of the k-1
remaining subsets as training set. In this way k
values for accuracy are obtained (one for each
classifier). The mean of these values is the final
accuracy. The usual value for k is 10 (see the
sketch after this list).
2. 2-fold cross validation. For k=2 the above
method has the advantage of using large sets
both for training and testing.
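The sketch below illustrates k-fold cross validation; train and accuracy_of stand for an arbitrary learning algorithm and its evaluation and are only placeholders:

# Minimal sketch of k-fold cross validation.
import random

def k_fold_cross_validation(data, k, train, accuracy_of):
    data = data[:]                            # work on a shuffled copy of the labeled set
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]    # k disjoint subsets of (almost) equal size
    accuracies = []
    for i in range(k):
        test_set = folds[i]
        training_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        classifier = train(training_set)
        accuracies.append(accuracy_of(classifier, test_set))
    return sum(accuracies) / k                # the final (mean) accuracy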
Florin Radulescu, Note de curs
31 DMDW-4
Cross validation method
32 DMDW-4
Bootstrap method
33 DMDW-4
Bootstrap method
34 DMDW-4
Why 63.2?
From Data Mining: Concepts and Techniques*: Suppose we
are given a data set of d tuples. “Where does the figure,
63.2%, come from?” Each tuple has a probability of 1/d of
being selected, so the probability of not being chosen is (1-
1/d).
We have to select d times, so the probability that a tuple will
not be chosen during this whole time is (1 - 1/d)^d.
If d is large, this probability approaches e^(-1) ≈ 0.368. Thus,
36.8% of the tuples will not be selected for training and thereby
end up in the test set, and the remaining 63.2% will form the
training set.
* Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Second
Edition, 2006, Morgan Kaufman, page 365.
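A quick numeric check of this argument:

# The chance that a tuple is never picked in d draws with replacement is
# (1 - 1/d)**d, which tends to e**-1 as d grows.
import math

for d in (10, 100, 1000, 10**6):
    print(d, (1 - 1/d) ** d)
print("e^-1 =", math.exp(-1))   # ~0.3679, so ~36.8% unseen and ~63.2% in the training set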
Florin Radulescu, Note de curs
35 DMDW-4
Scoring and ranking
Sometimes the user is interested only in a single class
(called the Positive class for short), for example
buyers of a certain type of gadgets or players of a
certain game.
If the classifier returns a probability estimate (PE) for
each example in the test set (indicating the likelihood
that the example belongs to the Positive class),
we can score each example by the value of this PE.
After that we can rank all examples based on their PE
and draw a lift curve.
The classifier method is good if the lift curve is way
above the random line in the lift chart – see example.
36 DMDW-4
Scoring and ranking
37 DMDW-4
Example (from [Microsoft])
38 DMDW-4
Example (from [Microsoft])
39 DMDW-4
Example (from [Microsoft])
40 DMDW-4
Lift curve example
Source: http://www.saedsayad.com/model_evaluation_c.htm
41 DMDW-4
Example (from [Microsoft])
If the company randomly selects 5,000
customers, they can expect to receive only
500 responses, based on the typical
response rate. This scenario is what
the random line in the lift chart represents.
However, if the marketing department uses a
mining model to target their mailing, they can
expect a larger response rate because they
can target those customers who are most
likely to respond.
Florin Radulescu, Note de curs
42 DMDW-4
Example (from [Microsoft])
43 DMDW-4
Example (from [Microsoft])
44 DMDW-4
Road Map
45 DMDW-4
What is a decision tree
A very common way to represent a classification
model or algorithm is a decision tree. Having a
training set D and a set of n example attributes A,
each labeled example in D has the form (a1 = v1, a2 = v2,
…, an = vn). Based on these attributes a decision
tree can be built having:
a. Internal nodes that are attributes (no path
contains the same attribute twice).
b. Branches that refer to discrete values (one or more) or
intervals of these attributes. Sometimes more
complex conditions may be used for branching.
46 DMDW-4
What is a decision tree
c. Leaves labeled with classes. For each leaf,
a support and a confidence may be
computed: the support is the proportion of
examples matching the path from the root to that
leaf and the confidence is the classification
accuracy for the examples matching that path.
When passing from decision trees to rules,
each rule has the same support and
confidence as the leaf it comes from.
d. Any example matches a single path of the tree
(so a single leaf, hence a single class).
Florin Radulescu, Note de curs
47 DMDW-4
Example
[Figure: Play Tennis decision tree with internal nodes Outlook, Humidity and Wind; the leaves carry (support, confidence) pairs: (3/14, 3/3), (2/14, 2/2), (4/14, 4/4), (3/14, 3/3), (2/14, 2/2)]
48 DMDW-4
Decision trees
49 DMDW-4
Decision trees
[Figure: decision tree fragment with a Wind node and branches Strong / Weak]
50 DMDW-4
ID3
51 DMDW-4
ID3
The training set is also divided, each branch inheriting
the examples matching the attribute value of the
branch.
The process repeats for each descendant until all
examples have the same class (in that case the node
becomes a leaf labeled with that class) or all attributes
have been used (the node also becomes a leaf, labeled
with the mode value – the majority class).
An attribute cannot be chosen twice on the same path;
from the moment it was chosen for a node it will never
be tested again for the descendants of that node.
Florin Radulescu, Note de curs
52 DMDW-4
Best attribute
53 DMDW-4
Entropy
54 DMDW-4
Information gain
55 DMDW-4
Example
For Play tennis dataset there are four attributes
for the root of the decision tree: Outlook,
Temperature, Humidity and Wind.
The entropy of the whole dataset and the
weighted values for dividing using the four
attributes are:
56 DMDW-4
For each attribute
57 DMDW-4
Best attribute: Outlook
The next table contains the values for entropy
and gain.
The best attribute for the root node is Outlook,
with a maximum gain of 0.25:
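The table itself is not reproduced here, but the entropy and gain values can be recomputed with a short sketch (the Outlook/PlayTennis pairs are copied from the Play tennis dataset slide; base-2 logarithms are assumed):

# Minimal sketch: entropy and information gain on the Play Tennis dataset,
# reproducing the gain of ~0.25 for Outlook.
from collections import Counter
from math import log2

data = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rain", "Yes"),
        ("Rain", "Yes"), ("Rain", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
        ("Sunny", "Yes"), ("Rain", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Overcast", "Yes"), ("Rain", "No")]

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain(rows):
    e_before = entropy([label for _, label in rows])
    weighted = 0.0
    for value in set(v for v, _ in rows):
        subset = [label for v, label in rows if v == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return e_before - weighted

print(entropy([label for _, label in data]))   # ~0.940, entropy of the whole dataset
print(gain(data))                              # ~0.246, the ~0.25 gain of Outlook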
58 DMDW-4
Notes on and extensions of ID3
59 DMDW-4
Notes on and extensions of ID3
3. Sometimes (when only few examples are associated
with leaves) the tree overfits the training data and
does not work well on test examples.
To avoid overfitting the tree may be simplified by
pruning:
Pre-pruning: growing is stopped before normal end. The
leaves are not 100% pure and are labeled with the majority
class (the mode value).
Post-pruning: after running the algorithm some sub-trees
are replaced by leaves. Also in this case the labels are
the mode values of the matching training examples.
Post-pruning is better because in pre-pruning it is hard to
estimate when to stop.
60 DMDW-4
Notes on and extensions of ID3
4. Some attribute A may be continuous. Values for
A may be partitioned in two intervals:
A ≤ t and A > t.
The value of t may be selected as follows:
Sort the examples by A.
Pick the average of any two consecutive values
where the class changes as a candidate.
For each candidate found in the previous step,
compute the gain obtained if partitioning is made using
that value. The candidate with the maximum
gain is chosen for partitioning.
Florin Radulescu, Note de curs
61 DMDW-4
Notes on and extensions of ID3
62 DMDW-4
Notes on and extensions of ID3
5. Attribute cost: some attributes are more expensive
than others (measured not only in money).
It is better for lower-cost attributes to be closer to
the root than other attributes.
For example, for an emergency unit it is better to
test the pulse and temperature first and only when
necessary perform a biopsy.
This may be done by weighting the gain by the cost:
63 DMDW-4
C4.5
C4.5 is the improved version of ID3, and was
developed also by Ross Quinlan (as well as
C5.0). Some characteristics:
Numeric (continuous) attributes are allowed
Missing values are handled sensibly
(see https://www.quora.com/In-simple-language-how-does-C4-5-deal-with-missing-values)
64 DMDW-4
C4.5
The most important improvements from ID3
are:
2. Post pruning is performed in order to reduce the
tree size. The pruning is made only if it reduces
the estimated error. There are two prune
methods:
Sub-tree replacement: A sub-tree is replaced with a leaf
but each sub-tree is considered only after all its sub-
trees. This is a bottom-up approach.
Sub-tree raising: A node is raised and replaces a higher
node. But in this case some examples must be
reassigned. This method is considered less important
and slower than the first.
65 DMDW-4
Road Map
66 DMDW-4
Rules
[Figure: the Play Tennis decision tree again (nodes Wind and Humidity), with the leaf (support, confidence) pairs: (3/14, 3/3), (2/14, 2/2), (4/14, 4/4), (3/14, 3/3), (2/14, 2/2)]
67 DMDW-4
Rules
The rules are (one for each path):
68 DMDW-4
Rule induction
In the case of a set of rules extracted from a decision
tree, rules are mutually exclusive and exhaustive.
But rules may be obtained directly from the training
data set by sequential covering.
A classifier built by sequential covering consists of an
ordered or unordered list of rules (also called a decision
list), obtained as follows:
Rules are learned one at a time
After a rule is learned, the tuples covered by that rule are
removed
The process repeats on the remaining tuples until some
stopping criteria are met (no more training examples, the
quality of a rule returned is below a user-specified
threshold, …)
Florin Radulescu, Note de curs
69 DMDW-4
Finding rules
70 DMDW-4
Ordered rules
The algorithm:
RuleList ← ∅
Rule ← learn-one-rule(D)
while Rule ≠ ∅ AND D ≠ ∅ do
    RuleList ← RuleList + Rule   // append Rule at the end of RuleList
    D ← D – {examples covered by Rule}
    Rule ← learn-one-rule(D)
endwhile
// append the majority class as the last/default rule:
RuleList ← RuleList + {c | c is the majority class}
return RuleList
71 DMDW-4
Learn-one-rule
Function learn-one-rule is built considering all
possible attribute-value pairs (Attribute op Value)
where Value may be also an interval.
The process tries to find the left side of a new
rule and this left side is a condition.
At the end the rule is constructed using as right
side the majority class of the examples covered
by the left side condition.
72 DMDW-4
Learn-one-rule
1. Start with an empty Rule and a set BestRules
containing this rule:
Rule ← ∅
BestRules ← {Rule}
2. For each member b of BestRules and for each
possible attribute-value pair p, evaluate the combined
condition b ∧ p. If this condition is better than Rule,
then it replaces the old value of Rule.
3. At the end of the process a best rule with an
incremented dimension is found. Also in BestRules
the best n combined conditions discovered at this
step are kept (implementing a beam search).
73 DMDW-4
Learn-one-rule
4. The evaluation of a rule may be done using the
entropy of the set containing examples covered
by that rule.
5. Repeat steps 2 and 3 until no more conditions
are added to BestRules. Note that a condition
must pass a given threshold at evaluation time,
so BestRules may have fewer than n members.
6. If Rule is evaluated and found efficient enough
(considering the given threshold), then Rule → c
is returned; otherwise an empty rule is the result.
The class c is the majority class of the examples
covered by Rule.
Florin Radulescu, Note de curs
74 DMDW-4
Unordered rules
The algorithm:
RuleList ← ∅
foreach class c ∈ C do
    D = Pos ∪ Neg    // Pos = {examples of class c from D}
                     // Neg = D – Pos
    while Pos ≠ ∅ do
        Rule ← learn-one-rule(Pos, Neg, c)
        if Rule = ∅
        then
            quit loop
        else
            RuleList ← RuleList + Rule   // append Rule at the end of RuleList
            Pos ← Pos – {examples covered by Rule}
            Neg ← Neg – {examples covered by Rule}
        endif
    endwhile
endfor
return RuleList
75 DMDW-4
learn-one-rule again
For learning a rule two steps are performed: grow a
new rule and then prune it.
Pos and Neg are split in two parts each: GrowPos,
GrowNeg, PrunePos and PruneNeg.
The first part is used for growing a new rule and the
second for pruning.
At the ‘grow’ step a new condition/rule is built, as in
the previous algorithm.
Only the best condition is kept at each step (and not
the best n).
Evaluation for the new best condition C’ obtained by
adding an attribute-value pair to C is made using a
different gain:
Florin Radulescu, Note de curs
76 DMDW-4
learn-one-rule again
where:
p0, n0: the number of positive/negative examples
covered by C in GrowPos/ GrowNeg.
p1, n1: the number of positive/negative examples
covered by C’ in GrowPos/ GrowNeg.
The rule maximizing this gain is returned by the
‘grow’ step.
Florin Radulescu, Note de curs
77 DMDW-4
learn-one-rule again
At the ‘prune’ step, sub-conditions are deleted
from the rule and the deletion that maximizes the
function below is chosen:
78 DMDW-4
IREP
procedure IREP(Pos, Neg)
begin
Ruleset := ∅
while Pos ≠ ∅ do
// grow and prune a new rule
split (Pos, Neg) into (GrowPos, GrowNeg) and (PrunePos, PruneNeg)
Rule := GrowRule(GrowPos, GrowNeg)
Rule := PruneRule(Rule, PrunePos, PruneNeg)
if the error rate of Rule on (PrunePos, PruneNeg) exceeds 50%
then return Ruleset
else add Rule to Ruleset
remove examples covered by Rule from (Pos, Neg)
endif
endwhile
return Ruleset
end
Florin Radulescu, Note de curs
79 DMDW-4
Summary
This course presented:
What is supervised learning: definitions, data formats
and approaches.
Evaluation of classifiers: accuracy and other error
measures and evaluation methods: holdout set, cross
validation, bootstrap and scoring and ranking.
Decision trees building and two algorithms developed
by Ross Quinlan (ID3 and C4.5) .
Rule induction systems
Next week: Supervised learning – part 2.
80 DMDW-4
References
[Liu 11] Bing Liu, 2011. Web Data Mining, Exploring Hyperlinks,
Contents, and Usage Data, Second Edition, Springer, chapter 3.
[Han, Kamber 06] Jiawei Han and Micheline Kamber, Data
Mining: Concepts and Techniques, 2nd ed., The Morgan
Kaufmann Series in Data Management Systems, Jim Gray,
Series Editor Morgan Kaufmann Publishers, March 2006. ISBN
1-55860-901-6
[Sanderson 08] Robert Sanderson, Data mining course notes,
Dept. of Computer Science, University of Liverpool 2008,
Classification: Evaluation
http://www.csc.liv.ac.uk/~azaroth/courses/current/comp527/lectur
es/comp527-13.pdf
81 DMDW-4
References
[Quinlan 86] Quinlan, J. R. 1986. Induction of Decision Trees.
Mach. Learn. 1, 1 (Mar. 1986), 81-106,
http://www.cs.nyu.edu/~roweis/csc2515-
2006/readings/quinlan.pdf
[Wikipedia] Wikipedia, the free encyclopedia, en.wikipedia.org
[Microsoft] Lift Chart (Analysis Services - Data Mining),
http://msdn.microsoft.com/en-us/library/ms175428.aspx
[Cohen 95] William W. Cohen, Fast Effective Rule Induction, in
“Machine Learning: Proceedings of the Twelfth International
Conference” (ML95),
http://sci2s.ugr.es/keel/pdf/algorithm/congreso/ml-95-ripper.pdf
82 DMDW-4
Supervised Learning
- Part 2 -
Road Map
2 DMDW-5
CAR definition
IF:
I is a set of items, I = {i1, i2, …, in},
C a set of classes (C ∩ I = ∅), and
T a set of transactions, T = {t1, t2, …, tm}, where
each transaction is labeled with a class label c ∈
C,
THEN:
a class association rule (CAR) is a construction
with the following syntax:
X → y
where X ⊆ I and y ∈ C.
Florin Radulescu, Note de curs
3 DMDW-5
Example: Dataset
4 DMDW-5
Support and confidence
5 DMDW-5
Example: support and confidence
6 DMDW-5
Using CARs
7 DMDW-5
Strongest rule
8 DMDW-5
Strongest rule
9 DMDW-5
Strongest rule: example
10 DMDW-5
Subset of rules
This method is used in Classification Based on
Associations (CBA). In this case, having a
training dataset D and a set of CARs R, the
objectives are:
A. to order R using their support and
confidence, R = {r1, r2, …, rn}:
1. First rules with highest confidence
2. For the same confidence use the support to
order the rules
3. For the same support and confidence order by
rule generation-time (rules generated first are
‘greater’ than rules generated later).
Florin Radulescu, Note de curs
11 DMDW-5
Subset of rules
B. to select a subset S of R covering D:
1. Start with an empty set S
2. Consider the ordered rules from R in sequence: for
each rule r,
if D ≠ ∅ and r correctly classifies at least one example
in D, then add r at the end of S and remove the covered
examples from D.
3. Stop when D is empty
4. Add the majority class as default classification.
The result is:
Classifier = <ri1, ri2, …, rik, majority-class>
Florin Radulescu, Note de curs
12 DMDW-5
Using CARs
13 DMDW-5
Build new attributes (features)
In this approach the training dataset is enriched with
new attributes, one for each CAR:
FOREACH transaction
IF transaction is covered by the left part of the CAR
THEN the value of the attribute is 1 (or TRUE)
ELSE the value of the new attribute is 0 (or FALSE)
ENDIF
ENDFOR
There are also other methods to build classifiers using
a set of CARs, for example grouping rules and
measuring the strength of each group, etc.
14 DMDW-5
Use of association rules
Association rules (not CARs) may be used in
recommendation systems:
The rules are ordered by their confidence and
support and then may be used, considering them
in this order, for labeling new examples
Labels are not classes but other items
(recommendations).
For example, based on a set of association
rules containing books, the system may
recommend new books to customers based
on their previous orders.
Florin Radulescu, Note de curs
15 DMDW-5
Road Map
16 DMDW-5
Naïve Bayes: Overview
This approach is a probabilistic one.
The algorithms based on Bayes theorem compute
for each test example not a single class but a
probability of each class in C (the set of classes).
If the dataset has k attributes, A1, A2, …, Ak, the
objective is to compute for each class c ∈ C = {c1,
c2, …, cn} the probability that the test example (a1,
a2, …, ak) belongs to the class c:
Pr(Class = c | A1 = a1, …, Ak = ak)
If classification is needed, the class with the
highest probability may be assigned to that
example.
Florin Radulescu, Note de curs
17 DMDW-5
Bayes theorem
18 DMDW-5
Example
“Students in PR106 are 60% from the AI
M.Sc. module and 40% from other
modules.
20% of the students are placed in the first
2 rows of seats but for AI this percent is
30%.
When the dean enters the class and sits
somewhere in the first 2 rows, near a
student, compute the probability that its
neighbor is from AI?”
Florin Radulescu, Note de curs
19 DMDW-5
Example
“Students in PR106 are 60% from the AI M.Sc. module and 40% from other modules. 20% of the
students are placed in the first 2 rows of seats but for AI this percent is 30%. When the dean
enters the class and sits somewhere in the first 2 rows, near a student, compute the probability
that its neighbor is from AI?”
1. Pr(AI) = 0.6
2. Pr(2 rows | AI) = 0.3
3. Pr(2 rows) = 0.2
So:
Pr(AI | 2 rows) = Pr(2 rows | AI)*Pr(AI) /
Pr(2 rows) = 0.3*0.6/0.2 = 0.9 or 90%.
Florin Radulescu, Note de curs
20 DMDW-5
Building classifiers
21 DMDW-5
Building classifiers
Making the following assumption: “all attributes
are conditionally independent given the class
C=cj” then:
22 DMDW-5
Building classifiers
23 DMDW-5
Building classifiers
24 DMDW-5
Example
25 DMDW-5
Example
Pr(Yes) = 2/8 Pr(No) = 6/8
Pr(Overcast | C = Yes) = 1/2 Pr(Weak | C = Yes) = 2/2
Pr(Overcast | C = No) = 2/6 Pr(Weak | C = No) = 1/6
Pr(Sunny | C = Yes) = 1/2 Pr(Strong| C = Yes) = 0/2
Pr(Sunny | C = No) = 1/6 Pr(Strong| C = No) = 3/6
Pr(Rain | C = Yes) = 0/2 Pr(Absent| C = Yes) = 0/2
Pr(Rain | C = No) = 3/6 Pr(Absent| C = No) = 2/6
26 DMDW-5
Example
For C = Yes
For C = No
27 DMDW-5
Special case: division by 0
Sometimes a class does not occur with a specific
attribute value.
In that case one term Pr(Ai = ai | C = cj) is zero, so the
above expression for probabilities of each class
evaluates to 0/0.
For avoiding this situation, the expression:
must be modified.
(a = number of training examples with Ai = ai and C = cj
and b = number of training examples with C = cj )
Florin Radulescu, Note de curs
28 DMDW-5
Special case: division by 0
where:
s = 1 / Number of examples in the training set
r = Number of distinct values for Ai
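A minimal sketch of the corrected estimate written with the a, b, s and r above; this particular form follows [Liu 11] and should be read as an illustration, not as the only possible correction:

# Smoothed (Laplace-style) estimate used to avoid zero probabilities:
#   Pr(Ai = ai | C = cj) = (a + s) / (b + s * r)
def smoothed_conditional(a, b, n_train, r):
    """a: examples with Ai = ai and C = cj, b: examples with C = cj,
    n_train: size of the training set, r: number of distinct values of Ai."""
    s = 1.0 / n_train
    return (a + s) / (b + s * r)

# Even when the raw count a is 0, the estimate stays strictly positive:
print(smoothed_conditional(a=0, b=2, n_train=8, r=3))   # small, but not 0
print(smoothed_conditional(a=2, b=2, n_train=8, r=3))   # slightly below the raw 2/2 = 1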
29 DMDW-5
Example
For C = Yes
For C = No
30 DMDW-5
Special case: values
31 DMDW-5
Road Map
32 DMDW-5
SVM: Overview
This course presents only the general
idea of the Support Vector Machines (SVM)
classification method.
SVMs are described in detail in many
books, for example [Liu
11] or [Han, Kamber 06].
The method was discovered in the Soviet Union
in the '70s by Vladimir Vapnik and was developed
in the USA after Vapnik joined AT&T Bell Labs in
the early '90s (see [Cortes, Vapnik 95]).
Florin Radulescu, Note de curs
33 DMDW-5
SVM: Model
34 DMDW-5
SVM: Model
A possible classifier is a linear function:
f(X) = <w X> + b
such as:
where:
w is a weight vector,
<w X> is the dot product of vectors w and X,
b is a real number and
w and b may be scaled up or down as shown below.
Florin Radulescu, Note de curs
35 DMDW-5
SVM: Model
The meaning of f is that the hyperplane
< w X> + b = 0
separates the points of the training set D in two:
one half of the space contains the positive
values and
the other half the negative values in D (like
hyperplanes H1 and H2 in the next figure).
All test examples can now be classified using f:
the sign of f gives the label for the example.
Florin Radulescu, Note de curs
36 DMDW-5
Figure 1
Source: Wikipedia
37 DMDW-5
Best hyperplane
SVM tries to find the ‘best’ hyperplane of that
form.
The theory shows that the best hyperplane is the
one maximizing the so-called margin (the
minimum orthogonal distance between a
positive and a negative point from the training set)
– see the next figure for an example.
38 DMDW-5
Figure 2
Source: Wikipedia
39 DMDW-5
The model
Consider X+ and X- the nearest positive and negative
points for the hyperplane
<w X> + b = 0
Then there are two other parallel hyperplanes, H+ and
H- passing through X+ and X- and their expression is:
H+ : <w X> + b = 1
H- : <w X> + b = -1
These two hyperplanes are drawn with dotted lines in Figure
1. Note that w and b must be scaled such that:
<w Xi> + b ≥ 1 for yi = 1
<w Xi> + b ≤ -1 for yi = -1
40 DMDW-5
The model
The margin is the distance between these two
planes and may be computed using vector space
algebra obtaining:
41 DMDW-5
Definition: separable case
When positive and negative points are linearly
separable, the SVM definition is the following:
Having a training data set D = {(X1, y1), (X2, y2), ..., (Xk, yk)}
Minimize the value of expression (1) above
With the restriction yi (<w Xi> + b) ≥ 1, knowing the value of yi: +1
or -1.
This optimization problem is solvable by rewriting the
above inequality using a Lagrangian formulation and
then finding the solution using Karush-Kuhn-Tucker (KKT)
conditions.
This mathematical approach is beyond the scope of this
course.
Florin Radulescu, Note de curs
42 DMDW-5
Non-linear separation
In many situations there is no hyperplane for
separation between the positive and negative
examples.
In such cases it is possible to map the
training data points (examples) into another
space, a higher dimensional one.
Here the data points may be linearly separable.
The mapping function φ takes examples (vectors)
from the input space X and maps them into the so-
called feature space F:
φ : X → F
Florin Radulescu, Note de curs
43 DMDW-5
Non-linear separation
44 DMDW-5
Figure 3
Source: Wikipedia
45 DMDW-5
Kernel functions
But how can we find this mapping function?
In solving the optimization problem for finding the
linear separation hyperplane in the new feature space
F, all terms containing training examples are only of
the form φ(Xi) · φ(Xj).
By replacing this dot product with a function in both Xi
and Xj, the need for finding φ disappears. Such a
function is called a kernel function:
K(Xi, Xj) = φ(Xi) · φ(Xj)
For finding the separation hyperplane in F we must
only replace all dot products with the chosen kernel
function and then proceed with the optimization
problem as in the separable case.
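A hedged illustration using scikit-learn (a library outside the course material; SVC and make_circles are its names, not the course's): a kernel SVM separates data that is not linearly separable in the input space, without ever computing φ explicitly.

# Assumes scikit-learn is installed; purely illustrative.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)   # a hyperplane in the input space
rbf_svm = SVC(kernel="rbf").fit(X, y)         # K(Xi, Xj) replaces phi(Xi) . phi(Xj)

print("linear kernel accuracy:", linear_svm.score(X, y))   # poor on concentric circles
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))      # close to 1.0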
Florin Radulescu, Note de curs
46 DMDW-5
Kernel functions
47 DMDW-5
Other aspects concerning SVMs
SVM deals with continuous real values for
attributes.
When categorical attributes exist in the training
data, a conversion to real values is needed.
When more than two classes are needed,
SVM can be used repeatedly:
the first use separates one class, the second use
separates the second class, and so on. For N
classes, N-1 runs are needed.
SVMs are a very good method for high-
dimensional data classification.
Florin Radulescu, Note de curs
48 DMDW-5
Road Map
49 DMDW-5
kNN
K-nearest neighbor (kNN) does not produce a
model but is a simple method for determining the
class of an example based on the labels of its
neighbors belonging to the training set.
For running the algorithm a distance function is
needed for computing the distance from the test
example to the examples in the training set.
A function f(x, y) may be used as distance function if
four conditions are met:
o f(x, y) ≥ 0
o f(x, x) = 0
o f(x, y) = f(y, x)
o f(x, y) ≤ f(x, z) + f(z, y)
Florin Radulescu, Note de curs
50 DMDW-5
Algorithm
Input:
A dataset D containing labeled examples (the training set)
A distance function f for measuring the dissimilarity between
two examples
An integer k – parameter - telling how many neighbors are
considered
A test example t
Output:
The class label of t
Method:
Use f to compute the distance between t and each point in D
Select nearest k points
Assign t the majority class from the set of k nearest
neighbors.
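A minimal sketch of this method, assuming numeric examples and the Euclidean distance as f (the small training set below is made up):

# Minimal kNN sketch.
from collections import Counter
from math import dist   # Euclidean distance (Python 3.8+)

def knn_classify(training_set, t, k, f=dist):
    """training_set: list of (point, label); t: unlabeled test example."""
    neighbors = sorted(training_set, key=lambda item: f(item[0], t))[:k]
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]   # majority class among the k nearest

D = [((1, 1), "Red"), ((2, 1), "Red"), ((1, 2), "Red"),
     ((6, 6), "Blue"), ((7, 6), "Blue"), ((6, 7), "Blue"), ((7, 7), "Blue")]
print(knn_classify(D, (2, 2), k=3))   # Red
print(knn_classify(D, (6, 5), k=3))   # Blue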
Florin Radulescu, Note de curs
51 DMDW-5
Example
K=3 Red
K = 5 Blue
52 DMDW-5
Road Map
53 DMDW-5
Ensemble methods
54 DMDW-5
Bagging
55 DMDW-5
Example
Original dataset a b c d e f
Training set 1 b b b c e f
Training set 2 b b c c d e
Training set 3 a b c c d f
56 DMDW-5
Bagging
Bagging consists in:
Starting with the original dataset, build n training
datasets by sampling with replacement (bootstrap
samples)
For each training dataset build a classifier using the
same learning algorithm (called weak classifiers).
The final classifier is obtained by combining the results
of the weak classifiers (by voting for example).
Bagging helps to improve the accuracy for unstable
learning algorithms: decision trees, neural networks.
It does not help for kNN, Naïve Bayesian classification
or CARs.
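A minimal sketch of bagging; base_learner is a placeholder for any learning algorithm that returns a classifier as a callable:

# Bagging sketch: bootstrap samples + voting.
import random
from collections import Counter

def bagging(dataset, base_learner, n_classifiers):
    """dataset: list of (example, label); returns a voting classifier."""
    weak_classifiers = []
    for _ in range(n_classifiers):
        # sampling with replacement (bootstrap sample of the same size)
        sample = [random.choice(dataset) for _ in range(len(dataset))]
        weak_classifiers.append(base_learner(sample))
    def vote(example):
        predictions = [clf(example) for clf in weak_classifiers]
        return Counter(predictions).most_common(1)[0][0]
    return vote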
Florin Radulescu, Note de curs
57 DMDW-5
Boosting
Boosting consists in building a sequence of
weak classifiers and adding them in the
structure of the final strong classifier.
The weak classifiers are weighted based on
the weak learners' accuracy.
Also, data is reweighted after each weak
classifier is built, so that examples that are
incorrectly classified gain some extra weight.
The result is that the next weak classifiers in
the sequence focus more on the examples
that previous weak classifiers missed.
Florin Radulescu, Note de curs
58 DMDW-5
Random forest
Random forest is an ensemble classifier consisting of a set
of decision trees. The final classifier outputs the modal value of the
classes output by the individual trees.
The algorithm is the following:
1. Choose T - the number of trees to grow (e.g. 10).
2. Choose m - the number of variables used to split each node, m ≤ M,
where M is the number of input variables.
3. Grow T trees. When growing each tree do the following:
Construct a bootstrap sample from training data with
replacement and grow a tree from this bootstrap sample.
When growing a tree at each node select m variables at
random and use them to find the best split.
Grow the tree to a maximal extent. There is no pruning.
4. Predict new data by aggregating the predictions of the trees (e.g.
majority votes for classification, average for regression).
59 DMDW-5
Summary
This course presented:
Classification using class association rules: CARs for
building classifiers and using CARs for building new
attributes (features) of the training dataset.
Naïve Bayesian classification: Bayes theorem, Naïve
Bayesian algorithm for building classifiers.
An introduction to support vector machines (SVMs):
model, definition, kernel functions.
K-nearest neighbor method for classification
Ensemble methods: Bagging, Boosting, Random
Forest
Next week: Unsupervised learning – part 1
Florin Radulescu, Note de curs
60 DMDW-5
References
[Liu 11] Bing Liu, 2011. Web Data Mining, Exploring Hyperlinks,
Contents, and Usage Data, Second Edition, Springer, chapter 3.
[Han, Kamber 06] Jiawei Han, Micheline Kamber, Data Mining:
Concepts and Techniques, Second Edition, Morgan Kaufmann
Publishers, 2006
[Cortes, Vapnik 95] Cortes, Corinna; and Vapnik, Vladimir N.;
"Support-Vector Networks", Machine Learning, 20, 1995.
http://www.springerlink.com/content/k238jx04hm87j80g/
[Wikipedia] Wikipedia, the free encyclopedia, en.wikipedia.org
61 DMDW-5
Unsupervised Learning
- Part 1 -
Road Map
2 DMDW-6
Supervised vs. unsupervised
In the previous chapter (supervised learning), data
points (examples) are of two types:
Labeled examples (by some experts); these
examples are used as training set and sometimes,
part of them as validation set.
Unlabeled examples; these examples, members of
the so-called test set, are new data and the objective
is to label them in the same way the training set
examples are labeled.
Labeled examples are used to build a model or
method (called classifier) and this classifier is the
‘machine’ used to label further examples
(unlabeled examples from the test set).
Florin Radulescu, Note de curs
3 DMDW-6
Supervised learning
So the starting points of supervised learning are:
1. The set of classes (labels) is known. These classes
reflect the inner structure of the data, so this structure
is previously known in the case of supervised learning
2. Some labeled examples (at least few for each class)
are known. So supervised learning may be
characterized also as learning from examples. The
classifier is built entirely based on these labeled
examples.
3. A classifier is a model or method for expanding the
experience kept in the training set to all further new
examples.
4. Based on a validation set, the obtained classifier may
be evaluated (accuracy, etc).
Florin Radulescu, Note de curs
4 DMDW-6
Unsupervised learning
In unsupervised learning:
The number of classes (called clusters) is not
known. One of the objectives of clustering is
also to determine this number.
The characteristics of each cluster (e.g. its
center, number of points in cluster, etc) are
not known. All these characteristics will be
available only at the end of the process.
There are no examples or other knowledge
related to the inner structure of the data to
help in building the clusters
Florin Radulescu, Note de curs
5 DMDW-6
Unsupervised learning
6 DMDW-6
Unsupervised learning
Because there are no labeled examples,
there is no possible evaluation of the result
based on previously known information.
Cluster evaluation is made using computed
characteristics of the resulting clusters.
Unsupervised learning is a class of Data
mining algorithms including clustering,
association rules (already presented), self
organizing maps, etc. This chapter focuses
on clustering.
Florin Radulescu, Note de curs
7 DMDW-6
Clustering
8 DMDW-6
Clustering
Input:
2. A distance function (dissimilarity
measure) that can be used to compute
the distance between any two points.
Low valued distance means ‘near’, high
valued distance means ‘far’.
Note: If a distance function is not
available, the distance between any two
points in D must be provided as input.
Florin Radulescu, Note de curs
9 DMDW-6
Clustering
Input:
3. For most of the algorithms, the items
are represented by their coordinates in a k-
dimensional space, called attribute values,
as every dimension defines an attribute for
the set of points.
In this case the distance function may be
the Euclidean distance or other attribute
based distance.
Florin Radulescu, Note de curs
10 DMDW-6
Clustering
Input:
4. Some algorithms also need a predefined
value for the number of clusters in the
produced result.
Output:
A set of object (point) groups called clusters,
where points in the same cluster are near
one another and points from different
clusters are far from one another, considering
the distance function.
Florin Radulescu, Note de curs
11 DMDW-6
Example
Three clusters in 2D
12 DMDW-6
Features
Each cluster may be described by its:
Centroid – is the Euclidean center of the cluster,
computed as the mass center of the (equally
weighted) points in the cluster.
When the cluster is not in a Euclidean space, the
centroid cannot be determined – there are no
coordinates. In that case a clustroid (or medoid) is
used as the center of a cluster.
The clustroid/medoid is a point in the cluster, the
one best approximating its center.
Florin Radulescu, Note de curs
13 DMDW-6
Features
14 DMDW-6
Road Map
15 DMDW-6
Classification
16 DMDW-6
Centroid-based
17 DMDW-6
Example
K-Means:
18 DMDW-6
Hierarchical clustering
19 DMDW-6
Example
20 DMDW-6
Distribution-based clustering
21 DMDW-6
Example
22 DMDW-6
Density-based clustering
23 DMDW-6
Example
24 DMDW-6
Hard vs. Soft clustering
Based on the number of clusters for each
point, clustering techniques may be classified
in:
1. Hard clustering. In that case each point
belongs to exactly one cluster.
2. Soft clustering. These techniques (called
also fuzzy clustering) compute for each data
point and each cluster a membership level
(the level or degree of membership of that
point to that cluster). FLAME algorithm is of
this type.
Florin Radulescu, Note de curs
25 DMDW-6
Hierarchical clustering
Hierarchical clustering algorithms can be
further classified in:
Agglomerative hierarchical clustering: starts with
a cluster for each point and merges the closest
clusters until a single cluster is obtained (bottom-
up).
Divisive hierarchical clustering: starts with a
cluster containing all points and split clusters in
two, based on density or other measure, until
single data point clusters are obtained (top-
down).
Florin Radulescu, Note de curs
26 DMDW-6
Dendrogram
In both cases a dendrogram is obtained.
The dendrogram is the tree resulting from the
merge or split action described above.
For obtaining some clusters, the dendrogram may
be cut at some level.
For the next example, cutting with the upper
horizontal line produces the clusters {(a), (bc),
(de), (f)}.
The second cut produces {(a), (bc), (def)}. Based
on clusters’ characteristics (see cluster evaluation
next week) the best cut may be determined.
Florin Radulescu, Note de curs
27 DMDW-6
Example
28 DMDW-6
Agglomerative hierarchical algorithm
29 DMDW-6
Method
START with a cluster for each point of D.
COMPUTE the distance between any two clusters
WHILE the number of clusters is greater than 1
DO
DETERMINE the nearest two clusters
MERGE clusters in a new cluster c
COMPUTE the distances from c to the other
clusters
ENDWHILE
Florin Radulescu, Note de curs
30 DMDW-6
Distance between clusters
For determining the distance between two
clusters several methods can be used:
1. Single link method: the distance between
two clusters is the minimum distance
between a point in the first cluster and a
point in the second cluster.
2. Complete link method: the distance
between two clusters is the maximum
distance between a point in the first cluster
and a point in the second cluster.
Florin Radulescu, Note de curs
31 DMDW-6
Distance between clusters
32 DMDW-6
Road Map
33 DMDW-6
Algorithm description
34 DMDW-6
Conditions
35 DMDW-6
Conditions
36 DMDW-6
K-means algorithm
Input:
A dataset D = {P1, P2, …, Pm} containing m
points in an n-dimensional Euclidian space
and a distance function.
k: the number of clusters to be obtained
Output:
The k clusters obtained
37 DMDW-6
Method
1. Choose randomly k points in D as initial centroids
2. REPEAT
3.   FOR (i=1; i<=m; i++)
4.     using the distance function, assign Pi to
       the nearest centroid
5.   END FOR
6.   FOR (i=1; i<=k; i++)
7.     Consider the set of r points assigned to centroid i:
       {Pj1, …, Pjr}
8.     New centroid = (Pj1 + … + Pjr) / r
       // (each point is considered a vector)
9.   END FOR
10. UNTIL stopping criteria are met
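A minimal sketch of this method in Python, using the Euclidean distance and stopping when the centroids no longer move (the first stopping criterion listed on the next slide):

# k-means sketch.
import random
from math import dist

def k_means(D, k, max_iter=100):
    centroids = random.sample(D, k)                 # 1. random initial centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in D:                                 # 3-5. assign each point to the
            i = min(range(k), key=lambda c: dist(p, centroids[c]))  # nearest centroid
            clusters[i].append(p)
        new_centroids = []
        for i in range(k):                          # 6-9. recompute each centroid as
            pts = clusters[i] or [centroids[i]]     # the mean of its assigned points
            new_centroids.append(tuple(sum(xs) / len(pts) for xs in zip(*pts)))
        if new_centroids == centroids:              # centroids stopped moving
            break
        centroids = new_centroids
    return clusters, centroids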
Florin Radulescu, Note de curs
38 DMDW-6
Stopping criteria
Stopping criteria may be:
1. Clusters are not changing from one iteration to
another.
2. Cluster changes are below a given threshold
(e.g. no more than p points are changing the
cluster between two successive iterations).
3. Cluster centroids movement is below a given
threshold (e.g. the sum of distances between
old and new positions for centroids is no more
than d between two successive iterations).
Florin Radulescu, Note de curs
39 DMDW-6
Stopping criteria
40 DMDW-6
Weaknesses
41 DMDW-6
Weaknesses
42 DMDW-6
Example: initial centroids
43 DMDW-6
Weaknesses
44 DMDW-6
Weaknesses
45 DMDW-6
Strengths
46 DMDW-6
Road Map
47 DMDW-6
A distance function must be:
1. Non-negative: f(x, y) ≥ 0
2. Identity: f(x, x) = 0
3. Symmetry: f(x, y) = f(y, x)
4. Triangle inequality:
f(x, y) ≤ f(x, z) + f(z, y)
48 DMDW-6
Distance function
49 DMDW-6
Euclidean distance
Simple
50 DMDW-6
Euclidean distance
51 DMDW-6
Other distance functions
52 DMDW-6
Binary attributes
In some situations all attributes have only two
values: 0 or 1 (positive / negative, yes / no,
true / false, etc).
For these cases the distance function may be
defined based on the following confusion
matrix:
a = number of attributes having 1 for x and y
b = number of attributes having 1 for x and 0 for y
c = number of attributes having 0 for x and 1 for y
d = number of attributes having 0 for x and y
Florin Radulescu, Note de curs
53 DMDW-6
Confusion matrix
                      Data point y
                      1        0
Data point x   1      a        b        a+b
               0      c        d        c+d
                      a+c      b+d
54 DMDW-6
Symmetric binary
55 DMDW-6
Asymmetric binary
56 DMDW-6
Nominal attributes
57 DMDW-6
Nominal attributes
58 DMDW-6
Cosine distance
Consider two points, x = (x1, x2, …, xk) and y = (y1,
y2, …, yk), in a space with k dimensions.
In this case each point may be viewed as a vector
starting from the origin of axis and pointing to x or
y.
The angle between these two vectors may be
used for measuring the similarity: if the angle is 0
or near this value then the points are similar.
Because the distance is a measure of the
dissimilarity, the cosine of the angle – cos(θ) – may
be used in the distance function as follows:
Dist(x, y) = 1 – cos(θ)
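A minimal sketch of this distance for numeric vectors:

# Cosine distance: Dist(x, y) = 1 - cos(theta).
from math import sqrt

def cosine_distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)

x = (1, 2, 0, 1)
y = (2, 4, 0, 2)                          # y = 2 * x, like the "copies of a document" example
print(cosine_distance(x, y))              # ~0.0 (angle 0, 100% similar)
print(cosine_distance((1, 0), (0, 1)))    # 1.0 (orthogonal vectors)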
Florin Radulescu, Note de curs
59 DMDW-6
Example
[Figure: axes Dimension 1 and Dimension 2 with a vector pointing to y, illustrating the angle used by the cosine distance]
60 DMDW-6
Cosine distance
61 DMDW-6
Cosine distance: Example
If a document is considered a bag of words, each
word of the considered vocabulary becomes a
dimension. On a dimension, a document has the
coordinate:
1 or 0 depending on the presence or absence of the
word from the document
or
A natural number, equal with the number of
occurrences of the word in the document.
Considering a document y containing two or more
copies of another document x, the angle between
x and y is zero so the cosine distance is also equal
to 0 (the documents are 100% similar).
Florin Radulescu, Note de curs
62 DMDW-6
No Euclidean space case
63 DMDW-6
Edit distance
64 DMDW-6
Edit distance example
Consider strings x and y:
x = 'Mary had a little lamb'
y = 'Baby: had a little goat'
Operations for transforming x in y:
2 deletions and 3 insertions to transform
'Mary' in 'Baby:'
3 deletions and 3 insertions to transform
'lamb' in 'goat'
So the distance is 2+3+3+3 = 11.
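A minimal sketch of this variant of the edit distance, in which the only operations are insertions and deletions of single characters (dynamic programming; it reproduces the value 11 for the example above):

def indel_edit_distance(x, y):
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                        # delete the first i characters of x
    for j in range(n + 1):
        d[0][j] = j                        # insert the first j characters of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1]          # characters match, no operation
            else:
                d[i][j] = 1 + min(d[i - 1][j],     # delete x[i-1]
                                  d[i][j - 1])     # insert y[j-1]
    return d[m][n]

print(indel_edit_distance("Mary had a little lamb", "Baby: had a little goat"))   # 11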
Florin Radulescu, Note de curs
65 DMDW-6
Edit distance formula
66 DMDW-6
Road Map
67 DMDW-6
Data standardization
68 DMDW-6
Interval-scaled
Min-max normalization:
vnew = (v – vmin) / (vmax – vmin)
For positive values the formula is:
vnew = v / vmax
z-score normalization (σ is the standard
deviation):
vnew = (v – vmean) / σ
69 DMDW-6
Interval-scaled
Decimal scaling:
vnew = v / 10^n
70 DMDW-6
Ratio-scaled
Log transform:
vnew = log(v)
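A minimal sketch applying the normalizations above to one attribute (the values and the choice n = 4 are made up for illustration):

from math import log
from statistics import mean, pstdev

v = [200.0, 300.0, 400.0, 600.0, 1000.0]     # made-up attribute values

min_max   = [(x - min(v)) / (max(v) - min(v)) for x in v]   # min-max normalization
positive  = [x / max(v) for x in v]                         # variant for positive values
z_score   = [(x - mean(v)) / pstdev(v) for x in v]          # z-score (sigma = pstdev)
n = 4                                                       # 10**n exceeds max |value|
decimal   = [x / 10 ** n for x in v]                        # decimal scaling
log_scale = [log(x) for x in v]                             # ratio-scaled: log transform

print(min_max)
print(z_score)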
71 DMDW-6
Nominal, ordinal
Nominal attributes:
Use feature construction tricks presented in the
last chapter:
If a nominal attribute has n values, it is replaced
by n new attributes having a 1/0 value (the
attribute has / has not that particular value).
Ordinal attributes:
Values of an ordinal attribute are ordered, so it
can be treated as a numeric one, assigning some
numbers to its values.
Florin Radulescu, Note de curs
72 DMDW-6
Mixed attributes
73 DMDW-6
Convert to a common type
If some attribute type is predominant, all other
attributes are converted to that type
Then use a distance function attached to that
type.
Some conversions make no sense:
Converting a nominal attribute to an interval-
scaled one is not obvious.
How can we convert values such as {sunny, overcast,
rain} into numbers?
Sometimes we can assign a value (for example the
average temperature of a sunny, overcast or rainy
day) but this association is not always productive.
Florin Radulescu, Note de curs
74 DMDW-6
Combine different distances
A distance for each dimension is computed
using an appropriate distance function
Then these distances are combined in a
single one.
If:
d(x, y, i) = the distance between x and y on
dimension i
δ(x, y, i) = 0 or 1 depending on whether the
values of x and y on dimension i are
missing (even only one of them) or not.
Florin Radulescu, Note de curs
75 DMDW-6
Combine different distances
Then:
76 DMDW-6
Summary
This course presented:
A parallel between supervised vs.
unsupervised learning, the definition of
clustering and classifications of clustering
algorithms
The description of the k-means algorithm, one
of the most popular clustering algorithms
A discussion about distance functions
How to handle different types of attributes
Next week: Unsupervised learning – part 2
Florin Radulescu, Note de curs
77 DMDW-6
References
[Liu 11] Bing Liu, 2011. Web Data Mining, Exploring Hyperlinks,
Contents, and Usage Data, Second Edition, Springer, chapter 3.
[Rajaraman, Ullman 10] Mining of Massive Datasets, Anand
Rajaraman, Jeffrey D. Ullman, 2010
[Ullman] Jeffrey Ullman, Data Mining Lecture Notes, 2003-2009,
web page: http://infolab.stanford.edu/~ullman/mining/mining.html
78 DMDW-6
Unsupervised Learning
- Part 2 -
Road Map
2 DMDW-7
k-medoids
The algorithms in this category are similar to
k-means.
The main differences from k-means are:
K-medoids uses a data point as center of a
cluster (such a point is called a medoid). This is
the cluster member best approximating the
cluster center.
Stopping criterion is based not on SSD but on
sum of pairwise dissimilarities (distances).
The best known algorithm of this type is
Partitioning Around Medoids (PAM)
Florin Radulescu, Note de curs
3 DMDW-7
PAM
Input:
A dataset D = {P1, P2, …, Pm} containing m
points in an n-dimensional space and a
distance function between points in that
space.
k: the number of clusters to be obtained
Output:
The k clusters obtained
Florin Radulescu, Note de curs
4 DMDW-7
PAM - Method
1. Randomly choose k points in D as initial medoids: {m1, m2, …, mk}
2. REPEAT
3. FOR (i=1; i<=m; i++)
4. Assign Pi to the nearest medoid
5. END FOR
6. FOR (i=1; i<=k; i++)
7. FOR (j=1; j<=m; j++)
8. IF Pj is not a medoid THEN
9. Configuration(i, j) = swap Pj with mi
10. Compute the cost of the new configuration
11. Reverse the swap
12. END IF
13 END FOR
14 END FOR
15. Select the configuration with the best cost (lowest)
16. UNTIL New configuration = Old configuration
Florin Radulescu, Note de curs
5 DMDW-7
Configuration cost
The main idea is that each medoid may be
swapped with any non-medoid point.
If the new configuration is the best swap, a new
medoid is appointed replacing an old one.
The process continues until no better
configuration is possible.
The cost of a configuration is the sum of the
distances between points and their medoids:
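The formula itself is not reproduced on the slide; below is a minimal sketch of the cost described in words above, the sum of the distances from every point to its nearest medoid (the points used are made up):

from math import dist

def configuration_cost(points, medoids):
    return sum(min(dist(p, m) for m in medoids) for p in points)

points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8)]
print(configuration_cost(points, medoids=[(1, 1), (8, 8)]))   # cost of one configuration
print(configuration_cost(points, medoids=[(1, 1), (2, 1)]))   # a worse (higher) cost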
6 DMDW-7
k-modes
k-modes is designed to be used for points
having categorical (nominal or ordinal)
attributes.
The mode of a dataset is the most frequent
value.
This refers to a dataset containing atomic
values.
In clustering a point is characterized by a set
of attributes, in some cases of different types,
each attribute having a value from its domain.
Florin Radulescu, Note de curs
7 DMDW-7
k-modes
In that case we must redefine the mode for
applying the notion to a set of points.
The definition starts with the expression
returning the number of dissimilarities (like in the
previous course) between two points X and Y in
an n-dimensional space:
8 DMDW-7
The mode
If D = {P1, P2, …, Pm} is a set containing m points
with n attributes (categorical or not), the mode of D
may be defined as a vector (with the same number
of dimensions) Q = (q1, q2, …, qn) that minimizes:
9 DMDW-7
k-means vs. k-modes
10 DMDW-7
Frequency-based method
Let X be a set of categorical objects described by
categorical attributes A1, A2, …, Am.
Let nc(k,j) be the number of objects having category
c(k,j) in attribute Aj and
let fr(Aj = c(k,j) | X) = nc(k,j) / n be the relative frequency of
category c(k,j) in X.
Then:
Theorem: The function D(Q, X) is minimised iff
fr(Aj = qj | X) ≥ fr(Aj = c(k,j) | X) for qj ≠ c(k,j), for all j =
1..m.
11 DMDW-7
Frequency-based method
12 DMDW-7
K-means++
13 DMDW-7
k-means++
K-means++ is not a new clustering algorithm but a
method to select initial centroids:
1. The first centroid is selected randomly from the data
points.
2. For each data point P, compute d = Dist(P, c), the
distance between P and the nearest centroid already
determined.
3. A new centroid is selected using a weighted
probability distribution: the point is chosen with a
probability proportional to d2.
4. Repeat steps 2 and 3 until k centroids are selected.
After initial centroid selection, usual k-means algorithm
may be run for clustering the dataset.
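A minimal sketch of this seeding procedure (D is a list of numeric points; random.choices implements the weighted selection of step 3):

import random
from math import dist

def k_means_plus_plus_init(D, k):
    centroids = [random.choice(D)]                    # 1. first centroid: a random data point
    while len(centroids) < k:
        # 2. squared distance from each point to its nearest centroid so far
        d2 = [min(dist(p, c) for c in centroids) ** 2 for p in D]
        total = sum(d2)
        weights = [w / total for w in d2]
        # 3. pick the next centroid with probability proportional to d^2
        centroids.append(random.choices(D, weights=weights, k=1)[0])
    return centroids                                  # then run the usual k-means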
Florin Radulescu, Note de curs
14 DMDW-7
Road Map
15 DMDW-7
FastMap
There are cases when there is no Euclidean
space and only the distances between two
points are available (given as input or by a
distance function specific to the dataset).
In that case all the algorithms assuming the
existence of coordinates and of a Euclidean
space cannot be used.
This paragraph presents a solution to the
above problem: associate a Euclidean
space with few dimensions to such a
dataset.
Florin Radulescu, Note de curs
16 DMDW-7
FastMap
[Figure: three points A, B and C and the distances between them]
Florin Radulescu, Note de curs
17 DMDW-7
FastMap
If N is big, computations are slow. So we
need to place N points into a space with k
dimensions, where k << N.
This process of creating a Euclidean space
knowing only the distances between any two
points is called multidimensional scaling.
There are many algorithms for this, the best
known being FastMap, MetricMap, and
Landmark MDS (LMDS).
These algorithms approximate classical MDS
using a subset of the data and fitting the
remainder to the solution.
18 DMDW-7
FastMap
19 DMDW-7
x = (D²(a, c) + D²(a, b) – D²(b, c)) / (2·D(a, b))
[Figure: points a, b and c with distances D(a, b), D(a, c) and D(b, c); x is the distance from a to the projection of c on the line (a, b)]
20 DMDW-7
D'² = D² – (x – y)²
[Figure: points a and b with first coordinates x and y; D is the original distance and D' the distance in the remaining dimensions]
21 DMDW-7
Weaknesses
This process stops after computing the
desired number of coordinates for every point
or when no more axes can be found.
Weakness:
For real data the problem is that, if the
distance matrix is not a Euclidean one, the
value for D'² may be negative!
In that case the only way to continue is to
assume D’ is 0.
But this assumption leads to propagated
errors.
Florin Radulescu, Note de curs
22 DMDW-7
Example
23 DMDW-7
Example
[Chart: the average of |Dreal – Dcomputed| over all pairs of nodes (vertical axis, 0.00 to 4.00), plotted against the STEP number (1 to 39)]
24 DMDW-7
Road Map
25 DMDW-7
Cluster evaluation
After performing the clustering process, the
result must be evaluated in order to validate it
(or not).
Because real clusters are not known for a
test dataset, this is a hard problem.
Some methods were developed for this
purpose.
These methods are designed not for
evaluating the clustering results on a
particular dataset but for evaluating the
quality of the clustering algorithm.
Florin Radulescu, Note de curs
26 DMDW-7
Methods
27 DMDW-7
1. User inspection
In this case some experts inspect the results of the clustering
algorithm and rate them.
User inspection may include:
Evaluate cluster centroids
Evaluate the distribution of points in clusters
Evaluate clusters by their representation (sometimes clusters
may be represented as a decision tree, for example).
Test some points to see if they really belong to the assigned
cluster. This can be done when clustering documents: after
clustering, some documents in each cluster are analyzed to see
if they are in the same category.
This method is hard to use for numerical data and huge volumes of
information because the user inspection is based on the experience
and intuition of the experts.
Also, this method is subjective and may lead sometimes to a wrong
verdict.
Florin Radulescu, Note de curs
28 DMDW-7
2. Ground truth
(comparison with the real situation)
29 DMDW-7
Entropy
30 DMDW-7
Purity
31 DMDW-7
Ground truth
These measures are usually used when
comparing two clustering algorithms on the
same labeled dataset.
Other measures that can be used are
precision, recall and F-score. The
expressions for these measures were also
presented in the previous chapter.
The real problem is that an algorithm may
perform well on a dataset and not so well on
another dataset.
Florin Radulescu, Note de curs
32 DMDW-7
3. Cohesion and separation
Other measures that can be used to evaluate the
clustering algorithm are based on internal
information:
1. Intra-cluster cohesion measures the
compactness of the clusters.
Using the sum of squares of the distances (SSD)
from each point to its cluster center we obtain a
measure of this cohesion.
33 DMDW-7
Cohesion and separation
34 DMDW-7
4. Silhouette
For each point D in a cluster, a 'silhouette' value
can be computed, and this value is at the same
time:
a measure of the similarity of D to the points of
its cluster
a measure of the dissimilarity of D to the points
of other clusters.
Values are between -1 and 1. Positive values
denote that D is similar to the points in its
cluster and negative ones that D is not well
assigned (it would be better assigned to another cluster).
Florin Radulescu, Note de curs
35 DMDW-7
Silhouette
Silhouette was introduced by Peter J.
Rousseeuw in a 1987 article: "Silhouettes: a
Graphical Aid to the Interpretation and
Validation of Cluster Analysis", published in
Computational and Applied Mathematics.
Computing s(D) - the silhouette value of D -
implies:
Compute a(D) = the average distance from D to
all other points in its cluster.
Compute b(D) = the lowest average distance
from D to the points of any other cluster.
Florin Radulescu, Note de curs
36 DMDW-7
Silhouette
Then:
s(D) = (b(D) - a(D)) / max(a(D), b(D))
or, equivalently:
s(D) = 1 - a(D)/b(D)     if a(D) < b(D)
s(D) = 0                 if a(D) = b(D)
s(D) = b(D)/a(D) - 1     if a(D) > b(D)
So: -1 <= s(D) <= 1
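A minimal Python sketch of s(D) for one point, directly following the definitions of a(D) and b(D) above (Euclidean distance assumed; names are illustrative):

```python
import numpy as np

def silhouette_value(i, points, labels):
    """s(D) for point i: a(D) = average distance to the other points of its
    own cluster, b(D) = lowest average distance to the points of any other
    cluster, s(D) = (b - a) / max(a, b)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(points - points[i], axis=1)
    same = (labels == labels[i]) & (np.arange(len(points)) != i)
    a = dists[same].mean()
    b = min(dists[labels == c].mean() for c in set(labels.tolist()) if c != labels[i])
    return (b - a) / max(a, b)

pts = [[0, 0], [0, 1], [5, 5], [5, 6]]
lbl = [0, 0, 1, 1]
print(round(silhouette_value(0, pts, lbl), 3))   # close to 1: well assigned
```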
Florin Radulescu, Note de curs
37 DMDW-7
Silhouette
The average value of s for the points of a
cluster is a measure of the cohesion of the
points in the cluster.
Also, the average value of s for all the points
of the dataset is a measure of the
performance of the clustering process.
For k-means, if k is too large or too small, some of the clusters have narrower silhouettes than the rest. By examining the clusters' silhouettes we can determine the best value for k.
Florin Radulescu, Note de curs
38 DMDW-7
5. Indirect evaluation
39 DMDW-7
Road Map
40 DMDW-7
Fuzzy clustering
Fuzzy logic was first proposed by Lotfi A. Zadeh of the
University of California at Berkeley in a 1965 paper.
41 DMDW-7
Fuzzy clustering
42 DMDW-7
The model
Input:
A dataset containing n elements (points), D =
{e1, e2, …, en}.
The number of clusters C
A level of cluster fuzziness, m
Output:
A list of centroids {c1, c2, …, cC}
A matrix U = [uij], i = 1…n, j = 1…C, and uij =
the level/degree of membership of element ei
to the cluster cj.
Florin Radulescu, Note de curs
43 DMDW-7
The model
The process tries to minimize the objective function (the standard fuzzy C-means objective):
J_m = Σ_{i=1..n} Σ_{j=1..C} u_ij^m · d_ij²
where:
uij and ci are as described above.
dij is the distance from the element ei to the centroid cj
m is the fuzziness factor and in many cases the
default value is 2.
If m is close or equal to 1, uij is close to 0 or 1 so a
non-fuzzy solution is obtained (as in k-means).
When m is increased from 2 to bigger values, uij
have lower values and the clusters are fuzzier.
Florin Radulescu, Note de curs
44 DMDW-7
The algorithm
1. Choose randomly initial cluster centers
2. REPEAT
3. Compute all dij values
4. Compute new values for the membership levels uij:
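The update formulas themselves are not reproduced on the slide; the sketch below uses the standard fuzzy C-means updates from the literature (u_ij = 1 / Σ_k (d_ij / d_ik)^(2/(m-1)) and centroids weighted by u_ij^m, as in [Bezdek 81]). It is a minimal illustration, not an optimized implementation:

```python
import numpy as np

def fuzzy_c_means(X, C, m=2.0, n_iter=100, eps=1e-5, seed=0):
    """Alternate the standard membership and centroid updates until the
    membership matrix U stabilizes (or n_iter is reached)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), C, replace=False)]   # step 1: random centers
    U = np.zeros((len(X), C))
    for _ in range(n_iter):
        # step 3: distances d_ij from every element to every centroid
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # step 4: u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1))
        U_new = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1)), axis=2)
        # centroids weighted by u_ij^m
        W = U_new ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        diff = np.abs(U_new - U).max()
        U = U_new
        if diff < eps:          # one possible stopping criterion
            break
    return centers, U

centers, U = fuzzy_c_means([[0, 0], [0, 1], [5, 5], [6, 5]], C=2)
print(np.round(U, 2))
```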
45 DMDW-7
Stopping criteria
46 DMDW-7
Road Map
47 DMDW-7
Clusters and holes
48 DMDW-7
Clusters and holes
49 DMDW-7
Decision tree clusters
50 DMDW-7
Decision tree clusters
[Figure: decision tree - first split at 6.5 (values >= 6.5 lead to the BLACK cluster); for values < 6.5 a split on y at 3 follows (y <= 3: BLUE cluster, y > 3: RED cluster).]
51 DMDW-7
Decision tree clusters
52 DMDW-7
E and N points
53 DMDW-7
Processing
A supervised learning algorithm can be used to build a decision tree separating the two types of points: existing (E) and non-existing (N).
The decision tree is built using the best cut on each axis, and this best cut is based on the information gain.
Because computing the information gain only needs the proportion of each type of point in a given region, the non-existing points need not be physically added: since they are uniformly spread, their number is proportional to the area of that region.
Florin Radulescu, Note de curs
54 DMDW-7
Processing
For the existing points, the probability of each sub-region is computed by counting, as usual.
The algorithm assumes that all regions are rectangular and that the number of N points in each region is at least equal to the number of E points.
After each split of a rectangle, if the inherited number of N points is less than the number of E points, it is raised to the number of E points.
The result is a decision tree splitting the space into rectangles, some of them being clusters and the others holes.
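A rough Python sketch of how the information gain of a single cut can be computed without materializing the N points, under the assumptions above (N counts proportional to area, raised to the E count when smaller). This only illustrates the idea; it is not the exact procedure of [Liu et al. 00]:

```python
import math

def entropy(counts):
    """Shannon entropy of a distribution given by raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def split_gain(e_left, e_right, area_left, area_right, n_total):
    """Information gain of one cut of a rectangular region: the E counts are
    real counts, while the N counts are never materialized -- they are taken
    proportional to the sub-region areas and raised to the E count whenever
    they would fall below it."""
    e_total = e_left + e_right
    n_left = max(round(n_total * area_left / (area_left + area_right)), e_left)
    n_right = max(round(n_total * area_right / (area_left + area_right)), e_right)
    before = entropy([e_total, max(n_total, e_total)])
    parts = [(e_left, n_left), (e_right, n_right)]
    total = sum(e + n for e, n in parts)
    after = sum((e + n) / total * entropy([e, n]) for e, n in parts)
    return before - after

# 20 existing points, all in the left half of a region cut into equal halves
print(round(split_gain(20, 0, 1.0, 1.0, n_total=20), 3))   # positive gain
```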
Florin Radulescu, Note de curs
55 DMDW-7
Result
56 DMDW-7
Maximal hyper rectangles
57 DMDW-7
FR and MHR
Such a rectangle is called a filled region
(FR).
A maximal hyper rectangle is defined as
follows:
Definition: Given a k-dimensional continuous
space S and n FRs in S, a maximal hyper-
rectangle (MHR) in S is an empty HR that
does not intersect (in a normal sense) with
any FR, and has at least one FR lying on
each of its 2k bounding surfaces. These FRs
are called the bounding FRs of the MHR.
Florin Radulescu, Note de curs
58 DMDW-7
Algorithm
1. Let S be a k-dimensional continuous space and a
set of n FRs (not always disjoint) in S,
2. Start with one MHR, occupying the entire space
S.
3. Each FR is incrementally added to S. For each
insertion, the set of MHRs is updated:
All the existing MHRs that intersect with this FR must
be removed from the set.
For each dimension two new hyper-rectangle bounds
(lower and upper) are identified. If the new hyper-
rectangles verify the MHR definition and are
sufficiently large, insert them into the MHRs list.
Florin Radulescu, Note de curs
59 DMDW-7
Example
[Figure: two maximal hyper-rectangles (holes), H1 and H2, found between the filled regions.]
60 DMDW-7
Summary
This course presented:
K-Medoids, k-modes and k-means++ where k-medoids and k-
modes are clustering algorithms and k-means++ is a method for
determining a better than random set of initial cluster centers for
k-means.
FastMap: a multidimensional scaling algorithm to build a
Euclidean space given the distances between any two points
Cluster evaluation techniques: user inspection, ground truth comparison, cohesion and separation, silhouette (see [Rousseeuw 87]) and indirect evaluation
Clusters and holes: how to determine regions with no or few data
points
Fuzzy clustering and fuzzy C-means for performing soft
clustering.
Next week: Semi-supervised learning
61 DMDW-7
References
[Liu et al. 98] Bing Liu, Ke Wang, Lai-Fun Mun and Xin-Zhi Qi, "Using
Decision Tree Induction for Discovering Holes in Data," Pacific Rim
International Conference on Artificial Intelligence (PRICAI-98), 1998
[Liu et al. 00] Bing Liu, Yiyuan Xia, Philip S. Yu. "Clustering through decision
tree construction." Proceedings of 2000 ACM CIKM International
Conference on Information and Knowledge Management (ACM CIKM-
2000), Washington, DC, USA, November 6-11, 2000
[Liu 11] Bing Liu, 2011. Web Data Mining, Exploring Hyperlinks, Contents,
and Usage Data, Second Edition, Springer, chapter 4.
[Huang 97] Huang, Z: A fast clustering algorithm to cluster very large categorical
data sets in data mining. In: SIGMOD Workshop on Research Issues on Data
Mining and Knowledge Discovery, pp. 1-8, 1997
[Huang 98] Huang, Z: Extensions to the k-Means Algorithm for Clustering
Large Data Sets with Categorical Values, DMKD 2, 1998,
http://www.cs.ust.hk/~qyang/Teaching/537/Papers/huang98extensions.pdf
[Torgerson 52] Torgerson, W.S. (1952). Multidimensional Scaling: Theory
and Method, Psychometrika, vol 17, pp. 401-419.
[Faloutsos, Lin 95] Faloutsos, C., Lin K.I. (1995). FastMap: A Fast
Algorithm for Indexing, Data-Mining and Visualization of Traditional and
Multimedia Datasets. In: Proceedings of the 1995 ACM SIGMOD
International Conference on Management of Data.
Florin Radulescu, Note de curs
62 DMDW-7
References
[Wang et al, 99] Wang, J.T-L., Wang, X., Lin, K-I., Shasha, D., Shapiro,
B.A., Zhang, K. (1999). Evaluating a class of distance-mapping
algorithms for data mining and clustering, In: Proc of ACM KDD, pp.
307-311.
[de Silva, Tenenbaum 04] de Silva, V., Tenenbaum J.B. (2004). Sparse multi-dimensional scaling using landmark points.
[Yang et al 06] Yang, T., Liu, J., McMillan, L., Wang, W., (2006). A Fast
Approximation to Multidimensional Scaling, In: Proceedings of the
ECCV Workshop on Computation Intensive Methods for Computer
Vision (CIMCV).
[Platt 05] Platt, J.C., (2005). FastMap, MetricMap, and Landmark MDS
are all Nyström Algorithms, In: 10th International Workshop on Artificial
Intelligence and Statistics, pp. 261-268.
[Bezdek 81] Bezdek, James C. (1981). Pattern Recognition with Fuzzy
Objective Function Algorithms. Kluwer Academic Publishers Norwell,
MA, USA, ISBN 0-306-40671-3
[Rousseeuw 87] Peter J. Rousseeuw (1987). "Silhouettes: a Graphical
Aid to the Interpretation and Validation of Cluster
Analysis". Computational and Applied Mathematics 20: 53–65
63 DMDW-7
Partially Supervised Learning
Road Map
2 DMDW-8
Partially supervised learning
In supervised learning the goal is to build a
classifier starting from a set of labeled examples.
Unsupervised learning starts with a set of
unlabeled examples trying to discover the inner
structure of this set, like in clustering.
Partially supervised learning (or semi-supervised learning) includes a series of algorithms and techniques that use a (small) set of labeled examples and a (possibly large) set of unlabeled examples for performing classification or regression.
Florin Radulescu, Note de curs
3 DMDW-8
Partially supervised learning
The need for such algorithms and techniques comes from the cost of obtaining labeled examples.
Labeling is in many cases done manually by experts, so the volume of labeled examples is sometimes small.
When learning starts from a finite number of training examples in a high-dimensional space where each dimension has many possible values, the amount of training data required to ensure that there are several samples for each combination of values is huge.
4 DMDW-8
Hughes effect
For a given number of training samples the
predictive power decreases as the dimensionality
increases.
This phenomenon is called Hughes effect or
Hughes phenomenon, after Gordon F. Hughes
He published in 1968 the paper "On the mean
accuracy of statistical pattern recognizers".
Adding extra information to a small number of
labeled training examples will increase the
accuracy (by delaying the occurrence of the effect
described).
Florin Radulescu, Note de curs
5 DMDW-8
Effect of unlabeled examples
6 DMDW-8
Effect of unlabeled examples
7 DMDW-8
Effect of unlabeled examples
8 DMDW-8
Effect of unlabeled examples
9 DMDW-8
Effect of unlabeled examples
10 DMDW-8
Positive and unlabeled examples
11 DMDW-8
Positive and unlabeled examples
12 DMDW-8
Positive and unlabeled examples
13 DMDW-8
Positive and unlabeled examples
14 DMDW-8
Distribution of unlabeled examples
15 DMDW-8
MCAR
16 DMDW-8
MAR
17 DMDW-8
MNAR
18 DMDW-8
Road Map
19 DMDW-8
Learning from labeled and unlabeled data
20 DMDW-8
Learning from labeled and unlabeled data
21 DMDW-8
Co-training (Blum and Mitchell)
22 DMDW-8
Co-training (Blum and Mitchell)
23 DMDW-8
Co-training algorithm (v1)
1. Initially LA = LB = L, UA = UB = U
2. Build two classifiers, A from LA and X1 and B from
LB and X2
3. Allow A to label the set UA, obtaining L1
4. Allow B to label the set UB, obtaining L2
5. Based on confidence, select C1 from L1 and C2 from
L2 (subsets containing a number of most confident
examples for each class)
6. Add C1 to LB and subtract it from UB
7. Add C2 to LA and subtract it from UA
8. Go to step 2 until stopping criteria are met
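A minimal Python sketch of this loop, assuming two feature views given as NumPy arrays and scikit-learn logistic regression as the base classifiers; these choices, and the selection of the most confident examples without per-class balancing, are simplifying assumptions, not part of the original algorithm:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_training(X1_l, X2_l, y_l, X1_u, X2_u, rounds=10, per_round=5):
    """Classifier A sees view X1, classifier B sees view X2; in every round
    each one labels its most confident unlabeled examples for the other."""
    X1_A, y_A = X1_l.copy(), y_l.copy()        # training data of A (view 1)
    X2_B, y_B = X2_l.copy(), y_l.copy()        # training data of B (view 2)
    U = np.arange(len(X1_u))                   # indices of still-unlabeled examples
    A = B = None
    for _ in range(rounds):
        if len(U) == 0:
            break
        A = LogisticRegression(max_iter=1000).fit(X1_A, y_A)
        B = LogisticRegression(max_iter=1000).fit(X2_B, y_B)
        confA = U[np.argsort(A.predict_proba(X1_u[U]).max(axis=1))[-per_round:]]
        confB = U[np.argsort(B.predict_proba(X2_u[U]).max(axis=1))[-per_round:]]
        # A teaches B (on view 2) and B teaches A (on view 1)
        X2_B = np.vstack([X2_B, X2_u[confA]])
        y_B = np.concatenate([y_B, A.predict(X1_u[confA])])
        X1_A = np.vstack([X1_A, X1_u[confB]])
        y_A = np.concatenate([y_A, B.predict(X2_u[confB])])
        U = np.setdiff1d(U, np.union1d(confA, confB))
    return A, B
```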
Florin Radulescu, Note de curs
24 DMDW-8
Co-training (Blum and Mitchell)
The process ends when there are no more unlabeled examples or when C1 and C2 are empty.
In the latter case some unlabeled examples remain, but the confidence of their classification - for example the probability of the assigned class - is below a given threshold.
In the end, the final classifier is obtained by
combining A and B (the final two classifiers
obtained at step 2).
The experiments described in the original article
are made using a slightly different form of the
algorithm, presented on the next slide.
Florin Radulescu, Note de curs
25 DMDW-8
Co-training algorithm (v0)
1. Given:
• A set L of labeled examples
• A set U of unlabeled examples
2. Create a pool U’ of examples by choosing u examples at
random from U.
3. Loop for k iterations:
3.1. Use L to train a classifier h1 that considers only the x1 portion of
x
3.2. Use L to train a classifier h2 that considers only the x2 portion of
x
3.3. Allow h1 to label p positive and n negative examples from U’
3.4. Allow h2 to label p positive and n negative examples from U’
3.5. Add these self-labeled examples to L
3.6. Randomly choose 2p + 2n examples from U to replenish U’
Florin Radulescu, Note de curs
26 DMDW-8
Co-training algorithm (v0)
27 DMDW-8
Co-training results
28 DMDW-8
Co-training (Goldman and Zhou)
29 DMDW-8
Algorithm
1. Repeat until LA and LB do not change during iteration. For each algorithm
do
2. Train algorithm A on L ∪ LA to obtain the hypothesis HA (a hypothesis
defines a partition of the instance space). Similar for B
3. Each algorithm considers each of its equivalence classes and decides
which one to use to label data from U for the other algorithm, using two
tests. For A the tests are (similar for B):
o The class k used by A to label data for B has accuracy at least as good
as the accuracy of B.
o The conservative estimate of the class k is bigger than the conservative
estimate of B.
(The conservative estimate is an estimate of 1/ε², where ε is the hypothesis error. This prevents the degradation of B's performance due to noise.)
4. All examples in U passing these tests are placed in LB (similarly, the examples selected by B are placed in LA).
5. End Repeat
6. At the end, combine HA and HB
Florin Radulescu, Note de curs
30 DMDW-8
ASSEMBLE
ASSEMBLE is an ensemble algorithm presented in [Bennet et al, 2002].
It won the NIPS* 2001 unlabeled data competition.
It alternates between assigning "pseudo-classes" to the instances of the unlabeled data set and constructing the next base classifier using the labeled examples but also the unlabeled ones.
For the unlabeled examples the previously assigned pseudo-class is used.
*NIPS = Neural Information Processing Systems Conference
31 DMDW-8
ASSEMBLE - advantages
Any weight-sensitive classification algorithm can be
boosted using labeled and unlabeled data.
ASSEMBLE can exploit unlabeled data to reduce the number of classifiers needed in the ensemble, therefore speeding up learning.
ASSEMBLE works well in practice.
Computational results show the approach is
effective on a number of test problems, producing
more accurate ensembles than AdaBoost using the
same number of base learners.
32 DMDW-8
Re-weighting
Re-weighting is a technique for reject-inferencing in
credit scoring presented in [Crook, Banasik, 2002].
The main idea is to extrapolate information on the
examples from approved credit applications to the
unlabeled data.
Re-weighting may be used if the data is of the MAR type, so that the population model for all applicants is the same as the model for the accepted applicants only:
P(y=1 | x, labeled=1) = P(y=1 | x, labeled=0) = P(y=1 | x)
So, for a given x, the distribution of the examples having a certain label is the same in the labeled and in the unlabeled set.
Florin Radulescu, Note de curs
33 DMDW-8
Re-weighting
All credit institutions have an archive of approved
applications and for each of these applications
there is also a Good/Bad performance label.
Based on the classification variables used to
accept/reject an application, applications (past-
labeled but also those unlabeled) can be scored
and partitioned in score groups.
For every score group, the class distribution of the labeled examples is then extrapolated to the unlabeled examples of the same score group, picking examples from it at random.
Florin Radulescu, Note de curs
34 DMDW-8
Re-weighting example
Score group | Unlabeled (XU) | Labeled (XL) | class0 (labeled) | class1 (labeled) | Weight (XL+XU)/XL | class0 re-weighted | class1 re-weighted
0.0-0.2     | 10             | 10           | 6                | 4                | 2                 | 12                 | 8
0.2-0.4     | 10             | 20           | 10               | 10               | 1.5               | 15                 | 15
0.4-0.6     | 20             | 60           | 20               | 40               | 1.33              | 27                 | 53
35 DMDW-8
Re-weighting example
The group weight is computed as (XL + XU) / XL.
For every score group the weight is used to compute the number of examples of class0 and class1 in the whole score group (labeled and unlabeled).
Example: for score group 0.8-1.0 the weight is 1.1, so re-weighting class0 and class1 we obtain 10 * 1.1 = 11 for class0 and 190 * 1.1 = 209 for class1.
This means that we pick at random 11 - 10 = 1 example from the unlabeled set and label it as class0, and the remaining 209 - 190 = 19 examples (the rest of them, 20 - 1) are labeled as class1.
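A small Python sketch of this computation for one score group, using the numbers of the 0.8-1.0 example above (function and variable names are illustrative):

```python
import random

def reweight_group(n_labeled, unlabeled_ids, labeled_class_counts, seed=0):
    """One score group: compute the weight (XL + XU) / XL and assign the
    missing class0/class1 examples to randomly picked unlabeled ids."""
    weight = (n_labeled + len(unlabeled_ids)) / n_labeled
    pool = list(unlabeled_ids)
    random.Random(seed).shuffle(pool)
    assigned, start = {}, 0
    for cls, count in labeled_class_counts.items():
        extra = round(count * weight) - count     # how many to take from U
        assigned[cls] = pool[start:start + extra]
        start += extra
    return weight, assigned

# Score group 0.8-1.0: 200 labeled (10 class0, 190 class1) and 20 unlabeled
w, assigned = reweight_group(200, range(20), {"class0": 10, "class1": 190})
print(w, {c: len(ids) for c, ids in assigned.items()})   # 1.1 {'class0': 1, 'class1': 19}
```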
Florin Radulescu, Note de curs
36 DMDW-8
Re-weighting example
This procedure is run for every score group.
At the end, all unlabeled examples have a
class0/class1 label.
Note that class0/class1 is not the same as
rejected/accepted; the initial set of labeled
examples contains only accepted applications!
Using the whole set of examples (L+U), now all having a class0/class1 label, we can learn a new classifier that incorporates not only the data from the labeled examples but also information from the unlabeled ones.
Florin Radulescu, Note de curs
37 DMDW-8
Expectation-Maximization
Expectation-maximization is an iterative method
for finding maximum likelihood or maximum a
posteriori (MAP) estimates of parameters in
statistical models, where the model depends on
unobserved latent variables (see Wikipedia).
It consists of an iterative process with two steps:
1. Expectation step: using the current estimates of the parameters, guess a probability distribution over completions of the missing data
2. Maximization step: compute new estimates of the parameters using these completions
Florin Radulescu, Note de curs
38 DMDW-8
Expectation-Maximization
In [Liu 11] and [Nigam et al, 98] the process is
described as follows:
Initial: Train a classifier using only the set of
labeled documents.
Loop:
Use this classifier to label (probabilistically) the
unlabeled documents (E step)
Use all the documents to train a new classifier (M
step)
Until convergence.
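A minimal sketch of this loop, assuming scikit-learn's multinomial Naïve Bayes over word-count features; the probabilistic (soft) labels of the E step are passed to the M step as sample weights. This illustrates the loop above and is not the exact implementation from [Nigam et al, 98]:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_nb(X_l, y_l, X_u, n_iter=10):
    """E step: probabilistically label the unlabeled documents.
    M step: retrain on all documents, each unlabeled document being added
    once per class with its class probability as sample weight."""
    classes = np.unique(y_l)
    clf = MultinomialNB().fit(X_l, y_l)              # initial classifier (labeled only)
    for _ in range(n_iter):
        P = clf.predict_proba(X_u)                   # E step: Pr(c_j | d_i)
        X_all = np.vstack([X_l] + [X_u] * len(classes))
        y_all = np.concatenate([y_l] + [np.full(len(X_u), c) for c in classes])
        w_all = np.concatenate([np.ones(len(X_l))] + [P[:, k] for k in range(len(classes))])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)   # M step
    return clf
```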
Florin Radulescu, Note de curs
39 DMDW-8
Expectation-Maximization
For Naïve Bayes, the expectation step means computing for every class cj and every unlabeled document di the probability Pr(cj | di; Θ).
Notations are:
cj – class cj
D – the set of documents
di – a document di in D
V – the word vocabulary (set of significant words)
wdi,k – the word in position k in document di
Nti – the number of times that word wt occurs in document di
Θ – the set of parameters of all components, Θ = {π1, π2, …, πK, θ1, θ2, …, θK}; πj is the mixture weight (or mixture probability) of mixture component j and θj is the set of parameters of component j. K is the number of mixture components.
40 DMDW-8
Expectation step
41 DMDW-8
Maximization step
With Laplace smoothing (as in [Nigam et al, 98]), the new parameter estimates are:
Pr(wt | cj; Θ) = (1 + Σ_{i=1..|D|} Nti · Pr(cj | di)) / (|V| + Σ_{s=1..|V|} Σ_{i=1..|D|} Nsi · Pr(cj | di))
Pr(cj | Θ) = (1 + Σ_{i=1..|D|} Pr(cj | di)) / (K + |D|)
42 DMDW-8
Expectation-Maximization
The EM algorithm works well if the two mixture-model assumptions hold for a particular data set:
o The data (or the text documents) are generated
by a mixture model,
o There is one-to-one correspondence between
mixture components and document classes.
In many real-life situations these two assumptions
are not met.
For example, the class Sports may contain
documents about different sub-classes such as
Football, Tennis, and Handball.
Florin Radulescu, Note de curs
43 DMDW-8
Road Map
44 DMDW-8
Positive and unlabeled data
Sometimes all labeled examples are only from the
positive class. Examples (see [Liu 11]):
Given a collection of papers on semi-supervised learning,
find all semi-supervised learning papers in proceeding or
another collection of documents.
Given the browser bookmarks of a person, find other
documents that may be interesting for that person.
Given the list of customers of a direct marketing company,
identify other persons (from a person database) that may
be also interested in those products.
Given the approved and good (as performance)
applications from a credit company, identify other persons
that may be interested in getting a credit.
45 DMDW-8
Theoretical foundation
Suppose we have a classification function f and an input vector X labeled with class Y, where Y ∈ {1, -1}. We rewrite the probability of error:
Pr[f(X) ≠ Y] = Pr[f(X) = 1 and Y = -1] + Pr[f(X) = -1 and Y = 1]   (1)
Because:
Pr[f(X) = 1 and Y = -1] = Pr[f(X) = 1] – Pr[f(X) = 1 and Y = 1] = Pr[f(X) = 1] – (Pr[Y = 1] – Pr[f(X) = -1 and Y = 1]),
replacing in (1) we obtain:
Pr[f(X) ≠ Y] = Pr[f(X) = 1] – Pr[Y = 1] + 2·Pr[f(X) = -1 | Y = 1]·Pr[Y = 1]   (2)
Pr[Y = 1] is constant.
If Pr[f(X) = -1|Y = 1] is small minimizing error is approximately
the same as minimizing Pr[f(X) = 1].
Florin Radulescu, Note de curs
46 DMDW-8
Theoretical foundation
If the sets of positive examples P and unlabeled
examples U are large, holding Pr[f(X) = -1|Y = 1]
small while minimizing Pr[f(X) = 1] is
approximately the same as:
o minimizing PrU[f(X) = 1]
o while holding PrP[f(X) = 1] ≥ r (where r is recall
Pr[f(X)=1| Y=1]) which is the same as (PrP[f(X) = -1] ≤
1 – r)
In other words:
o The algorithm tries to minimize the number of
unlabeled examples labeled as positive
o Subject to the constraint that the fraction of errors
on the positive examples is no more than 1-r.
Florin Radulescu, Note de curs
47 DMDW-8
2-step strategy
For implementing the theory above there is a 2-step strategy
(presented in [Liu 11]):
Step 1: Identify in the unlabeled examples a subset called
“reliable negatives” (RN).
These examples will be used as negative labeled examples in
the next step.
We start with only positive examples but must build a negative
labeled set in order to use a supervised learning algorithm for
building the model (classifier)
Step 2: Build a sequence of classifiers by iteratively applying
a classification algorithm and then selecting a good classifier.
In this step we can use Expectation Maximization or SVM for
example.
48 DMDW-8
Obtaining reliable negatives (RN)
49 DMDW-8
Spy technique
In this technique, a set S of positive documents is first randomly selected from P and put into U.
These examples are the spies.
They behave identically to the unknown positive documents hidden in U.
Then, using the I-EM algorithm with (P – S) as positive and U ∪ S as negative, a classifier is obtained.
The probabilities assigned to the documents in S are used to decide a probability threshold th that identifies possible negative documents in U:
all documents with a probability lower than that of any spy are assigned to RN.
Florin Radulescu, Note de curs
50 DMDW-8
Spy algorithm
1. RN = {};
2. S = Sample(P, s%);
3. US = U ∪ S;
4. PS = P – S;
5. Assign each document in PS the class label 1;
6. Assign each document in US the class label -1;
7. I-EM(US, PS); // This produces a Naïve Bayes classifier.
8. Classify each document in US using the NB classifier;
9. Determine a probability threshold th using S;
10. For each document d ∈ US
11.   If its probability Pr(1|d) < th
12.   Then RN = RN ∪ {d};
13.   End If
14. End For
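A small Python sketch of the last steps (threshold selection and RN extraction); with noise=0 the threshold is simply the lowest spy probability, as described above, and the optional noise fraction is an assumption borrowed from the usual presentation of the spy technique:

```python
import numpy as np

def reliable_negatives_by_spies(prob_u, prob_spies, noise=0.0):
    """Pick the threshold th from the spies' probabilities and return the
    indices of the documents in U with Pr(positive | d) below th."""
    spies = np.sort(np.asarray(prob_spies, dtype=float))
    th = spies[int(noise * len(spies))]     # noise=0: th = lowest spy probability
    return np.where(np.asarray(prob_u, dtype=float) < th)[0]

# Illustrative probabilities produced by the I-EM / NB classifier
print(reliable_negatives_by_spies(prob_u=[0.05, 0.4, 0.9, 0.1],
                                  prob_spies=[0.35, 0.6, 0.8, 0.9]))   # [0 3]
```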
Florin Radulescu, Note de curs
51 DMDW-8
1-DNF algorithm
The algorithm builds a so-called positive
feature set (PF) containing words that occur
in the positive examples set of documents P
more frequently than in the unlabeled
examples set U.
Then, using PF, it tries to identify (and filter out) possible positive documents from U.
A document in U that does not contain any positive feature from PF is regarded as a strong negative document.
Florin Radulescu, Note de curs
52 DMDW-8
Algorithm
1. PF = {}
2. For i = 1 to n
3.   If (freq(wi, P) / |P| > freq(wi, U) / |U|)
4.   Then PF = PF ∪ {wi}
5.   End if
6. End for
7. RN = U;
8. For each document d ∈ U
9.   If (∃ wi such that freq(wi, d) > 0 and wi ∈ PF)
10.  Then RN = RN – {d}
11.  End if
12. End for
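A runnable Python sketch of the same algorithm, assuming documents are given as lists of words and counting, for simplicity, in how many documents each word occurs (names are illustrative):

```python
from collections import Counter

def one_dnf_reliable_negatives(P_docs, U_docs):
    """Documents are lists of words; a word is a positive feature if it is
    relatively more frequent in P than in U, and a document of U with no
    positive feature goes into RN."""
    freq_P = Counter(w for d in P_docs for w in set(d))
    freq_U = Counter(w for d in U_docs for w in set(d))
    vocab = set(freq_P) | set(freq_U)
    PF = {w for w in vocab
          if freq_P[w] / len(P_docs) > freq_U[w] / len(U_docs)}
    return [d for d in U_docs if not (set(d) & PF)]

P = [["machine", "learning", "model"], ["learning", "classifier"]]
U = [["football", "match"], ["learning", "rate"], ["weather", "report"]]
print(one_dnf_reliable_negatives(P, U))   # documents without any positive feature
```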
53 DMDW-8
Naïve Bayes
In this case, a classifier is built considering all
unlabeled examples as negative. Then the
classifier is used to classify U and the negative
labeled examples form the reliable negative set.
The algorithm is :
1. Assign label 1 to each document in P;
2. Assign label –1 to each document in U;
3. Build a NB classifier using P and U;
4. Use the classifier to classify U. Those documents in
U that are classified as negative form the reliable
negative set RN.
Florin Radulescu, Note de curs
54 DMDW-8
Rocchio
The algorithm for building RN is the same as for Naïve Bayes, with the difference that at step 3 a Rocchio classifier is built instead of a Naïve Bayes one.
Rocchio builds a prototype vector for each class (a vector describing all documents in the class) and then uses the cosine similarity to find the class of the test examples: the class of the prototype most similar to the given example.
Florin Radulescu, Note de curs
55 DMDW-8
Summary
56 DMDW-8
References
[Liu 11] Bing Liu, 2011. Web Data Mining, Exploring Hyperlinks, Contents, and Usage
Data, Second Edition, Springer.
[Chawla, Karakoulas 2005] Nitesh V. Chawla, Grigoris Karakoulas, Learning From
Labeled And Unlabeled Data: An Empirical Study Across Techniques And Domains,
Journal of Artificial Intelligence Research, volume 23, 2005, pages 331-366.
[Nigam et al, 98] Kamal Nigam, Andrew McCallum, Sebastian Thrun, Tom Mitchell,
Using EM to Classify Text from Labeled and Unlabeled Documents, Technical Report
CMU-CS-98-120. Carnegie Mellon University. 1998
[Blum, Mitchell, 98] Blum, A., Mitchell, T. Combining labeled and unlabeled data with co-training, Procs. of Workshop on Computational Learning Theory, 1998.
[Goldman, Zhou, 2000] Sally Goldman, Yan Zhou, Enhancing Supervised Learning with
Unlabeled Data, Proceedings of the Seventeenth International Conference on Machine
Learning (ICML), 2000, pages 327 – 334
[Bennet et al, 2002] Bennet, K., Demiriz, A., Maclin, R., Exploiting unlabeled data in
ensemble methods, Procs. Of the 6th Intl. Conf. on Knowledge Discovery and
Databases, 2002, pages 289-296.
[Crook, Banasik, 2002] Sample selection bias in credit scoring models, Intl. Conf.on
Credit Risk Modeling and Decisioning, 2002.
[Zhang, Zuo 2009] Bangzuo Zhang, Wanli Zuo, Reliable Negative Extracting Based on
kNN for Learning from Positive and Unlabeled Examples, Journal of Computers, vol. 4, no. 1, 2009
Florin Radulescu, Note de curs
2 DMDW-9
Objectives
Weblog mining methods, techniques and algorithms are intended to discover patterns in the clickstreams recorded by web servers, and also profiles of the users interacting with them.
The input data are:
1. Web server logs, particularly the access logs. A web server also maintains other logs (for example error logs) that are not discussed in this lesson
Florin Radulescu, Note de curs
3 DMDW-9
Objectives
The input data are – cont.:
2. Site structure. The link structure of the site is used to perform path completion. This means that pages seen in the browser window but not requested from the web server because of (proxy or local) caching are determined using this structure
3. Site content. The content of each page can be used to attach different event labels (product view, buy/bid, etc.) to the pages, for a better understanding of surfer behavior.
Florin Radulescu, Note de curs
4 DMDW-9
Objectives
5 DMDW-9
Tasks in web mining
There are four types of tasks in web mining (see
[Kosala, Blockeel, 2000]):
1. Resource finding: the task of retrieving intended
Web documents.
2. Information selection and pre-processing:
automatically selecting and pre-processing specific
information from retrieved Web resources.
3. Generalization: automatically discovers general
patterns at individual Web sites as well as across
multiple sites.
4. Analysis: validation and/or interpretation of the
mined patterns.
Florin Radulescu, Note de curs
6 DMDW-9
Categories of tasks
7 DMDW-9
Web content mining
Web content mining is dedicated to the
extraction and integration of data, information
and knowledge from Web page contents, no
matter the structure of the website.
The hyperlinks contained in each page or the
hyperlinks pointing to them are not relevant in
that case, only the information content.
In [Cooley et al, 97] web content mining is
also split in two approaches:
the agent-based approach and
the database approach.
Florin Radulescu, Note de curs
8 DMDW-9
Agent based approach
The objective is to build intelligent tools for information
retrieval:
Intelligent Search Agents. In this case, intelligent Web
agents are developed. These agents search for relevant
information using domain characteristics and user profiles,
then organize and interpret the discovered information.
Information Filtering/Categorization. In this case, the
agents use information retrieval techniques and
characteristics of open hypertext Web documents to
automatically retrieve, filter, and categorize them.
Personalized Web Agents. In the third case, the agents learn about user preferences and discover Web information based on them (preferences of similar users may also be used).
9 DMDW-9
Database approach
The objectives involve improving the management of the semi-structured data available on the Web.
Multilevel Databases. At the lowest level of the database there is semi-structured information stored in Web repositories (hypertext documents); at the higher levels, metadata or generalizations are extracted and organized using the relational or object-oriented model
Web Query Systems. In this case, specialized query languages are used for querying the Web. Examples are W3QL, WebLog, Lorel, UnQL, etc.
Florin Radulescu, Note de curs
10 DMDW-9
Web structure mining
Web structure mining uses graph theory to
analyze the node and connection structure of a
web site (see also [Wikipedia]). The new research
area emerged in the domain is called Link Mining.
The following summarization of link mining is from
[da Costa, Gong 2005]:
1. Link-based Classification. In this case the task is
to focus on the prediction of the category of a web
page, based on words that occur on the page, links
between pages, anchor text, html tags and other
possible attributes found on the web page.
11 DMDW-9
Web structure mining
Summarization of link mining – cont.:
2. Link-based Cluster Analysis. Cluster analysis finds
naturally occurring sub-classes. In that case the data
is clustered with similar objects in the same cluster,
and dissimilar objects in different clusters. Link-
based cluster analysis is unsupervised so it can be
used to discover hidden patterns in data.
3. Link Type. The goal is to predict the existence of
links, the type of link, or the purpose of a link.
4. Link Strength. In this approach links are weighted
(importance, etc).
5. Link Cardinality. The goal is to compute a
prediction for the number of links between objects.
Florin Radulescu, Note de curs
12 DMDW-9
Applications
13 DMDW-9
Applications
Authorities are pages containing information about a topic, and hubs are pages that do not contain actual information but links to pages containing topic information.
The hub and authority measures are computed recursively: the authority measure of a page is the sum of the hub measures of the hubs pointing at it, and the hub measure is the sum of the authority measures of the pages referred to by that page.
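A minimal Python sketch of this recursive computation (a HITS-style iteration), assuming the link structure is given as an adjacency matrix and normalizing the scores at every step so they converge:

```python
import numpy as np

def hits(adj, n_iter=50):
    """Authority score = sum of the hub scores of the pages pointing to the
    page; hub score = sum of the authority scores of the pages it points to."""
    A = np.asarray(adj, dtype=float)    # A[i, j] = 1 if page i links to page j
    hub = np.ones(A.shape[0])
    auth = np.ones(A.shape[0])
    for _ in range(n_iter):
        auth = A.T @ hub
        hub = A @ auth
        auth /= np.linalg.norm(auth)
        hub /= np.linalg.norm(hub)
    return hub, auth

# Tiny web: pages 0 and 1 both link to page 2
hub, auth = hits([[0, 0, 1], [0, 0, 1], [0, 0, 0]])
print(np.round(hub, 2), np.round(auth, 2))   # page 2 collects the authority score
```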
Florin Radulescu, Note de curs
14 DMDW-9
Web usage mining
Web usage mining tries to predict user behavior when interacting
with the Web. This is the main topic to be discussed in detail in this
lesson.
Data involved in web usage mining may be classified in four
categories:
1. Usage data. Here we have server, client and proxy logs. Several problems are encountered in identifying users and sessions based on their IP address (see [Srivastava et al., 2000]):
o Single IP address / Multiple Server Sessions: because several users access the web server via an ISP and the provider allows access through some proxies, many users have the same IP address in the web server access log in the same period.
15 DMDW-9
Web usage mining
Problems encountered - cont.:
o Multiple IP address / Single Server Session: also
because of the ISP policy, accesses of the same user
session can be assigned to different proxies, so having
different IP addresses in the web server access log.
o Multiple IP address / Single User: the same user
accessing the web from different computers will be
recorded with different IP addresses for different
sessions.
o Multiple agent / Single User: The same user may use
several browsers, even on the same computer, so will
be recorded in the log files with different user agents.
16 DMDW-9
Web usage mining
2. Content data. The website contains documents in HTML or other formats, or dynamic pages generated from scripts and related databases.
The content of a page can be used for associating events or other semantics that can be used in the web usage mining process.
Web pages also contain metadata such as descriptive keywords, document attributes, semantic tags, etc.
Florin Radulescu, Note de curs
17 DMDW-9
Web usage mining
18 DMDW-9
Web usage mining
19 DMDW-9
Road Map
20 DMDW-9
Web log formats
Example:
127.0.0.1 - frank [10/Oct/2015:13:55:36 -0700] "GET
/apache_pb.gif HTTP/1.0" 200 2326
21 DMDW-9
What is everything
Elements from the previous definition:
Field – Description
Remote host address – The IP address of the client that made the request.
Remote log name – Usually not used. It was provided for the case of a client machine running an ident protocol server (identd) - see RFC 1413.
User name – The name of the authenticated user that accessed the server. Anonymous users are indicated by a hyphen. The best practice is for the application always to provide the user name.
Date, time, and GMT offset – The local date and time at which the activity occurred. The offset from Greenwich mean time is also indicated.
Request and protocol version – The HTTP protocol version that the client used.
Service status code – The HTTP status code. (A value of 200 indicates that the request completed successfully.)
22 DMDW-9
What is everything
For the previous example:
23 DMDW-9
Combined Log File Format
24 DMDW-9
Example
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET
/apache_pb.gif HTTP/1.0" 200 2326
"http://www.example.com/start.html" "Mozilla/4.08 [en]
(Win98; I ;Nav)"
The first seven fields are the same.
The last two fields indicate the referrer (start.html from www.example.com) and the user agent (Netscape).
Mozilla was originally the codename for the defunct Netscape Navigator software project, along with Netscape's mascot, a cartoon reptile inspired by Godzilla - see [Wikipedia]. Now it denotes the Firefox browser.
There is also an Extended Log File Format.
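A small Python sketch that parses such a combined-format line with a simplified regular expression (the pattern is an illustration, not a complete parser for every legal log line):

```python
import re

# Combined Log File Format: the Common Log Format fields plus referrer and user agent
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"')

line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" '
        '200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"')

m = COMBINED.match(line)
print(m.group('host'), m.group('status'), m.group('referrer'), m.group('agent'))
```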
Florin Radulescu, Note de curs
25 DMDW-9
Road Map
26 DMDW-9
Statistic approaches
For obtaining statistics about a website there
are two possibilities:
1. Local statistics. There are several
packages that analyze the log file of the
webserver and present detailed statistics
about the accesses recorded in them. Some
examples are: Analog, W3Perl, AWStats,
Webalizer, etc.
2. External statistics. In this case behavioral
information cannot be obtained, only
statistics about visitors.
Florin Radulescu, Note de curs
27 DMDW-9
Examples
28 DMDW-9
Examples – cont.
29 DMDW-9
Examples – cont.
30 DMDW-9
Road Map
31 DMDW-9
Data mining approaches
32 DMDW-9
The web usage mining process
Structure:
Site files
(site content)
33 DMDW-9
Data preprocessing
34 DMDW-9
Data cleaning
35 DMDW-9
Pageview identification
36 DMDW-9
Pageview identification
37 DMDW-9
Example
• 78.96.80.227 - - [20/Nov/2017:00:25:14 +0200] "GET /scoaladevara/ HTTP/1.1" 200 765 "-" "Mozilla/5.0
(Linux; Android 7.0; SM-G930F Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/62.0.3202.84 Mobile Safari/537.36"
• 78.96.80.227 - - [20/Nov/2017:00:25:14 +0200] "GET /scoaladevara/mit.css HTTP/1.1" 200 855
"http://info.cs.pub.ro/scoaladevara/" "Mozilla/5.0 (Linux; Android 7.0; SM-G930F Build/NRD90M)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36"
• 78.96.80.227 - - [20/Nov/2017:00:25:14 +0200] "GET /scoaladevara/stanga.html HTTP/1.1" 200 810
"http://info.cs.pub.ro/scoaladevara/" "Mozilla/5.0 (Linux; Android 7.0; SM-G930F Build/NRD90M)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36"
• 78.96.80.227 - - [20/Nov/2017:00:25:14 +0200] "GET /scoaladevara/sus.html HTTP/1.1" 200 597
"http://info.cs.pub.ro/scoaladevara/" "Mozilla/5.0 (Linux; Android 7.0; SM-G930F Build/NRD90M)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36"
• 78.96.80.227 - - [20/Nov/2017:00:25:14 +0200] "GET /scoaladevara/intre.html HTTP/1.1" 200 428
"http://info.cs.pub.ro/scoaladevara/" "Mozilla/5.0 (Linux; Android 7.0; SM-G930F Build/NRD90M)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36"
• 78.96.80.227 - - [20/Nov/2017:00:25:14 +0200] "GET /scoaladevara/program.html HTTP/1.1" 200 1357
"http://info.cs.pub.ro/scoaladevara/stanga.html" "Mozilla/5.0 (Linux; Android 7.0; SM-G930F
Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36"
• 78.96.80.227 - - [20/Nov/2017:00:25:14 +0200] "GET /scoaladevara/logo.png HTTP/1.1" 200 16696
"http://info.cs.pub.ro/scoaladevara/sus.html" "Mozilla/5.0 (Linux; Android 7.0; SM-G930F Build/NRD90M)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36"
• etc
38 DMDW-9
User identification
39 DMDW-9
User identification
b) Cookies: In the absence of authentication
facilities, client side cookies may be used.
o A cookie is a unique piece of information
(like a passport) issued by the web server
and sent to the browser and subsequently
used by the browser to access pages on that
web server.
o In that way each cookie identifies a user
session but the cookie can live beyond the
session and be recognized in subsequent
sessions of the same user.
Florin Radulescu, Note de curs
40 DMDW-9
User identification
41 DMDW-9
Example
42 DMDW-9
Web server log
43 DMDW-9
User 1
User1:
Time  Client IP  Req. URL  Ref. URL  User Agent
12:55 1.2.3.4 A - Chrome20;Win7
12:59 1.2.3.4 B A Chrome20;Win7
13:04 1.2.3.4 D B Chrome20;Win7
13:13 1.2.3.4 E B Chrome20;Win7
13:17 1.2.3.4 C A Chrome20;Win7
13:18 1.2.3.4 A - Chrome20;Win7
13:21 1.2.3.4 C A Chrome20;Win7
13:24 1.2.3.4 G C Chrome20;Win7
13:26 1.2.3.4 B A Chrome20;Win7
13:31 1.2.3.4 E B Chrome20;Win7
44 DMDW-9
User2 and User3
User2:
Time  Client IP  Req. URL  Ref. URL  User Agent
13:14  1.2.3.4  B  -  FireFox9;Win7
13:16  1.2.3.4  D  B  FireFox9;Win7
13:19  1.2.3.4  E  B  FireFox9;Win7
13:22  1.2.3.4  A  B  FireFox9;Win7
13:25  1.2.3.4  C  A  FireFox9;Win7
13:28  1.2.3.4  G  C  FireFox9;Win7
User3:
Time  Client IP  Req. URL  Ref. URL  User Agent
13:10 2.3.4.5 C - IE9;WinXP;SP1
13:15 2.3.4.5 F C IE9;WinXP;SP1
13:20 2.3.4.5 A C IE9;WinXP;SP1
13:23 2.3.4.5 B A IE9;WinXP;SP1
45 DMDW-9
Sessionization
Session identification (sessionization): the web activity of a user is segmented into sessions.
As a general idea, a user session begins when the user opens the browser window and ends when that window is closed.
A user session contains visits to several websites, and on each website it is recorded as a session in that web server's log.
From the point of view of a single web server, only the pageviews from that server are known, and they represent the user session.
Florin Radulescu, Note de curs
46 DMDW-9
Sessionization
There are several methods to identify user
sessions:
a) Authentication and cookies, discussed earlier.
b) Embedded session IDs: at the beginning of a new
session the server generates a unique session ID.
Web pages are dynamically generated and the ID
is contained in every link, so subsequent hits are
recognized.
c) Software agents: programs loaded into the
browsers that send back usage data.
d) Heuristics: when the above methods are not
available, several heuristics may be used to split the
activity of a user into sessions.
Florin Radulescu, Note de curs
47 DMDW-9
Heuristics
Some known heuristics for sessionization are:
1. The duration of a session is limited to a given amount of time (for example 20 minutes)
2. A session ends when the time spent on a webpage exceeds a given amount of time (for example, if between two successive hits there are more than 20 minutes, a new session begins there)
3. Pageviews in a session are linked. If a pageview is not accessible from an open session, it starts a new session. Note that the same user may have several open sessions at the same time (several different browser windows pointing to the same web server).
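A minimal Python sketch of heuristic 2, assuming the hits of a single user are available as (timestamp, URL) pairs sorted by time (names and the 20-minute default are illustrative):

```python
from datetime import datetime, timedelta

def sessionize(hits, timeout_minutes=20):
    """Split the hits of one user (a time-sorted list of (timestamp, url)
    pairs) into sessions whenever the gap between two successive hits
    exceeds the timeout."""
    sessions, current = [], []
    for ts, url in hits:
        if current and ts - current[-1][0] > timedelta(minutes=timeout_minutes):
            sessions.append(current)
            current = []
        current.append((ts, url))
    if current:
        sessions.append(current)
    return sessions

hits = [(datetime(2017, 11, 20, 0, 25), "A"),
        (datetime(2017, 11, 20, 0, 30), "B"),
        (datetime(2017, 11, 20, 1, 10), "C")]      # 40-minute gap before C
print([[u for _, u in s] for s in sessionize(hits)])   # [['A', 'B'], ['C']]
```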
Florin Radulescu, Note de curs
48 DMDW-9
Example
49 DMDW-9
Episode identification
50 DMDW-9
Path completition
Path completition: Because of the cache
that browsers and proxies implement, some
pageviews are not requested from the web
servers but are directly served by the proxy
or the browser cache is used to display it.
In that case the web server log do not contain
entries for that pageview.
The obvious example for that situation is
pressing the “Back” button of the browser. In
that case the cached version of the previous
page is displayed in most of the cases.
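A minimal Python sketch of path completion based only on the referrer field, assuming the hits of one session are (page, referrer) pairs in chronological order; the site structure could additionally be used to validate the re-inserted pages:

```python
def complete_path(session):
    """When the referrer of a hit is not the previously requested page, the
    user presumably navigated through cached pages (e.g. with the "Back"
    button): the earlier pages are re-inserted until the referrer is reached."""
    path = []
    for page, ref in session:
        if path and ref is not None and path[-1] != ref:
            for back in reversed(path[:-1]):     # walk back to the referrer
                path.append(back)
                if back == ref:
                    break
        path.append(page)
    return path

# A -> B -> D, then "Back" twice (D and B served from the cache) and on to C
session = [("A", None), ("B", "A"), ("D", "B"), ("C", "A")]
print(complete_path(session))   # ['A', 'B', 'D', 'B', 'A', 'C']
```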
Florin Radulescu, Note de curs
51 DMDW-9
Example
For the web site structure:
A
B C
D E F G
52 DMDW-9
Data integration and event identification
53 DMDW-9
Events
At this moment some pageviews or some successions of pageviews can be associated with specific events.
Identifying events adds more semantics to the user sessions, semantics that may be used in the further analysis process.
Examples:
o Product view: a pageview where a product is displayed
o Product click-through: when the user clicks on a product to display more data about it
o Shopping cart change: when a user adds or removes a product in the shopping cart
o Buy: when the shopping cart is validated and the customer finalizes the buying transaction
Florin Radulescu, Note de curs
54 DMDW-9
Pattern discovery
55 DMDW-9
Statistical Analysis
56 DMDW-9
Association rules
57 DMDW-9
Clustering
Clustering. Clustering algorithms can be used for
discovering usage clusters and page clusters.
In the first case users with similar surfing behavior
are discovered (each cluster contains similar
users).
This may be used for market segmentation and
personalization.
In the second case, clusters contain web pages that are similar or related based on their content.
These clusters can be used by search engines for better results and also for recommendation purposes.
Florin Radulescu, Note de curs
58 DMDW-9
Classification
59 DMDW-9
Sequential patterns discovery
60 DMDW-9
The GSP algorithm
61 DMDW-9
The GSP algorithm
62 DMDW-9
GSP vs. Apriori
63 DMDW-9
Dependency modeling
64 DMDW-9
Summary
65 DMDW-9
References
[Liu 11] Bing Liu, Web Data Mining, Exploring Hyperlinks,
Contents, and Usage Data, Second Edition, Springer, 2011,
chapter 12
[Wikipedia] Wikipedia, the free encyclopedia, en.wikipedia.org
[W3.org 1] Logging Control In W3C httpd, page visited June 1,
2012: http://www.w3.org/Daemon/User/Config/Logging.html
[W3.org 2] Extended Log File Format, page visited June 1,
2012: http://www.w3.org/TR/WD-logfile.html
[Apache.org 1] Apache HTTP Server Version 2.4, Log files,
page visited June 1, 2012:
http://httpd.apache.org/docs/2.4/logs.html
66 DMDW-9
References
[Kosala, Blockeel, 2000] Raymond Kosala, Hendrik Blockeel, Web
Mining Research: A Survey, ACM SIGKDD Explorations Newsletter,
June 2000, Volume 2 Issue 1.
[Cooley et al, 97] Cooley, R.; Mobasher, B.; Srivastava, J.; Web
mining: information and pattern discovery on the World Wide Web.
Tools with Artificial Intelligence, 1997, Ninth IEEE International
Conference.
[da Costa, Gong 2005] Miguel Gomes da Costa Júnior, Zhiguo Gong, Web Structure Mining: An Introduction, Proceedings of the
2005 IEEE International Conference on Information Acquisition June
27 - July 3, 2005, Hong Kong and Macau, China
[Srivastava et al., 2000] J. Srivastava, R. Cooley, M.Deshpande,
P.Tan, Web usage mining: discovery and applications of web usage
patterns from web data, SIGKDD Explorations, Volume 1(2), 2000,
available at http://www.sigkdd.org/explorations/
Florin Radulescu, Note de curs
67 DMDW-9
Data warehousing - introduction
2 DMDW-10
Foreword
The goal of this lesson is to present a
comprehensive introduction to Data warehousing,
with definitions of the main terms used.
The lesson is a summary of the scientific literature
of the domain, based mainly on the books
published by two authors:
W.H. Inmon, the originator of the term Data
Warehousing
R. Kimball, who developed the dimensional
methodology (known also as Kimball methodology)
which has become a standard in the area of decision
support.
Florin Radulescu, Note de curs
3 DMDW-10
Definitions
Wikipedia:
Data warehouse is a repository of an
organization's electronically stored data.
Data warehouses are designed to facilitate
reporting and analysis.
A data warehouse houses a standardized,
consistent, clean and integrated form of data
sourced from various operational systems in
use in the organization, structured in a way to
specifically address the reporting and analytic
requirements.
Florin Radulescu, Note de curs
4 DMDW-10
Definitions
R. Kimball (see [Kimball, Ross, 2002]):
A data warehouse is a copy of transactional
data specifically structured for querying and
analysis.
According to this definition:
The form of the stored data (RDBMS, flat file) is
not linked with the definition of a data warehouse.
Data warehousing is not linked exclusively with
"decision makers" or used in the process of
decision making.
Florin Radulescu, Note de curs
5 DMDW-10
Definitions
W.H. Inmon (see [Inmon 2002]):
A data warehouse is a:
subject-oriented,
integrated,
nonvolatile,
time-variant
collection of data in support of management’s
decisions.
The data warehouse contains granular
corporate data.
Florin Radulescu, Note de curs
6 DMDW-10
Defintion explained
7 DMDW-10
Subject-oriented
8 DMDW-10
Subject-oriented
For each activity there is possibly a separate software system managing data on the main subject areas (policies, customers, claims and premiums), so there may be four separate databases, one for each activity, with similar but
not identical structures.
When uploading data in the company data
warehouse, the data must first be restructured on
these major subject areas, integrating data on
customers, policies, claims and premiums from
each activity (as in the next slide).
Florin Radulescu, Note de curs
9 DMDW-10
Subject-oriented
10 DMDW-10
Subject-oriented
11 DMDW-10
Integrated
When preparing data for uploading in the data
warehouse, one of the most important activities is
the integration. Data is loaded from operational
sources and must be converted, summarized, re-
keyed, etc., before loading it in the data
warehouse.
The next slide illustrates some of the best-known actions performed for data integration:
Combine multiple encodings into a single one. For example, gender may be encoded as (0, 1), (m, f) or (male, female) in separate operational systems. If (m, f) is chosen as the data warehouse encoding, all data encoded using another convention must be converted.
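A tiny Python sketch of such an encoding unification step; the mapping below is hypothetical and only illustrates the idea:

```python
# Hypothetical mapping, for illustration only: unify the gender encodings
# (0/1, male/female, m/f) from different operational systems into the single
# (m, f) convention chosen for the data warehouse.
GENDER_MAP = {"0": "m", "1": "f", "male": "m", "female": "f", "m": "m", "f": "f"}

def unify_gender(value):
    """Convert a source-system gender code to the warehouse encoding."""
    return GENDER_MAP[str(value).strip().lower()]

print([unify_gender(v) for v in [0, "F", "male", "m"]])   # ['m', 'f', 'm', 'm']
```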
Florin Radulescu, Note de curs
12 DMDW-10
Integrated
13 DMDW-10
Integrated
Actions performed for data integration – cont.:
Choose a single unit of measure for each piece of information. For example, if length is measured in cm, inches, yards and meters in different operational systems, one unit must be chosen for the data warehouse and all other values must be converted.
If the same object has in some data sources different
values for the same attribute (e.g. description, name,
features, etc), these must be combined in a single
one.
If the same object has different keys in the source
systems it must be re-keyed to have a single key in
the data warehouse.
Florin Radulescu, Note de curs
14 DMDW-10
Non-volatile
In usual operational systems data is updated or deleted to reflect the current values. In a data warehouse data is never updated or deleted: after data is loaded, it stays there for
deleted: after data is loaded, it stays there for
future reporting, like a snapshot reflecting the
situation in a certain moment.
The next load operations, instead of changing
the old snapshots, are added as new
snapshots and so the data warehouse is a
sequence of such snapshots that coexist.
Florin Radulescu, Note de curs
15 DMDW-10
Non-volatile
16 DMDW-10
Non-volatile
17 DMDW-10
Time variant
As described above, a data warehouse
contains a sequence of snapshots, each
snapshot being actual at a given moment of
time.
Because a DW contains the whole history of
a company, it is possible to retrieve
information in a time horizon of 5-10 years or
even more.
Each unit of information is stamped or linked
with the moment during which that
information was accurate.
Florin Radulescu, Note de curs
18 DMDW-10
Time variant
19 DMDW-10
Time variant
In an operational system only the current
data is kept. For example, if a customer
changes address, in the operational system
the old address is replaced (updated) with the new one.
In the data warehouse all successive
addresses of a customer are kept.
Because date and time are very important in
analyzing data and reporting, the key
structure contains usually the date and
sometimes the time.
Florin Radulescu, Note de curs
20 DMDW-10
Why building a DW?
In [Kimball, Ross, 2002] there is a list of reasons for a
company to build its own data warehouse:
“We have mountains of data in this company, but we
can’t access it.”
“We need to slice and dice the data every which way.”
“You’ve got to make it easy for business people to get
at the data directly.”
“Just show me what is important.”
“It drives me crazy to have two people present the
same business metrics at a meeting, but with different
numbers.”
“We want people to use information to support more
fact-based decision making.”
Florin Radulescu, Note de curs
21 DMDW-10
Requirements for a DW
22 DMDW-10
Information must be easy accessible
23 DMDW-10
Information must be consistent
The process of feeding a data warehouse with data includes a preprocessing step, where data is assembled from many sources, cleansed and quality assured. Data is released (published) to the users only when it is fit for use.
As described earlier, an integration step is performed when data is loaded from operational sources, unifying encodings, units of measure,
keys, names and common values/features, etc.
Common definitions for the contents of the data
warehouse must be available for DW users.
Florin Radulescu, Note de curs
24 DMDW-10
Flexibility
25 DMDW-10
Security
26 DMDW-10
Decision support
The primary goal of implementing a data
warehouse in an organization is the decision
support
The ultimate output from a DW is the set of
decisions based on its content, analyzed and
presented in different ways to the decision
makers.
The original label for a data warehouse and
the tools around it was ‘decision support
system’.
Florin Radulescu, Note de curs
27 DMDW-10
Acceptance
The ultimate test for the success in implementing
a data warehouse is the acceptance test.
If the business community does not continue to
use it in the first six months after training, then the
system has failed the acceptance test, no matter how brilliant the technical solution is.
It is possible that users simply ignore it, because decisions can also be made without a decision support system.
The key points for user acceptance are simplicity and user friendliness.
Florin Radulescu, Note de curs
28 DMDW-10
Road Map
29 DMDW-10
ODS
The concept of Operational Data Store (ODS)
was also introduced by W.H. Inmon and its
definition, found in [Inmon 98] is the following:
An ODS is an integrated, subject-oriented,
volatile (including update), current-valued
structure designed to serve operational users
as they do high performance integrated
processing.
We can compare an ODS with a database
integrating data from multiple sources. Its
goal is to help analysis and reporting.
Florin Radulescu, Note de curs
30 DMDW-10
ODS vs. DW
31 DMDW-10
ODS features
According to Inmon, the main features of an ODS
are:
enablement of integrated, collective on-line
processing.
delivers consistent high transaction performance--
two to three seconds.
supports on-line update.
is integrated across many applications.
provides a foundation for collective, up-to-the-second views of the enterprise.
the ODS supports decision support processing.
Florin Radulescu, Note de curs
32 DMDW-10
Similarities DW - ODS
Subject-oriented data:
Before data is loaded in the ODS, it must first be
restructured on major subject areas (as in the
case of insurance company: integrating data on
customers, policies, claims and premiums from
each activity).
Integrated content:
Data is sourced from multiple operational systems
(sources), and the integration step includes, like in
DW case, cleaning, unifying encodings, re-keying,
removing redundancies, preserving integrity, etc.
Florin Radulescu, Note de curs
33 DMDW-10
Dissimilarities DW - ODS
34 DMDW-10
Road Map
35 DMDW-10
DW architecture
36 DMDW-10
DW architecture
The basic elements of a Data Warehouse environment
are:
Operational Source Systems. These are the source of
the data in the DW, and are placed outside of the data
warehouse
Data Staging Area. Here data is prepared
(transformed) for loading in the presentation area. This
area is not accessible to the regular user.
Data Presentation. This part is what regular users see
and consider to be a DW.
Data Access Tools. These tools are used for analyzing
and reporting. They provide the interface between the
user and the DW.
Florin Radulescu, Note de curs
37 DMDW-10
Data staging area
The data staging area (DSA) of a data
warehouse is compared in [Kimball, Ross, 2002]
with the kitchen of a restaurant. It is:
A storage area and
A set of processes performing the so-called
Extract-Transform-Load (ETL) operation:
Extract – Extracting data from Operational Source
Systems
Transform – Integrating data from all sources, as
described below
Load – Publishing data for users, meaning loading
data in the Data presentation area
Florin Radulescu, Note de curs
38 DMDW-10
Integration tasks
Dealing with synonyms: same data with different
name in different operational systems
Dealing with homonyms: the same name for different data
Unifying keys from different sources
Unifying encodings
Unifying units of measure and levels of detail
Dealing with different software platforms
Dealing with missing data
Dealing with different value ranges, etc.
Florin Radulescu, Note de curs
39 DMDW-10
Data staging area
40 DMDW-10
Main approaches
Storing data in a DW (so also in DSA) may be
done following two main approaches:
1. The normalized approach (supported by the work
of W.H. Inmon – see [Inmon 2002]
2. The dimensional approach (supported by the work
of Ralph Kimball – see [Kimball, Ross, 2002])
These approaches are not mutually exclusive, and
there are other approaches.
Dimensional approaches can involve normalizing
data to a degree.
This lesson is based on the dimensional approach
Florin Radulescu, Note de curs
41 DMDW-10
Normalized approach
In the normalized approach, data are
stored following database normalization
rules.
Tables are grouped by subject areas (data on
customers, policies, claims and premiums for
example).
The main advantage of this approach is that
loading data is straightforward because the
philosophy of structuring data is the same for
operational source systems and the data
warehouse.
Florin Radulescu, Note de curs
42 DMDW-10
Normalized approach
The main disadvantage of this approach is the
number of joins needed to obtain meaningful
information.
A regular user also needs a good knowledge of the data in the DW, as well as a training period for obtaining de-normalized tables from normalized ones.
Missing a join condition when performing a query may lead to Cartesian products instead of joins. In other words, a regular user may need assistance from a database specialist to perform usual operations.
Florin Radulescu, Note de curs
43 DMDW-10
Dimensional approach
44 DMDW-10
Dimensional approach
• Advantages of the dimensional approach
are:
– Data is easy to understand and easy to use, there is no need for assistance from a database specialist, and queries are solved fast.
– Data being de-normalized (or partially de-normalized), the number of joins needed for performing a query is lower than in the normalized approach.
– Joins between the fact table and its dimensions are easy to perform because the fact table contains surrogate keys for all involved dimension tables.
Florin Radulescu, Note de curs
45 DMDW-10
Dimensional approach
46 DMDW-10
Data presentation area
At the end of the ETL process prepared data is
loaded in the Data Presentation Area (DPA).
After that moment, data is available for users for
querying, reporting and other analytical applications.
Because regular users have access only to this area, they may consider the presentation area to be the data warehouse.
This area is structured as a series of integrated
data marts, each presenting the data from a
single business process.
Florin Radulescu, Note de curs
47 DMDW-10
Data presentation area
48 DMDW-10
Hypercube
49 DMDW-10
Data marts – Definition 1
[SQLServer 2005]:
A data mart is defined as a repository of data
gathered from operational data and other sources
that is designed to serve a particular community of
knowledge workers.
Data may derive from an enterprise-wide database
or data warehouse or be more specialized.
The emphasis of a data mart is on meeting the
specific demands of a particular group of
knowledge users in terms of analysis, content,
presentation, and ease-of-use.
Florin Radulescu, Note de curs
50 DMDW-10
Data marts – Definition 2
[Wikipedia] defines a data mart as a
structure / access pattern specific to data
warehouse environments, used to retrieve
client-facing data.
The data mart is a subset of the data
warehouse and is usually oriented to a
specific business line or team.
Whereas data warehouses have an
enterprise-wide depth, the information in data
marts pertains to a single department.
Florin Radulescu, Note de curs
51 DMDW-10
Data marts – Definition 2
In some deployments, each department or
business unit is considered the owner of its
data mart including all
the hardware, software and data. This enables each department to isolate the use, manipulation and development of its data.
In other deployments where conformed
dimensions are used, this business unit
ownership will not hold true for shared
dimensions like customer, product, etc.
Florin Radulescu, Note de curs
52 DMDW-10
DW vs. Data marts
Data warehouse:
Holds multiple subject areas
Holds very detailed information
Works to integrate all data sources
Does not necessarily use a dimensional
model but feeds dimensional models.
53 DMDW-10
DW vs. Data marts
Data mart:
Often holds only one subject area – for
example, Finance or Sales
May hold more summarized data (although it
may hold full detail)
Concentrates on integrating information from
a given subject area or set of source systems
Is built focused on a dimensional model using
a star schema.
Florin Radulescu, Note de curs
54 DMDW-10
Other data mart features
55 DMDW-10
Other data mart features
Data marts use common dimensions and
facts.
Kimball refers to them as ‘conformed’.
This means, for example, that the same date
dimension is used in all data marts and in all
star schemes of the DW, provided its meaning
is the same in all cases.
Because data marts use conformed
dimensions and facts, they can be combined
and used together.
Florin Radulescu, Note de curs
56 DMDW-10
So
57 DMDW-10
Examples
58 DMDW-10
Data access tools
59 DMDW-10
Ad-hoc query tools
Through this channel, the user obtains the raw
data that satisfies the conditions specified in
an ad-hoc query.
To use this channel, the user must have good
knowledge of the DW structure and of the
query language used.
This channel is intended for specialists and
experienced users.
Sometimes pre-built queries are available and
may be used.
Florin Radulescu, Note de curs
60 DMDW-10
Report writers
61 DMDW-10
Analytic applications
62 DMDW-10
Example: Interactive report
63 DMDW-10
Example: Dashboard
64 DMDW-10
Example: Scorecard
65 DMDW-10
Example: Other tools
66 DMDW-10
Modeling tools
67 DMDW-10
Summary
This course presented:
Some definitions of a data warehouse and a detailed
discussion based on Inmon's definition, explaining what
the four features of a DW mean: subject-oriented,
integrated, non-volatile and time-variant. Some reasons for
building a data warehouse were also discussed.
A definition of the concept of Operational Data Store, with a
parallel between ODS and DW.
A discussion about the architecture of a DW, presenting the
Data Staging Area, the Data Presentation Area and the Data
Access Tools, the main parts of such a construction.
Next week: Dimensional modeling
68 DMDW-10
References
[Inmon 2002] W.H. Inmon - Building The Data Warehouse. Third
Edition, Wiley & Sons, 2002
[Kimball, Ross, 2002] Ralph Kimball, Margy Ross - The Data
Warehouse Toolkit, Second Edition, Wiley & Sons, 2002
[CS680, 2004] Introduction to Data Warehouses, Drexel Univ. CS 680
Course notes, 2004 (page
https://www.cs.drexel.edu/~dvista/cs680/2.DW.Overview.ppt visited
2010)
[Wikipedia] Wikipedia, the free encyclopedia, en.wikipedia.org, visited
2009.
[SQLServer 2005] Dan Gallagher, Tim D. Nelson, and Steve Proctor,
Data mart, nov. 2005, Site:
http://searchsqlserver.techtarget.com/definition/data-mart, visited June
20, 2012
[Inmon, 98] W.H. Inmon - The Operational Data Store, July 1, 1998,
web page visited June 20, 2012: http://www.information-
management.com/issues/19980701/469-1.html
[Rainardi, 2008] Vincent Rainardi, Building a Data Warehouse with
Examples in SQL Server, Springer, 2008
Florin Radulescu, Note de curs
69 DMDW-10
Dimensional Modeling
2 DMDW-11
Facts and dimensions
3 DMDW-11
Facts
4 DMDW-11
Additive measures
5 DMDW-11
Semi-Additive
Semi-additive measures can be aggregated
across some dimensions but not all.
Typical examples are periodic measurements:
the account balance of a bank account or the
inventory level of a retail chain.
In the first case, an average may be computed to
obtain the average daily balance, but the sum of
the daily balances is not meaningful.
In the second case, the inventory level is additive
across product and warehouse but not across time:
the sum of yesterday's and today's inventory levels
for a given product is not a meaningful value.
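A minimal sketch of the bank-account case, assuming a hypothetical snapshot fact table account_balance(account_key, date_key, balance):

-- Balance is semi-additive: summing it across accounts on one day is meaningful,
-- but across time it must be averaged, not summed.
SELECT account_key, AVG(balance) AS avg_daily_balance
FROM account_balance
GROUP BY account_key;

SELECT date_key, SUM(balance) AS total_balance_on_day
FROM account_balance
GROUP BY date_key;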
Florin Radulescu, Note de curs
6 DMDW-11
Non-additive measures
Non-additive measures cannot be meaningfully
aggregated across any dimension.
A classical example is the unit price.
Considering a retail company, the sum of unit
prices along any dimension (product, customer,
location, etc.) is not meaningful.
For that reason, if such values can be computed
from additive measures, the non-additive
measures are not stored in the fact tables.
In our example, the unit price can always be
computed by dividing the total line value by the
quantity sold.
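A minimal sketch, assuming a hypothetical fact table pos_sales with the additive measures quantity_sold and total_line_amount:

-- The non-additive unit price is not stored; it is derived at query time.
SELECT product_key,
       SUM(total_line_amount) / NULLIF(SUM(quantity_sold), 0) AS average_unit_price
FROM pos_sales
GROUP BY product_key;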
Florin Radulescu, Note de curs
7 DMDW-11
Grain
8 DMDW-11
Dimensions
In [CS680, 2004], dimension tables are
characterized as follows:
Represent the who, what, where, when and how of
a measurement/artifact
Represent real-world entities, not business
processes
Give the context of a measurement (subject)
Example: in a retail company DW, the Sales fact
table can be linked with dimensions like Location
(Where), Time (When), Product (What), Customer
(Who), Sales Channel (How).
Florin Radulescu, Note de curs
9 DMDW-11
Dimensions
The Dimension Attributes are the columns of
the dimension table. [Wikipedia] lists some
desirable features for these attributes (a DDL
sketch follows the list):
Verbose – labels consisting of full words,
Descriptive,
Complete – no missing values,
Discretely valued – only one value per row in
the dimension table,
Quality assured – no misspellings, no
impossible values.
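A minimal DDL sketch of a dimension table following these guidelines; the Product columns, types and sizes are only illustrative:

-- Full-word, descriptive column names; NOT NULL keeps the attributes complete;
-- each attribute holds a single, discrete value per row.
CREATE TABLE dim_product (
  product_key               INTEGER PRIMARY KEY,
  sku_number                VARCHAR(20) NOT NULL,
  product_description       VARCHAR(100) NOT NULL,
  brand_description         VARCHAR(50) NOT NULL,
  category_description      VARCHAR(50) NOT NULL,
  package_type_description  VARCHAR(50) NOT NULL
);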
Florin Radulescu, Note de curs
10 DMDW-11
Star scheme
11 DMDW-11
Advantages
Each fact table is surrounded by several linked
dimension tables, as in Figure 1.
Because of its appearance, such a construction is
called a ‘star scheme’.
A star scheme has several advantages:
It is easy to understand; graphic representations
almost always have this advantage.
It provides better performance: data is de-normalized
in fact and dimension tables, so obtaining a query
result needs only the joins between the fact table and
the involved dimensions.
It is extensible: attributes and dimensions may be
added easily.
Florin Radulescu, Note de curs
12 DMDW-11
SQL Query
SELECT P.Name, SUM(S.Sales)
FROM Sales S JOIN Product P ON S.Product_Key = P.Product_Key
-- WHERE ...   (additional condition, e.g. a restriction on a dimension)
GROUP BY P.Name;
13 DMDW-11
Snow-flake schemes
14 DMDW-11
Example
15 DMDW-11
Road Map
16 DMDW-11
The four step approach
17 DMDW-11
The four step approach
18 DMDW-11
Select the business processes
19 DMDW-11
Select the business processes
20 DMDW-11
No duplicate data
This approach also ensures that the data warehouse
contains no duplicate data.
If a department-oriented approach is used to structure the
data warehouse, the same data may be used by several
departments and would have to be stored redundantly in the DW.
For example, inventory data is used for supply chain
management but also for production management in a car
factory.
A data warehouse organized by departmental structure
would duplicate the inventory data, but organizing it by
business processes avoids the redundancy: both departments –
supply management and production – will use the same data.
21 DMDW-11
No duplicate data
22 DMDW-11
Step 2: Declare the grain
Each line in a fact table corresponds to one grain of
our data warehouse. In step 2 of the dimensional design
process, the level of detail of these lines (the grain) must
be defined.
Thinking of a retail company with registered customers
(like Metro or Selgros), for the POS sales business
process a grain may be:
1. An individual line item on a customer's retail sales
ticket or invoice, as measured by a scanner device
(in that case the same item may appear on several
lines of the same ticket/invoice, because the quantity
was greater than one and each product was scanned
individually).
Florin Radulescu, Note de curs
23 DMDW-11
Step 2: Declare the grain
24 DMDW-11
Step 2: Declare the grain
2. The same significance as above but lines containing
the same part number are summarized in a single line.
Florin Radulescu, Note de curs
25 DMDW-11
Step 2: Declare the grain
3. A daily summary of a customer's sales tickets,
containing items and prices.
26 DMDW-11
Step 2: Declare the grain
27 DMDW-11
Step 2: Declare the grain
28 DMDW-11
Step 2: Declare the grain
Florin Radulescu, Note de curs
29 DMDW-11
Discussion
30 DMDW-11
Discussion
A key idea in choosing the granularity level is
emphasized in [Kimball, Ross, 2002]:
“Preferably you should develop dimensional models
for the most atomic information captured by a
business process. Atomic data is the most detailed
information collected; such data cannot be subdivided
further.”
and
“A data warehouse almost always demands data
expressed at the lowest possible grain of each
dimension not because queries want to see individual
low-level rows, but because queries need to cut
through the details in very precise ways.”
Florin Radulescu, Note de curs
31 DMDW-11
Atomic data features
Some features of atomic data listed in the
Kimball & Ross book are:
It is highly dimensional,
Being highly dimensional, the data may be drilled
in more ways,
The dimensional approach is favored by atomic data,
each extra dimension being easily added to the
star schemes,
It provides maximum analytic flexibility,
Detailed data allows more ad hoc queries,
Florin Radulescu, Note de curs
32 DMDW-11
Atomic data features
Features – cont.:
A low-level grain does not prevent also adding a
summary, high-level grain to the DW for speeding
up frequent queries and reports.
Note that declaring the grain is a critical step.
If the granularity choice later proves to be
wrong, the process must go back to step 2 to
re-declare the grain correctly, after which
steps 3 and 4 must be run again.
Florin Radulescu, Note de curs
33 DMDW-11
Step 3: Choose the dimensions
34 DMDW-11
Step 3: Choose the dimensions
35 DMDW-11
Step 3: Choose the dimensions
36 DMDW-11
Step 4: Identify the facts
Every line in a fact table must contain some attribute
values.
These attributes represent the measures assigned to
the business process and must be determined at this
step.
In the case of a star scheme containing data on POS
retail sales in a store chain, possible attributes of the
fact table are (a DDL sketch follows the list):
Quantity sold – additive value
Total line value amount – additive value
Line cost amount – additive value
Line profit amount – additive value
Unit price – not an additive value
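A minimal DDL sketch of such a fact table; names and types are illustrative, and the non-additive unit price is left out (see the discussion on the next slides):

-- Surrogate keys to the dimensions plus the additive measures.
CREATE TABLE pos_sales_fact (
  product_key        INTEGER NOT NULL,
  date_key           INTEGER NOT NULL,
  store_key          INTEGER NOT NULL,
  promotion_key      INTEGER NOT NULL,
  quantity_sold      INTEGER NOT NULL,        -- additive
  total_line_amount  DECIMAL(12,2) NOT NULL,  -- additive
  line_cost_amount   DECIMAL(12,2) NOT NULL,  -- additive
  line_profit_amount DECIMAL(12,2) NOT NULL   -- additive (value amount minus cost amount)
);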
Florin Radulescu, Note de curs
37 DMDW-11
Discussion
38 DMDW-11
Discussion
Additive measures are preferred. So the unit price,
which is not an additive value, will be removed,
because it can be computed by dividing the total
line value amount by the quantity sold.
Redundant data can be stored in a fact table if
it is additive or semi-additive.
For example, the line profit amount may be computed
by subtracting the cost amount from the value
amount.
Storing such redundant values is allowed in order
to speed up processing.
Florin Radulescu, Note de curs
39 DMDW-11
Road Map
40 DMDW-11
Modeling example
A retail sales modeling example is presented in
[Kimball, Ross, 2002] for a store chain.
Each store has several departments and sells
several tens of thousands of items (called stock
keeping units – SKUs).
Each SKU has either a universal product code
imprinted by the manufacturer or a local code for
bulk goods (for example agricultural products –
vegetables and fruits, meat, bakery, etc.).
A package variation of a product is a different SKU
and consequently has a different code.
Florin Radulescu, Note de curs
41 DMDW-11
Modeling example
42 DMDW-11
Step 1
43 DMDW-11
Step 2
44 DMDW-11
Step 3
45 DMDW-11
Step 3 – star scheme
The POS retail sales star scheme has the fact table in the
center, linked to five dimension tables:
POS_Sales (fact table): Product_Key (FK), Date_Key (FK),
Store_Key (FK), SP_Key (FK), Promotion_Key (FK),
Ticket_number (FK), fact table attributes.
Product: Product_Key (PK), Product attributes
Date: Date_Key (PK), Date attributes
Store: Store_Key (PK), Store attributes
Salesperson: SP_Key (PK), SP attributes
Promotion: Promotion_Key (PK), Promotion attributes
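A minimal query sketch over this star scheme; the attribute names used inside the dimensions and the fact measure (Calendar_Year, Promotion_Name, Total_Line_Amount) are illustrative:

-- Sales per promotion for one year: the fact table joins two dimensions on surrogate keys.
SELECT pr.Promotion_Name, SUM(f.Total_Line_Amount) AS total_sales
FROM POS_Sales f
JOIN "Date" d ON d.Date_Key = f.Date_Key          -- quoted because DATE is a reserved word
JOIN Promotion pr ON pr.Promotion_Key = f.Promotion_Key
WHERE d.Calendar_Year = 2011
GROUP BY pr.Promotion_Name;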
46 DMDW-11
Step 3 - details
47 DMDW-11
Degenerate dimensions
48 DMDW-11
Degenerate dimensions
49 DMDW-11
Step 4: Identify the facts
50 DMDW-11
Star scheme again
Promotion
Promotion_Key (PK)
Promotion attributes
51 DMDW-11
Discussion
52 DMDW-11
Date attributes example
Date dimension attributes:
Date Key (PK), Date, Full Date Description, Day of Week,
Day Number in Epoch, Week Number in Epoch, Month Number in Epoch,
Day Number in Calendar Month, Day Number in Calendar Year,
Day Number in Fiscal Month, Day Number in Fiscal Year,
Last Day in Week Indicator, Last Day in Month Indicator,
Calendar Week Ending Date, Calendar Week Number in Year,
Calendar Month Name, Calendar Month Number in Year,
Calendar Year-Month (YYYY-MM), Calendar Quarter,
Calendar Year-Quarter, Calendar Half Year, Calendar Year,
Fiscal Week, Fiscal Week Number in Year, Fiscal Month,
Fiscal Month Number in Year, Fiscal Year-Month, Fiscal Quarter,
Fiscal Year-Quarter, Fiscal Half Year, Fiscal Year,
Holiday Indicator, Weekday Indicator, Selling Season,
Major Event, SQL Date Stamp.
53 DMDW-11
Product attributes example
Product dimension attributes:
Product Key (PK), SKU Number (Natural Key), Category Description,
Package Type Description, Fat Content, Weight, Storage Type,
Shelf Width, Shelf Depth, Product Description, Brand Description,
Department Description, Package Size, Diet Type,
Weight Units of Measure, Shelf Life Type, Shelf Height.
54 DMDW-11
Store attributes example
Store dimension attributes:
Store Name, Store Number (Natural Key), Store Street Address,
Store City, Store County, Store State, Store Zip Code,
Store Manager, Store District, Store Region, Floor Plan Type,
Photo Processing Type, Financial Service Type,
Selling Square Footage, Total Square Footage,
First Open Date, Last Remodel Date.
55 DMDW-11
Promotion attributes example
Promotion dimension attributes:
Promotion Key (PK), Promotion Name, Price Reduction Type,
Promotion Media Type, Ad Type, Display Type, Coupon Type,
Ad Media Name, Display Provider, Promotion Cost,
Promotion Begin Date, Promotion End Date.
56 DMDW-11
Summary
This course presented the dimensional model
of data warehouses:
Definitions for facts and dimensions, and for the
star scheme and the snow-flake scheme.
The four steps of dimensional modeling: select
the business process, declare the grain, choose
the dimensions and identify the facts.
A modeling example for a store chain, with an
illustration of the attributes in fact and dimension
tables.
Next week: Data warehouse case study
Florin Radulescu, Note de curs
57 DMDW-11
References
[CS680, 2004] Introduction to Data Warehouses, Drexel Univ. CS
680 Course notes, 2004 (page
https://www.cs.drexel.edu/~dvista/cs680/2.DW.Overview.ppt
visited 2010)
[Kimball, Ross, 2002] Ralph Kimball, Margy Ross - The Data
Warehouse Toolkit, Second Edition, Wiley & Sons, 2002
[Wikipedia] Wikipedia, the free encyclopedia, en.wikipedia.org
58 DMDW-11
Dimensional Modeling – part 2
Case Studies
Prof.dr.ing. Florin Radulescu
Universitatea Politehnica din Bucureşti
Road Map
2 DMDW-12
Types of Dimensional Models
❑Reference [3] discusses five distinct types of
Dimensional Models. In the next slides, a
Dimensional Model is either a star scheme or a
data mart – several interconnected star
schemes:
1. Accumulating Snapshot Tables
2. Aggregate Tables
3. Fact Tables
4. Factless Fact Tables
5. Snapshot Tables
Florin Radulescu, Note de curs
3 DMDW-12
Types of Dimensional Models
4 DMDW-12
Accumulating Snapshot Tables
5 DMDW-12
Accumulating Snapshot Tables
6 DMDW-12
Accumulating Snapshot Tables
❑ This consistent updating of accumulating snapshot
fact rows is unique among fact tables.
❑ In addition to the date foreign keys associated with
each critical process step, accumulating snapshot
fact tables contain foreign keys for other
dimensions and optionally contain degenerate
dimensions.
❑ They often include numeric lag measurements
consistent with the grain, along with milestone
completion counters.” (source: [2])
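A minimal DDL sketch of an accumulating snapshot fact table for a hypothetical order fulfillment pipeline; all table and column names are illustrative:

-- One row per order line, updated as the order reaches each milestone;
-- one date key per step, plus lag measurements consistent with the grain.
CREATE TABLE order_fulfillment_fact (
  order_number         VARCHAR(20) NOT NULL,   -- degenerate dimension
  product_key          INTEGER NOT NULL,
  order_date_key       INTEGER NOT NULL,
  shipment_date_key    INTEGER,                -- filled in when the milestone is reached
  delivery_date_key    INTEGER,
  order_to_ship_lag    INTEGER,                -- days between milestones
  ship_to_delivery_lag INTEGER,
  quantity_ordered     INTEGER NOT NULL
);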
7 DMDW-12
Accumulating Snapshot Tables
8 DMDW-12
9 DMDW-12
Accumulating Snapshot Tables
10 DMDW-12
Types of Dimensional Models
11 DMDW-12
Aggregate Tables
12 DMDW-12
Aggregate Tables
13 DMDW-12
Aggregate Tables
14 DMDW-12
Aggregate Tables
15 DMDW-12
Aggregate Tables
16 DMDW-12
Florin Radulescu, Note de curs
17 DMDW-12
Aggregate Tables
18 DMDW-12
Aggregate Tables – Load example
alter session enable parallel dml;
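-- Illustrative parallel aggregate load (sales_month_agg and pos_sales are placeholder
-- names): a direct-path parallel insert builds the aggregate before the commit.
insert /*+ APPEND PARALLEL(sales_month_agg) */ into sales_month_agg
select product_key, month_key, sum(total_line_amount) as total_sales
from pos_sales
group by product_key, month_key;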
commit;
Florin Radulescu, Note de curs
19 DMDW-12
Types of Dimensional Models
20 DMDW-12
Fact Tables
21 DMDW-12
Fact Tables
22 DMDW-12
Fact Tables
23 DMDW-12
Types of Dimensional Models
24 DMDW-12
Factless Fact Tables
Factless Fact tables are defined in [2] as
follows:
❑“Although most measurement events capture
numerical results, it is possible that the event
merely records a set of dimensional entities
coming together at a moment in time.
❑For example, an event of a student attending
a class on a given day may not have a
recorded numeric fact, but a fact row with
foreign keys for calendar day, student,
teacher, location, and class is well-defined.
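A minimal DDL sketch of such a factless fact table for the class-attendance example; names are illustrative:

-- One row per attendance event: only foreign keys, no numeric measure.
CREATE TABLE class_attendance_fact (
  date_key     INTEGER NOT NULL,
  student_key  INTEGER NOT NULL,
  teacher_key  INTEGER NOT NULL,
  location_key INTEGER NOT NULL,
  class_key    INTEGER NOT NULL
);
-- Counting rows answers questions such as how many students attended a given class each month.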
Florin Radulescu, Note de curs
25 DMDW-12
Factless Fact Tables
26 DMDW-12
Factless Fact Tables
27 DMDW-12
Factless Fact Tables
28 DMDW-12
Factless Fact Tables
29 DMDW-12
30 DMDW-12
Types of Dimensional Models
31 DMDW-12
Snapshot Fact Tables
32 DMDW-12
Snapshot Fact Tables
33 DMDW-12
Snapshot Fact Tables
34 DMDW-12
35 DMDW-12
Snapshot Fact Tables
36 DMDW-12
Snapshot Fact Tables
37 DMDW-12
Road Map
38 DMDW-12
Surrogate keys
39 DMDW-12
Surrogate keys
40 DMDW-12
Surrogate keys
❑In addition, natural keys for a dimension may
be created by more than one source system,
and these natural keys may be incompatible
or poorly administered.
❑The DW/BI system needs to claim control of
the primary keys of all dimensions; rather
than using explicit natural keys or natural
keys with appended dates, you should create
anonymous integer primary keys for every
dimension.
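A minimal DDL sketch of a dimension with an anonymous integer surrogate key kept next to the natural key; the identity-column syntax and all names are illustrative:

-- The surrogate key is a meaningless integer generated inside the DW,
-- independent of the natural keys coming from the source systems.
CREATE TABLE dim_customer (
  customer_key        INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- surrogate key
  customer_natural_id VARCHAR(30) NOT NULL,                              -- natural key from the source
  customer_name       VARCHAR(100) NOT NULL,
  customer_city       VARCHAR(50) NOT NULL
);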
Florin Radulescu, Note de curs
41 DMDW-12
Surrogate keys
42 DMDW-12
Surrogate keys for fact tables
43 DMDW-12
Surrogate keys for fact tables
44 DMDW-12
Surrogate keys for fact tables
45 DMDW-12
Surrogate keys for fact tables
46 DMDW-12
Surrogate keys summary
❑Surrogate keys:
✓Are mandatory for dimension tables
✓May be used in fact tables
✓There are cases when using surrogate keys
for fact tables is mandatory due to the data
processing characteristics
47 DMDW-12
Road Map
48 DMDW-12
Conformed Dimensions
49 DMDW-12
Conformed Dimensions
50 DMDW-12
Conformed Dimensions
51 DMDW-12
Conformed Dimensions
❑Simply put, if multiple star schemes use the
same dimension table, there will be a single
table in the data warehouse for that
dimension, shared by all star schemes that
need it (see the sketch below).
❑For this reason, the grain must be the same
in all cases.
❑Conformed Dimensions are therefore very
important and are frequently Reference Data
(such as Calendars) or Master Data (such as
Products).
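A minimal sketch of combining two star schemes through a conformed Date dimension; pos_sales, inventory_snapshot and dim_date are hypothetical names:

-- Each fact table is aggregated separately to the shared Date grain,
-- then the results are combined on the conformed dimension.
WITH monthly_sales AS (
  SELECT d.calendar_month, SUM(s.total_line_amount) AS sales
  FROM pos_sales s JOIN dim_date d ON d.date_key = s.date_key
  GROUP BY d.calendar_month
),
monthly_inventory AS (
  SELECT d.calendar_month, SUM(i.quantity_on_hand) AS inventory
  FROM inventory_snapshot i JOIN dim_date d ON d.date_key = i.date_key
  GROUP BY d.calendar_month
)
SELECT ms.calendar_month, ms.sales, mi.inventory
FROM monthly_sales ms JOIN monthly_inventory mi ON mi.calendar_month = ms.calendar_month;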
Florin Radulescu, Note de curs
52 DMDW-12
Conformed Dimensions
53 DMDW-12
Conformed Dimensions
54 DMDW-12
55 DMDW-12
Conformed Dimensions
56 DMDW-12
Conformed Dimensions
57 DMDW-12
Summary
58 DMDW-12
References
1. Kimball group website: https://www.kimballgroup.com/
2. Kimball Group Dimensional Modeling Techniques:
https://www.kimballgroup.com/data-warehouse-business-
intelligence-resources/kimball-techniques/dimensional-
modeling-techniques/
3. Barry Williams - Dimensional Modelling by Example,
http://www.databaseanswers.org/downloads/Dimensional_M
odelling_by_Example.pdf visited April 2020
4. http://etutorials.org/SQL/oracle+dba+guide+to+data+wareho
using+and+star+schemas/Chapter+7.+Implementing+Aggre
gates/Aggregation+by+Itself/
5. https://www.kimballgroup.com/2006/07/design-tip-81-fact-
table-surrogate-key/
6. https://www.kimballgroup.com/2009/05/the-10-essential-
rules-of-dimensional-modeling/
Florin Radulescu, Note de curs
59 DMDW-12