
1. Introduction to Data Mining

Prof.dr.ing. Florin Radulescu


Grading

 60% during the semester:
   10% Course activity (course attendance)
   20% Midterm exam (questions with multiple choice answers)
   30% Project (project attendance, algorithm presentation, project delivery)
 40% Final exam (questions with multiple choice answers)

Course site: http://acs.curs.pub.ro
Florin Radulescu, Course 1

2 DM, DMDW
Road Map

 What is data mining


 Steps in data mining process
 Data mining methods and subdomains
 Summary

Florin Radulescu, Course 1

3 DM, DMDW
Definition ([Liu 11])

Data mining is also called Knowledge Discovery in Databases (KDD).
It is commonly defined as the process of discovering useful patterns or
knowledge from data sources, e.g., databases, texts, images, the web, etc.
The patterns must be valid, potentially useful and understandable.

Florin Radulescu, Course 1

4 DM, DMDW
Definition ([Ullman 09, 10])

Discovery of useful, possibly unexpected, patterns in data.
Discovery of “models” for data:
– Statistical modeling
– Machine learning
– Computational approaches to modeling
– Summarization
– Feature extraction
Florin Radulescu, Course 1

5 DM, DMDW
Definition ([Wikipedia])
Data mining (the analysis step of the
"Knowledge Discovery in Databases" process,
or KDD), an interdisciplinary subfield
of computer science, is the computational
process of discovering patterns in large data
sets ("big data") involving methods at the
intersection of artificial intelligence, machine
learning, statistics, and database systems.

Florin Radulescu, Course 1

6 DM, DMDW
Definition ([Wikipedia])
The overall goal of the data mining process is to
extract information from a data set and transform
it into an understandable structure for further
use.
Aside from the raw analysis step, it involves
database and data management aspects, data
preprocessing, model and inference
considerations, interestingness metrics,
complexity considerations, post-processing of
discovered structures, visualization, and online
updating.
Florin Radulescu, Course 1

7 DM, DMDW
Definition ([Kimball, Ross 02])
 A class of undirected queries, often against the most
atomic data, that seek to find unexpected patterns in the
data.
 The most valuable results from data mining are
clustering, classifying, estimating, predicting, and finding
things that occur together.
 There are many kinds of tools that play a role in data
mining, including decision trees, neural networks,
memory- and case-based reasoning tools, visualization
tools, genetic algorithms, fuzzy logic, and classical
statistics.
 Generally, data mining is a client of the data warehouse.
Florin Radulescu, Course 1

8 DM, DMDW
Conclusions
 The data mining process converts data into valuable
knowledge that can be used for decision support
 Data mining is a collection of data analysis
methodologies, techniques and algorithms for
discovering new patterns
 Data mining is used for large data sets
 Data mining process is automated (no need for human
intervention)
 Data mining and Knowledge Discovery in Databases
(KDD) are considered by some authors to be the same
thing. Other authors list data mining as the analysis step
in the KDD process - after data cleaning and
transformation and before results visualization /
evaluation.
Florin Radulescu, Course 1

9 DM, DMDW
Success stories (1)
Some early success stories in using data mining (from
[Ullman 03]):
• Decision trees constructed from bank-loan histories to
produce algorithms to decide whether to grant a loan.
• Patterns of traveler behavior mined to manage the sale
of discounted seats on planes, rooms in hotels, etc.
• “Diapers and beer”: the observation that customers buying
diapers are more likely than average to also buy beer allowed
supermarkets to place beer and diapers nearby, knowing that many
customers would walk between them. Placing potato chips between
them increased sales of all three items.
Florin Radulescu, Course 1

10 DM, DMDW
Success stories (2)
• Skycat and Sloan Sky Survey: clustering sky objects by
their radiation levels in different bands allowed
astronomers to distinguish between galaxies, nearby
stars, and many other kinds of celestial objects.
• Comparison of the genotype of people with/without a
condition allowed the discovery of a set of genes that
together account for many cases of diabetes. This sort of
mining will become much more important as the human
genome is constructed.

Florin Radulescu, Course 1

11 DM, DMDW
What is not Data Mining
 Find a certain person in an employee database
 Compute the minimum, maximum, sum, count or
average values based on table/tables columns
 Use a search engine to find your name occurrences on
the web

Florin Radulescu, Course 1

12 DM, DMDW
DM software (1)
In ([Mikut, Reischl 11]) DM software programs are classified in 9
categories:
 Data mining suites (DMS) focus on data mining and include
numerous methods and support feature tables and time series.
Examples:
 Commercial: IBM SPSS Modeler, SAS Enterprise Miner, DataEngine,
GhostMiner, Knowledge Studio, NAG Data Mining Components,
STATISTICA
 Open source: RapidMiner
 Business intelligence packages (BIs) include basic data mining
functionality (statistical methods in business applications); they are
often restricted to feature tables and time series, but large feature
tables are supported. Examples:
 Commercial: IBM Cognos 8 BI, Oracle Data Mining, SAP NetWeaver
Business Warehouse, Teradata Database, DB2 Data Warehouse from
IBM
 Open source: Pentaho
Florin Radulescu, Course 1

13 DM, DMDW
DM software (2)
 Mathematical packages (MATs) provide a large and extendable set
of algorithms and visualization routines. Examples:
 Commercial: MATLAB, R-PLUS
 Open source: R, Kepler
 Integration packages (INTs) are extendable bundles of many
different open-source algorithms, either as:
 Stand-alone software (KNIME, the GUI version of WEKA, KEEL, and
TANAGRA), or
 Larger extension packages for tools of the MAT type
 Extensions (EXT) are smaller add-ons for other tools such as
Excel, Matlab, R, with limited but quite useful functionality.
Examples:
 Artificial neural networks for Excel (Forecaster XL and XLMiner)
 MATLAB (Matlab Neural Networks Toolbox).
Florin Radulescu, Course 1

14 DM, DMDW
DM software (3)
 Data mining libraries (LIBs) implement data mining methods as a
bundle of functions and can be embedded in other software tools
using an Application Programming Interface. Examples: Neurofusion
for C++, WEKA, MLC++, JAVA Data Mining Package, LibSVM
 Specialties (SPECs) are similar to DMS tools, but implement only
one special family of methods such as artificial neural networks.
Examples: CART, Bayesia Lab, C5.0, WizRule, Rule Discovery
System, MagnumOpus, JavaNNS, Neuroshell, NeuralWorks Predict,
RapAnalyst.
 Research tools (RES) are usually the first implementations of new
algorithms, with restricted graphical support and without automation
support. RES tools are mostly open source. WEKA and RapidMiner
started in this category.
 Solutions (SOLs) describe a group of tools that are customized to
narrow application fields. Examples: text mining: GATE; image
processing: ITK, ImageJ; drug discovery: Molegro Data Modeler.

Florin Radulescu, Course 1

15 DM, DMDW
Communities involved

The most important communities involved in data mining (figure: a
diagram with DATA MINING at the center, surrounded by STATISTICS,
DATABASE SYSTEMS, AI, CLUSTERING and VISUALIZATION).

Florin Radulescu, Course 1

16 DM, DMDW
Road Map

 What is data mining


 Steps in data mining process
 Data mining methods and subdomains
 Summary

Florin Radulescu, Course 1

17 DM, DMDW
Data mining steps (1)
1. Data collection: Data gathering from existing
databases or (for Internet documents) from Web
crawling.
2. Data preprocessing, including:
– Data cleaning: replace (or remove) missing values, smooth
noisy data, remove or just identify outliers, remove
inconsistencies.
– Data integration: integration of data from multiple sources, with
possible different data types and structures and also handling
of duplicate or inconsistent data.
– Data transformation: data normalization (or standardization),
summarizations, generalization, new attributes construction,
etc.

Florin Radulescu, Course 1

18 DM, DMDW
Data mining steps (2)
2. Data preprocessing (cont):
– Data reduction (also called feature extraction): not all the
attributes are necessary for the particular data mining process we
want to perform. Only relevant attributes are selected for further
processing, reducing the total size of the dataset (and the time
needed for running the algorithm).
– Discretization: some algorithms work only on discrete data. For
that reason the values of continuous attributes must be replaced
with discrete ones from a limited set. One example is replacing age
(a number) with an attribute having only three values: Young,
Middle-aged and Old.

Florin Radulescu, Course 1

19 DM, DMDW
Data mining steps (3)
3. Pattern extraction and discovery. This is the stage where the
data mining algorithm is used to obtain the result. Some authors
consider that data mining is restricted to only this step, the whole
process being called KDD.
4. Visualization: because data mining extracts hidden properties /
information from data, it is necessary to visualize the results for a
better understanding and evaluation. Visualization is also needed
for the input data.
5. Evaluation of results: not everything output by a data mining
algorithm is a valuable fact or piece of information. Some results
are mere statistical truths and others are not interesting / useful for
our activity. Expert judgment is necessary in evaluating the results.
Florin Radulescu, Course 1

20 DM, DMDW
Bonferroni principle (1)
A piece of information discovered by a ‘data mining’ process may be
only a statistical truth. Example (from [Ullman 03]):
 In the 1950s David Rhine, a parapsychologist, tested students in
order to find out whether they had extrasensory perception (ESP).
 He asked them to guess the color of 10 successive cards – red or
black. The result was that about 1/1000 of them guessed all 10
cards (he declared they had ESP).
 Re-testing only these students, he found that they had ‘lost’ their
ESP after being told they had this ability.
 David Rhine did not realize that the probability of guessing 10
successive cards is 1/1024 = 1/2^10, because the probability for
each of these 10 cards is 1/2 (red or black), so with 1000 students
roughly one is expected to guess all 10 cards by pure chance.
Florin Radulescu, Course 1

21 DM, DMDW
Bonferroni principle (2)
 Such results may be included in the output of a data mining
algorithm but must be recognized as statistical truths and not as
real data mining output.
 This fact is also the subject of the Bonferroni principle, which can
be summarized as below:

• if your method of finding significant items returns significantly
more items than you would expect in the actual population, you can
assume most of the items you find with it are bogus [rationalwiki.org]

Florin Radulescu, Course 1

22 DM, DMDW
Road Map

 What is data mining


 Steps in data mining process
 Data mining methods and subdomains
 Summary

Florin Radulescu, Course 1

23 DM, DMDW
Method types
 Prediction methods. These methods use some
variables to predict the values of other variables. A
good example for that category is classification. Based
on known, labeled data, classification algorithms build
models that can be used for classifying new, unseen
data.
 Description methods. Algorithms in this category find patterns that
describe the inner structure of the dataset. For example, clustering
algorithms find groups of similar objects in a dataset (called
clusters) and possibly also isolated objects, far away from any
cluster, called outliers.

Florin Radulescu, Course 1

24 DM, DMDW
Algorithms

Prediction algorithm types:


Classification
Regression
Deviation detection
Description algorithm types:
Clustering
Association rule discovery
Sequential pattern discovery
Florin Radulescu, Course 1

25 DM, DMDW
Classification
Input:
• A set of k classes C = {c1, c2, …, ck}
• A set of n labeled items D = {(d1, ci1), (d2, ci2), …, (dn, cin)}. The
items are d1, …, dn, each item dj being labeled with a class cij ∈ C.
D is called the training set.
• For the calibration of some algorithms a validation set is also
required. This validation set contains labeled items not included in
the training set.
Output:
• A model or method for classifying new items (a classifier). The set
of new items that will be classified using the model/method is called
the test set.
Florin Radulescu, Course 1

26 DM, DMDW
Example
 Let us consider a medical set of items where each item
is a patient of a hospital emergency unit (RO: UPU).
 There are 5 classes, representing maximum waiting time
categories: C0, C10, C30, C60 and C120, Ck meaning that the
patient waits at most k minutes.
 We may represent these data in tabular format
 The output of a classification algorithm using this training
set may be for example a decision tree or a set of
ordered rules.
 The model may be used to classify future patients and
assign a waiting time label to them

Florin Radulescu, Course 1

27 DM, DMDW
Emergency unit training set

Name     | Vital | Danger    | 0 resources | 1 resource | >1 resources | >1 resources needed and  | Waiting time
(or ID)  | risk? | if waits? | needed      | needed     | needed       | vital functions affected | (class label)
---------|-------|-----------|-------------|------------|--------------|--------------------------|--------------
John     | Yes   | Yes       | No          | Yes        | No           | No                       | C0
Maria    | No    | Yes       | No          | No         | Yes          | No                       | C10
Nadia    | Yes   | Yes       | Yes         | No         | No           | No                       | C0
Omar     | No    | No        | No          | No         | Yes          | Yes                      | C30
Kiril    | No    | No        | No          | Yes        | No           | Yes                      | C60
Denis    | No    | No        | No          | No         | Yes          | No                       | C10
Jean     | No    | No        | Yes         | Yes        | No           | No                       | C120
Patricia | Yes   | Yes       | No          | No         | Yes          | Yes                      | C60

Florin Radulescu, Course 1

28 DM, DMDW
Result: decision tree

• The result for the example is a decision tree (shown as a figure on
the original slide, not reproduced here).
• Using it, a new patient

Felix | Yes | Yes | No | No | No | Yes | ?????

will be classified as C0.
Florin Radulescu, Course 1

29 DM, DMDW
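The following sketch is not part of the original slides: it shows how such a
classifier could be trained from the table above, assuming scikit-learn as
the tool and a 1/0 encoding of the Yes/No answers. The tree learned here is
not necessarily the exact tree shown on the slide.

```python
# Minimal sketch: training a decision tree on the emergency-unit table above.
# Yes/No answers are encoded as 1/0; scikit-learn is an assumed tool choice.
from sklearn.tree import DecisionTreeClassifier

# Columns: vital risk?, danger if waits?, 0 resources, 1 resource,
#          >1 resources, >1 resources and vital functions affected
X = [[1, 1, 0, 1, 0, 0],   # John     -> C0
     [0, 1, 0, 0, 1, 0],   # Maria    -> C10
     [1, 1, 1, 0, 0, 0],   # Nadia    -> C0
     [0, 0, 0, 0, 1, 1],   # Omar     -> C30
     [0, 0, 0, 1, 0, 1],   # Kiril    -> C60
     [0, 0, 0, 0, 1, 0],   # Denis    -> C10
     [0, 0, 1, 1, 0, 0],   # Jean     -> C120
     [1, 1, 0, 0, 1, 1]]   # Patricia -> C60
y = ['C0', 'C10', 'C0', 'C30', 'C60', 'C10', 'C120', 'C60']

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# The new patient Felix (Yes, Yes, No, No, No, Yes) from the slide above;
# the slide's tree assigns C0, the tree learned here may or may not agree.
print(clf.predict([[1, 1, 0, 0, 0, 1]]))
```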
Regression (1)
Regression is related to statistics.
Meaning: predicting the value of a given continuous variable based
on the values of other variables, assuming a linear or nonlinear
model of dependency ([Tan, Steinbach, Kumar 06]).
Used in prediction and forecasting - its use overlaps with machine
learning.
Regression analysis is also used to understand the relationships
between the independent variables and the dependent variable,
and can sometimes be used to infer causal relationships between
them.
Florin Radulescu, Course 1

30 DM, DMDW
Regression (2)

There are many types of regression. For


example, Wikipedia lists:
Linear regression model
Simple linear regression
Logistic regression
Nonlinear regression
Nonparametric regression
Robust regression
Stepwise regression
Florin Radulescu, Course 1

31 DM, DMDW
Example

Linear regression example


(from http://en.wikipedia.org/wiki/File:Linear_regression.svg)

Florin Radulescu, Course 1

32 DM, DMDW
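A minimal sketch of simple (one-variable) linear regression fitted by least
squares, added for illustration; the data points and names below are
invented and no particular library is assumed.

```python
# Minimal sketch: simple linear regression fitted by least squares.
# The data points are invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = covariance(x, y) / variance(x); intercept from the means
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(f"y = {slope:.3f} * x + {intercept:.3f}")
print("prediction for x = 6:", slope * 6 + intercept)
```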
Deviation detection
 Deviation detection or anomaly detection means discovering
significant deviation from the normal behavior. Outliers are a
significant category of abnormal data.
 Deviation detection can be used in many circumstances:
 Data mining algorithm running stage: often such information may
be important for business decisions and scientific discovery.
 Auditing: such information can reveal problems or malpractice.
 Fraud detection in a credit card system: fraudulent claims often
carry inconsistent information that can reveal fraud cases.
 Intrusion detection in a computer network may rely on abnormal
data.
 Data cleaning (part of data preprocessing): such information can
be detected and possible mistakes may be corrected in this
stage.

Florin Radulescu, Course 1

33 DM, DMDW
Deviation detection techniques
Distance based techniques (example: k-nearest
neighbor).
One Class Support Vector Machines.
Predictive methods (decision trees, neural
networks).
Cluster analysis based outlier detection.
Pointing at records that deviate from association
rules
Hotspot analysis

Florin Radulescu, Course 1

34 DM, DMDW
Algorithms

Prediction algorithm types:


Classification
Regression
Deviation Detection
Description algorithm types:
Clustering
Association Rule Discovery
Sequential Pattern Discovery
Florin Radulescu, Course 1

35 DM, DMDW
Clustering
Input:
 A set of n objects D = {d1, d2, …, dn} (called usually points).
The objects are not labeled and there is no set of class labels
defined.
 A distance function (dissimilarity measure) that can be used to
compute the distance between any two points. Low valued
distance means ‘near’, high valued distance means ‘far’.
 Some algorithms also need a predefined value for the number
of clusters in the produced result.
Output:
 A set of object (point) groups called clusters, where points in the
same cluster are close to one another and points from different
clusters are far from one another, considering the distance function.

Florin Radulescu, Course 1

36 DM, DMDW
Example
 Having a set of points in a 2-dimensional space, find the natural
clusters formed by these points.
 (Figure: the same point set before clustering - INITIAL - and with
the clusters marked - AFTER CLUSTERING.)
Source: http://en.wikipedia.org/wiki/File:Cluster-1.svg, http://en.wikipedia.org/wiki/File:Cluster-2.svg

Florin Radulescu, Course 1

37 DM, DMDW
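A minimal clustering sketch, assuming scikit-learn's k-means, a handful of
invented 2-D points and k = 2; the course does not prescribe a particular
clustering algorithm or tool.

```python
# Minimal sketch: grouping 2-D points into clusters with k-means.
# The points and the choice of k = 2 are assumptions made for illustration.
from sklearn.cluster import KMeans

points = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],     # one natural group
          [8.0, 8.2], [8.1, 7.9], [7.8, 8.1]]     # another natural group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster id assigned to each point
print(km.cluster_centers_)  # the two cluster centers
```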
Association Rule Discovery
Let us consider:
A set of m items I = {i1, i2, …, im}.
A set of n transactions T = {t1, t2, …, tn}, each transaction
containing a subset of I, so if tk ∈ T then tk = {ik1, ik2, …, ikj},
where j depends on k.
Then:
A rule is a construction X → Y where X and Y are itemsets.
Florin Radulescu, Course 1

38 DM, DMDW
Association Rule Discovery
 The support of a rule is the number / proportion of transactions
containing the union of the left and right parts of the rule (and is
equal to the support of this union as an itemset):
support(X → Y) = support(X ∪ Y)
 The confidence of a rule is the proportion of transactions
containing Y within the set of transactions containing X:
confidence(X → Y) = support(X ∪ Y) / support(X)
 We accept a rule as a valid one if the support and the confidence
of the rule are at least equal to some given thresholds.
Florin Radulescu, Course 1

39 DM, DMDW
Association Rule Discovery
Input:
 A set of m items I = {i1, i2, …, im}.
 A set of n transactions T = {t1, t2, …, tn}, each transaction
containing a subset of I, so if tk ∈ T then tk = {ik1, ik2, …, ikj} where
j depends on k.
 A threshold s for the support, given either as a percent or as an
absolute value. If an itemset X ⊆ I is part of w transactions then w is
the support of X. If w >= s then X is called a frequent itemset.
 A second threshold c for rule confidence.
Output:
 The set of frequent itemsets in T, having support >= s
 The set of rules derived from T, having support >= s and
confidence >= c
Florin Radulescu, Course 1

40 DM, DMDW
Example
 Consider the following set of transactions:

Transaction ID | Items
1 | Bread, Milk, Butter, Orange Juice, Onion, Beer
2 | Bread, Milk, Butter, Onion, Garlic, Beer, Orange Juice, Shirt, Pen, Ink, Baby diapers
3 | Milk, Butter, Onion, Garlic, Beer
4 | Orange Juice, Shirt, Shoes, Bread, Milk
5 | Butter, Onion, Garlic, Beer, Orange Juice

 If s = 60% then {Bread, Milk, Orange Juice} or {Onion, Garlic, Beer}
are frequent itemsets. Also, if s = 60% and c = 70% then the rule
{Onion, Beer} → {Garlic} is a valid one because its support is 60%
and its confidence is 75%.
Florin Radulescu, Course 1

41 DM, DMDW
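The figures quoted above can be checked with a few lines of Python; the
transactions are copied from the table, and the helper functions support()
and confidence() are named here for illustration only.

```python
# Minimal sketch: checking the support and confidence figures quoted above.
transactions = [
    {"Bread", "Milk", "Butter", "Orange Juice", "Onion", "Beer"},
    {"Bread", "Milk", "Butter", "Onion", "Garlic", "Beer", "Orange Juice",
     "Shirt", "Pen", "Ink", "Baby diapers"},
    {"Milk", "Butter", "Onion", "Garlic", "Beer"},
    {"Orange Juice", "Shirt", "Shoes", "Bread", "Milk"},
    {"Butter", "Onion", "Garlic", "Beer", "Orange Juice"},
]

def support(itemset):
    """Proportion of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """confidence(X -> Y) = support(X union Y) / support(X)."""
    return support(X | Y) / support(X)

print(support({"Bread", "Milk", "Orange Juice"}))        # 0.6
print(support({"Onion", "Garlic", "Beer"}))              # 0.6
print(confidence({"Onion", "Beer"}, {"Garlic"}))         # 0.75
```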
Sequences

The model:
Itemset: a set of n distinct items
I = {i1, i2, …, in }
Event: a non-empty collection of items; we can
assume that items are in a given order (e.g.
lexicographic): (i1,i2 … ik)
Sequence : an ordered list of events:
< e1 e2 … em >

Florin Radulescu, Course 1

42 DM, DMDW
Sequential Pattern Discovery
Input:
 A set of sequences S (or a sequence database).
 A Boolean function that can test whether a sequence S1 is included
in (is a subsequence of) a sequence S2. In that case S2 is called a
supersequence of S1.
 A threshold s (percent or absolute value) needed for finding
frequent sequences.
Output:
 The set of frequent sequences, i.e. the set of sequences that are
included in at least s sequences from S.
 Sometimes a set of rules can be derived from the set of frequent
sequences, each rule being of the form S1 → S2, where S1 and S2
are sequences.

Florin Radulescu, Course 1

43 DM, DMDW
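A minimal sketch of the Boolean inclusion test mentioned in the input above;
the function name is_subsequence and the example sequences are assumptions
made for illustration.

```python
# Is sequence s1 (a list of events, each event a set of items) a
# subsequence of s2? Events of s1 must appear, in order, inside events of s2.
def is_subsequence(s1, s2):
    i = 0
    for event in s2:
        if i < len(s1) and s1[i] <= event:   # s1[i] is contained in this event
            i += 1
    return i == len(s1)

s2 = [{"Book_on_C", "Book_on_C++"}, {"Book_on_SQL"}, {"Book_on_Perl"}]
s1 = [{"Book_on_C"}, {"Book_on_Perl"}]
print(is_subsequence(s1, s2))                                  # True
print(is_subsequence([{"Book_on_Perl"}, {"Book_on_C"}], s2))   # False (wrong order)
```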
Examples
 In a bookstore we can find frequent sequences like:
{(Book_on_C, Book_on_C++), (Book_on_Perl)}

 From this sequence we can derive a rule such as: after buying
books about C and C++, a customer buys books on Perl:
(Book_on_C, Book_on_C++) → (Book_on_Perl)

Florin Radulescu, Course 1

44 DM, DMDW
Summary
This first course presented:
 A list of alternative definitions of Data Mining and some examples of
what is Data Mining and what is not Data Mining
 A discussion about the researchers communities involved in Data
Mining and about the fact that Data Mining is a cluster of
subdomains
 The steps of the Data Mining process from collecting data located in
existing repositories (data warehouses, archives or operational
systems) to the final evaluation step.
 A brief description of the main subdomains of Data Mining with some
examples for each of them.

Next week: Data preprocessing

Florin Radulescu, Course 1

45 DM, DMDW
References
[Liu 11] Bing Liu, 2011. Web Data Mining, Exploring Hyperlinks,
Contents, and Usage Data, Second Edition, Springer, 1-13.
[Tan, Steinbach, Kumar 06] Pang-Ning Tan, Michael Steinbach,
Vipin Kumar, 2006. Introduction to Data Mining, Adisson-Wesley, 1-
16.
[Kimball, Ross 02] Ralph Kimball, Margy Ross, 2002. The Data
Warehouse Toolkit, Second Edition, John Wiley and Sons, 1-16, 396
[Mikut, Reischl 11] Ralf Mikut and Markus Reischl, Data mining
tools, 2011, Wiley Interdisciplinary Reviews: Data Mining and
Knowledge Discovery, Volume 1, Issue 5,
http://onlinelibrary.wiley.com/doi/10.1002/widm.24/pdf
[Ullman] Jeffrey Ullman, Data Mining Lecture Notes, 2003-2009,
web page: http://infolab.stanford.edu/~ullman/mining/mining.html
[Wikipedia] Wikipedia, the free encyclopedia, en.wikipedia.org

Florin Radulescu, Course 1

46 DM, DMDW
2. Data preprocessing

Prof.dr.ing. Florin Radulescu


Universitatea Politehnica din Bucureşti
Road Map

 Data types
 Measuring data
 Data cleaning
 Data integration
 Data transformation
 Data reduction
 Data discretization
 Summary
Florin Radulescu, Note de curs

2 DMDW-2
Data types

 Categorical vs. Numerical


 Scale types
 Nominal
 Ordinal
 Interval
 Ratio (RO: scala proportionala)

Florin Radulescu, Note de curs

3 DMDW-2
Categorical vs. Numerical
 Categorical data, consisting of names representing some
categories, meaning that they belong to a definable category.
Examples: color (with categories red, green, blue and white) or
gender (male, female).
 The values of this type are not ordered, so the usual operations
that may be performed on them are equality testing and set
inclusion.
 Numerical data, consisting of numbers from a continuous or
discrete set of values.
 Values are ordered, so testing this order is possible (<, >, etc.).
 Sometimes we must or may convert categorical data into
numerical data by assigning a numeric value (or code) to each
label.
Florin Radulescu, Note de curs

4 DMDW-2
Scale types

Stanley Smith Stevens, director of the Psycho-Acoustic Laboratory,
Harvard University, proposed in a 1946 Science article that all
measurement in science uses four different types of scales:
 Nominal
 Ordinal
 Interval
 Ratio
Florin Radulescu, Note de curs

5 DMDW-2
Scale types

Quote from the article: (the facsimile of the quotation from
[Stevens 46] shown on the original slide is not reproduced here)

Florin Radulescu, Note de curs

6 DMDW-2
Nominal
 Values belonging to a nominal scale are
characterized by labels.
 Values are unordered and equally weighted.
 We cannot compute the mean or the median
from a set of such values
 Instead, we can determine the mode, meaning
the value that occurs most frequently.
 Nominal data are categorical but may be
treated sometimes as numerical by assigning
numbers to labels.
Florin Radulescu, Note de curs

7 DMDW-2
Ordinal
 Values of this type are ordered but the difference or
distance between two values cannot be determined.
 The values only determine the rank order /position in the
set.
 Examples: the military rank set or the order of
marathoners at the Olympic Games (without the times)
 For these values we can compute the mode or the
median (the value placed in the middle of the ordered
set) but not the mean.
 These values are categorical in essence but can be
treated as numerical because of the assignment of
numbers (position in set) to the values
Florin Radulescu, Note de curs

8 DMDW-2
Interval
 These are numerical values.
 For interval scaled attributes the difference between two
values is meaningful.
 Example: the temperature using Celsius scale is an
interval scaled attribute because the difference between
10 and 20 degrees is the same as the difference
between 40 and 50 degrees.
 Zero does not mean ‘nothing’ but is somehow arbitrarily
fixed. For that reason negative values are also allowed.
 We can compute the mean, the standard deviation or we
can use regression to predict new values.

Florin Radulescu, Note de curs

9 DMDW-2
Ratio
Ratio scaled attributes are like interval scaled
attributes but zero means ‘nothing’.
Negative values are not allowed.
The ratio between two values is meaningful.
Example: age - a 10-year-old child is twice as old as a 5-year-old
child.
Other examples: temperature in Kelvin, mass in
kilograms, length in meters, etc.
All mathematical operations can be performed,
for example logarithms, geometric and harmonic
means, coefficient of variation
Florin Radulescu, Note de curs

10 DMDW-2
Binary data
 Sometimes an attribute may have only two values, as the
gender in a previous example. In that case the attribute is
called binary.
 Symmetric binary: when the two values are of the same weight
and have equal importance (as in the gender case)
 Asymmetric binary: one of the values is more important than
the other. Example: a medical bulletin containing blood tests for
identifying the presence of some substances, evaluated by
‘Present’ or ‘Absent’ for each substance. In that case ‘Present’ is
more important than ‘Absent’.
 Binary attributes can be treated as interval or ratio scaled, but in
most cases these attributes must be treated as nominal (symmetric
binary) or ordinal (asymmetric binary).
 There is a set of similarity and dissimilarity (distance) functions
specific to binary attributes.
Florin Radulescu, Note de curs

11 DMDW-2
Road Map

 Data types
 Measuring data
 Data cleaning
 Data integration
 Data transformation
 Data reduction
Data discretization
 Summary
Florin Radulescu, Note de curs

12 DMDW-2
Measuring data

 Measuring central tendency:


 Mean
 Median
 Mode
 Midrange (ro: valoarea centrala)
 Measuring dispersion:
 Range
 Kth percentile
 IQR (ro: intervalul intercuartilic)
 Five-number summary
 Standard deviation and variance
Florin Radulescu, Note de curs

13 DMDW-2
Central tendency - Mean
 Consider a set of n values of an attribute: x1, x2, …, xn.
 Mean: The arithmetic mean or average value is:
μ = (x1 + x2 + …+ xn) / n
 If the values x have different weights, w1, …, wn , then
the weighted arithmetic mean or weighted average is:
μ = (w1x1 + w2x2 + …+ wnxn) / (w1 + w2 + …+ wn)
 If the extreme values are eliminated from the set
(smallest 1% and biggest 1%) a trimmed mean is
obtained.

Florin Radulescu, Note de curs

14 DMDW-2
Central tendency - Median
Median: The median value of an ordered set is
the middle value in the set.
Example: Median for {1, 3, 5, 7, 1001, 2002,
9999} is 7.
If n is even the median is the mean of the middle
values:
the median of {1, 3, 5, 7, 1001, 2002} is 6
(arithmetic mean of 5 and 7).

Florin Radulescu, Note de curs

15 DMDW-2
Central tendency - Mode
Mode (RO: valoarea modala): The mode of a
dataset is the most frequent value.
A dataset may have more than a single mode.
For 1, 2 and 3 modes the dataset is called
unimodal, bimodal and trimodal.
When each value is present only once there is
no mode in the dataset.
For a unimodal dataset the mode is a measure
of the central tendency of data. For these
datasets we have the empirical relation:
mean – mode = 3 x (mean – median)
Florin Radulescu, Note de curs

16 DMDW-2
Central tendency - Midrange

Midrange (RO: valoarea centrala / mijlocul


intervalului). The midrange of a set of values is
the arithmetic mean of the largest and the
smallest value.
For example the midrange of {1, 3, 5, 7, 1001,
2002, 9999} is 5000 (the mean of 1 and 9999).

Florin Radulescu, Note de curs

17 DMDW-2
Dispersion (1)
 Range. The range is the difference between the largest
and smallest values.
 Example: for {1, 3, 5, 7, 1001, 2002, 9999} range is 9999
– 1 = 9998.
 kth percentile. The kth percentile is a value xj having the property
that k percent of the values are less than or equal to xj.
 Example: the median is the 50th percentile.
 The most used percentiles are the median and the 25th and 75th
percentiles, also called quartiles (RO: cuartile). Notation: Q1 for the
25th and Q3 for the 75th percentile.

Florin Radulescu, Note de curs

18 DMDW-2
Dispersion (2)
 Computing method: There is more than one method for computing
Q1, Q2 and Q3. The most obvious method is the following:
Put the values of the data set in ascending order
Compute the median using its definition. It divides the
ordered dataset into two halves (lower and upper),
neither one including the median.
The median value is Q2
The median of the lower half is Q1 (or the lower
quartile)
The median of the upper half is Q3 (or the upper
quartile)
Florin Radulescu, Note de curs

19 DMDW-2
Dispersion (3)
 Interquartile range (IQR) is the difference between Q3
and Q1 (ro: interval intercuartilic):
IQR = Q3 – Q1
 Potential outliers are values more than 1.5 x IQR below
Q1 or above Q3.
 Five-number summary. Sometimes the median and the
quartiles are not enough for representing the spread of
the values
 The smallest and biggest values must be considered
also.
 (Min, Q1, Median, Q3, Max) is called the five-number
summary.

Florin Radulescu, Note de curs

20 DMDW-2
Dispersion (4)
 Examples:
For {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
Range = 10; Midrange = 6;
Q1 = 3; Q2 = 6; Q3 = 9; IQR = 9 - 3 = 6

For {1, 3, 3, 4, 5, 6, 6, 7, 8, 8}
Range = 7; Midrange = 4.5;
Q1 = 3; Q2 = 5.5 [=(5+6)/2]; Q3 = 7; IQR = 7 - 3 = 4

For {1, 3, 5, 7, 8, 10, 11, 13}


Range = 12; Midrange = 7;
Q1 = 4; Q2 = 7.5; Q3 = 10.5; IQR = 10.5 - 4 = 6.5

Florin Radulescu, Note de curs

21 DMDW-2
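A small sketch (not part of the slides) reproducing the second example with
the quartile method described above, where the median splits the ordered
data into two halves, neither of which contains the median. The function
names are mine; note that numpy or the statistics module may use slightly
different quartile conventions.

```python
def median(xs):
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def five_number_summary(xs):
    xs = sorted(xs)
    n = len(xs)
    q2 = median(xs)
    lower = xs[:n // 2]           # halves exclude the median itself
    upper = xs[(n + 1) // 2:]
    return min(xs), median(lower), q2, median(upper), max(xs)

data = [1, 3, 3, 4, 5, 6, 6, 7, 8, 8]
mn, q1, q2, q3, mx = five_number_summary(data)
print(mn, q1, q2, q3, mx)                                # 1 3 5.5 7 8
print("IQR =", q3 - q1)                                  # IQR = 4
print("range =", mx - mn, "midrange =", (mx + mn) / 2)   # range = 7, midrange = 4.5
```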
Dispersion (5)
 Standard deviation. The standard deviation of n values
(observations) x1, …, xn with mean μ is:
σ = sqrt( [ (x1 - μ)^2 + (x2 - μ)^2 + … + (xn - μ)^2 ] / n )
 The square of the standard deviation is called variance.
 The standard deviation measures the spread of the values around
the mean value.
 A value of 0 is obtained only when all values are identical.
Florin Radulescu, Note de curs

22 DMDW-2
Road Map

 Data types
 Measuring data
 Data cleaning
 Data integration
 Data transformation
 Data reduction
Data discretization
 Summary
Florin Radulescu, Note de curs

23 DMDW-2
Objectives
The main objectives of data cleaning are:
 Replace (or remove) missing values,
 Smooth noisy data,
 Remove or just identify outliers

Florin Radulescu, Note de curs

24 DMDW-2
NULL values
When a NULL value is present in data it may be:
1. Legal NULL value: Some attributes are
allowed to contain a NULL value. In such a
case the value must be replaced by something
like ‘Not applicable’ and not a NULL value.
2. Missing value: The value existed at
measurement time but was not collected.

Florin Radulescu, Note de curs

25 DMDW-2
Missing values (1)
 May appear for various reasons:
 human / hardware / software problems,
 data not collected (considered unimportant at collection time),
 data deleted due to inconsistencies, etc.
 There are two solutions for handling missing
data:
1. Ignore the data point / example with missing
attribute values. If the number of errors is
limited and these errors are not for sensitive
data, removing them may be a solution.
Florin Radulescu, Note de curs

26 DMDW-2
Missing values (2)
2. Fill in the missing value. This may be done in
several ways:
 Fill in manually. This option is not feasible in
most of the cases due to the huge volume of
the datasets that must be cleaned.
 Fill in with a (distinct from others) value ‘not
available’ or ‘unknown’.
 Fill in with a value measuring the central
tendency, for example attribute mean,
median or mode.
Florin Radulescu, Note de curs

27 DMDW-2
Missing values (3)
2. Fill in the missing value - cont.
 Fill in with a value measuring the central
tendency but only on a subset (for example,
for labeled datasets, only for examples
belonging to the same class).
 The most probable value, if that value may
be determined, for example by decision
trees, expectation maximization (EM),
Bayes, etc.

Florin Radulescu, Note de curs

28 DMDW-2
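A minimal sketch of two of the fill-in strategies above (overall attribute
mean, and mean computed only inside the same class), assuming pandas and an
invented two-column dataset.

```python
# Minimal sketch: filling missing values with pandas (assumed tool choice).
import pandas as pd

df = pd.DataFrame({"age":   [25, None, 47, 31, None],
                   "class": ["A", "A",  "B", "B", "B"]})

# 1. Fill with the overall mean of the attribute
df["age_mean"] = df["age"].fillna(df["age"].mean())

# 2. Fill with the mean computed only over examples of the same class
df["age_class_mean"] = df.groupby("class")["age"] \
                         .transform(lambda s: s.fillna(s.mean()))

print(df)
```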
Smooth noisy data

 The noise can be defined as a random error


or variance in a measured variable ([Han,
Kamber 06]).
 Wikipedia defines noise as a colloquialism for recognized
amounts of unexplained variation in a sample.
 For removing the noise, some smoothing techniques may be
used:
1. Regression (presented in the first course)
2. Binning
Florin Radulescu, Note de curs

29 DMDW-2
Binning

 Binning (RO: partitionare, clasare) can be


used for smoothing an ordered set of values.
 Smoothing is made based on neighbor values.
 There are two steps:
 Partitioning ordered data in several bins. Each bin
contains the same number of examples (data
points).
 Smoothing for each bin: values in a bin are modified
based on some bin characteristics: mean, median,
boundaries.
Florin Radulescu, Note de curs

30 DMDW-2
Example
 Consider the following ordered data for some attribute:
1, 2, 4, 6, 9, 12, 16, 17, 18, 23, 34, 56, 78, 79, 81

Initial bins       | Mean smoothing     | Median smoothing   | Boundary smoothing
1, 2, 4, 6, 9      | 4, 4, 4, 4, 4      | 4, 4, 4, 4, 4      | 1, 1, 1, 9, 9
12, 16, 17, 18, 23 | 17, 17, 17, 17, 17 | 17, 17, 17, 17, 17 | 12, 12, 12, 23, 23
34, 56, 78, 79, 81 | 66, 66, 66, 66, 66 | 78, 78, 78, 78, 78 | 34, 34, 81, 81, 81

(bin means are rounded: 4.4 ≈ 4, 17.2 ≈ 17, 65.6 ≈ 66)

Florin Radulescu, Note de curs

31 DMDW-2
Result
So the smoothing result is:

 Initial: 1, 2, 4, 6, 9, 12, 16, 17, 18, 23, 34, 56, 78, 79, 81
 Using the mean: 4, 4, 4, 4, 4, 17, 17, 17, 17, 17, 66, 66,
66, 66, 66
 Using the median: 4, 4, 4, 4, 4, 17, 17, 17, 17, 17, 78,
78, 78, 78, 78
 Using the bin boundaries: 1, 1, 1, 9, 9, 12, 12, 12, 23, 23,
34, 34, 81, 81, 81

Florin Radulescu, Note de curs

32 DMDW-2
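A sketch of equal-frequency binning with the three smoothing variants from
the table, reproducing the values above (with bin means rounded); the
function and variable names are mine.

```python
values = [1, 2, 4, 6, 9, 12, 16, 17, 18, 23, 34, 56, 78, 79, 81]
bins = [values[i:i + 5] for i in range(0, len(values), 5)]   # 3 bins of 5 values

def smooth(bin_, how):
    if how == "mean":
        return [round(sum(bin_) / len(bin_))] * len(bin_)
    if how == "median":
        return [sorted(bin_)[len(bin_) // 2]] * len(bin_)
    if how == "boundaries":
        lo, hi = bin_[0], bin_[-1]           # replace by the closest boundary
        return [lo if v - lo <= hi - v else hi for v in bin_]

for how in ("mean", "median", "boundaries"):
    print(how, [x for b in bins for x in smooth(b, how)])
```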
Outliers
 An outlier (ro: valoare aberanta / punct izolat) is an
attribute value numerically distant from the rest of
the data.
 Outliers may sometimes be correct values: for example, the salary
of the CEO of a company may be much bigger than all other
salaries. But in most cases outliers are noise and must be handled
as such.
 Outliers must be identified and then removed (or
replaced, as any other noisy value) because many data
mining algorithms are sensitive to outliers.
 For example any algorithm using the arithmetic mean
(one of them is k-means) may produce erroneous results
because the mean is very sensitive to outliers.
Florin Radulescu, Note de curs

33 DMDW-2
Identifying outliers
 Use of IQR: values more than 1.5 x IQR below
Q1 or above Q3 are potential outliers. Boxplots
may be used to identify these outliers (boxplots
are a method for graphical representation of
data dispersion).
 Use of standard deviation: values that are
more than two standard deviations away from
the mean for a given attribute are also
potential outliers.
 Clustering. After clustering a certain dataset, some points are
outside any cluster (or far away from any cluster center). These
points are potential outliers.
Florin Radulescu, Note de curs

34 DMDW-2
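A sketch of the IQR rule above: flag values more than 1.5 x IQR below Q1 or
above Q3. The quartiles come from Python's statistics module, whose
convention may differ slightly from the method on the earlier slides.

```python
import statistics

data = [1, 3, 5, 7, 8, 10, 11, 13, 90]           # 90 is an injected outlier
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print([x for x in data if x < low or x > high])  # -> [90]
```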
Road Map

 Data types
 Measuring data
 Data cleaning
 Data integration
 Data transformation
 Data reduction
Data discretization
 Summary
Florin Radulescu, Note de curs

35 DMDW-2
Objectives

Data integration means merging data from


different data sources into a coherent dataset.
The main activities are:
Schema integration
Remove duplicates and redundancy
Handle inconsistencies

Florin Radulescu, Note de curs

36 DMDW-2
Schema integration
Must identify the translation of every source
scheme to the final scheme (entity identification
problem)
Subproblems:
The same thing is called differently in every data
source. Example: the customer id may be called
Cust-ID, Cust#, CustID, CID in different sources.
Different things are called with the same name in
different sources. Example: for employees data, the
attribute ‘City’ means city where resides in a source
and city of birth in another source.
Florin Radulescu, Note de curs

37 DMDW-2
Duplicates

Duplicates: The same information may be stored


in many data sources. Merging them can cause
sometimes duplicates of that information:
 as duplicate attribute (same attribute with different
names is found multiple times in the final result) or
 as duplicate instance (same object/entity is found
multiple times in the final database).
These duplicates must be identified and
removed.

Florin Radulescu, Note de curs

38 DMDW-2
Redundancy
Redundancy: Some information may be
deduced / computed.
For example age may be deduced from
birthdate, annual salary may be computed from
monthly salary and other bonuses recorded for
each employee.
Redundancy must be removed from the dataset
before running the data mining algorithm
Note that in existing data warehouses some
redundancy is allowed.
Florin Radulescu, Note de curs

39 DMDW-2
Inconsistencies
Inconsistencies are conflicting values for a set of
attributes.
Example Birthdate = January 1, 1980, Age = 12
represents an obvious inconsistency but we may
find other inconsistencies that are not so
obvious.
For detecting inconsistencies extra knowledge
about data is necessary: for example, the
functional dependencies attached to a table
scheme can be used.
Available metadata describing the content of the
dataset may help in removing inconsistencies.
Florin Radulescu, Note de curs

40 DMDW-2
Road Map

 Data types
 Measuring data
 Data cleaning
 Data integration
 Data transformation
 Data reduction
Data discretization
 Summary
Florin Radulescu, Note de curs

41 DMDW-2
Objectives

Data is transformed and summarized in a better


form for the data mining process:
Normalization
New attribute construction
Summarization using aggregate functions

Florin Radulescu, Note de curs

42 DMDW-2
Normalization
All attributes are scaled to fit a specified range:
0 to 1,
-1 to 1 or generally
|v| <= r where r is a given positive value.
Needed when the importance of some attributes appears bigger
only because the range of the values of those attributes is bigger.
Example: the Euclidean distance between A(0.5, 101) and B(0.01,
2111) is ≈ 2010, determined almost exclusively by the second
dimension.

Florin Radulescu, Note de curs

43 DMDW-2
Normalization
We can achieve normalization using:
 Min-max normalization:
vnew = (v – vmin) / (vmax – vmin)
 For positive values the formula is:
vnew = v / vmax
 z-score normalization (σ is the standard deviation):
vnew = (v – vmean) / σ
 Decimal scaling:
vnew = v / 10^n
where n is the smallest integer such that all values become (in
absolute value) less than the range r (for r = 1, all new values of v
are <= 1).
Florin Radulescu, Note de curs

44 DMDW-2
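A sketch of the three rescaling formulas above in plain Python; the function
names and the sample values are assumptions made for illustration.

```python
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    k = 1                                    # smallest power of 10 that brings
    while max(abs(v) for v in values) / 10 ** k >= 1:    # all |v| below 1
        k += 1
    return [v / 10 ** k for v in values]

data = [0.5, 101, 2111, 45]
print(min_max(data))
print(z_score(data))
print(decimal_scaling(data))   # divides by 10^4, so all |v| < 1
```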
Feature construction
 New attribute construction is called also feature
construction.
 It means: building new attributes based on the values of
existing ones.
 Example: if the dataset contains an attribute ‘Color’ with
only three distinct values {Red, Green, Blue} then three
attributes may be constructed: ‘Red’, ‘Green’ and ‘Blue’
where only one of them equals 1 (based on the value of
‘Color’) and the other two 0.
 Another example: use a set of rules, decision trees or
other tools to build new attribute values from existing
ones. New attributes will contain the class labels
attached by the rules / decision tree used / labeling tool.

Florin Radulescu, Note de curs

45 DMDW-2
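A minimal sketch of the 'Color' example above (one new 0/1 attribute per
distinct value); pandas.get_dummies would produce the same result, but plain
Python is used here to keep the idea visible.

```python
rows = [{"Color": "Red"}, {"Color": "Blue"}, {"Color": "Green"}, {"Color": "Red"}]
for row in rows:
    for value in ("Red", "Green", "Blue"):
        # exactly one of the three new attributes is 1 for each row
        row[value] = 1 if row["Color"] == value else 0
print(rows)
```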
Summarization
At this step aggregate functions may be used to
add summaries to the data.
Examples: adding sums for daily, monthly and
annual sales, counts and averages for a number
of customers or transactions, and so on.
All these summaries are used for the ‘slice and
dice’ process when data is stored in a data
warehouse.
The result is a data cube and each summary
information is attached to a level of granularity.
Florin Radulescu, Note de curs

46 DMDW-2
Road Map

 Data types
 Measuring data
 Data cleaning
 Data integration
 Data transformation
 Data reduction
Data discretization
 Summary
Florin Radulescu, Note de curs

47 DMDW-2
Objectives

Not all information produced by the


previous steps is needed for a certain data
mining process.
Reducing the data volume by keeping only
the necessary attributes leads to a better
representation of data and reduces the
time for data analysis.

Florin Radulescu, Note de curs

48 DMDW-2
Reduction methods (1)
Methods that may be used for data reduction (see [Han,
Kamber 06]) :
 Data cube aggregation, already discussed.
 Attribute selection: keep only relevant attributes. This
can be made by:
 stepwise forward selection (start with an empty set and add
attributes),
 stepwise backward elimination (start with all attributes and
remove some of them one by one)
 a combination of forward selection and backward elimination.
 decision tree induction: after building the decision tree, only
attributes used for decision nodes are kept.

Florin Radulescu, Note de curs

49 DMDW-2
Reduction methods (2)
 Dimensionality reduction: encoding mechanisms are used to
reduce the data set size or compress data.
 A popular method is Principal Component Analysis (PCA): given N
data vectors having n dimensions, find k <= n orthogonal vectors
(called principal components) that can best be used to represent
the data.
 A PCA example is presented on the following slide,
for a multivariate Gaussian distribution (source:
wikipedia).

Florin Radulescu, Note de curs

50 DMDW-2
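A minimal PCA sketch, assuming scikit-learn and randomly generated
3-dimensional data in which one dimension is made nearly redundant; it is an
illustration, not part of the course material.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 0.9 * X[:, 0] + 0.1 * X[:, 2]    # make one dimension nearly redundant

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)           # 100 x 2 representation of the data
print(pca.explained_variance_ratio_)       # variance kept by each component
```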
PCA example
PCA for a multivariate Gaussian distribution (source:
http://2011.igem.org/Team:USTC-Software/parameter )

Florin Radulescu, Note de curs

51 DMDW-2
Reduction methods (3)
Numerosity reduction: the data are replaced
by smaller data representations such as
parametric models (only the model parameters
are stored in this case) or nonparametric
methods: clustering, sampling, histograms.
Discretization and concept hierarchy
generation, discussed in the following
paragraph.

Florin Radulescu, Note de curs

52 DMDW-2
Road Map

 Data types
 Measuring data
 Data cleaning
 Data integration
 Data transformation
 Data reduction
Data discretization
 Summary
Florin Radulescu, Note de curs

53 DMDW-2
Objectives
There are many data mining algorithms that
cannot use continuous attributes. Replacing
these continuous values with discrete ones is
called discretization.
Even for discrete attributes, it is better to have a reduced number of
values, leading to a reduced representation of the data. This may be
performed using concept hierarchies.

Florin Radulescu, Note de curs

54 DMDW-2
Discretization (1)

 Discretization means reducing the number of


values for a given continuous attribute by
dividing its values in intervals.
 Each interval is labeled and each attribute value
will be replaced with the interval label.
 Some of the most popular methods to perform
discretization are:
1. Binning: equi-width bins or equi-frequency bins may
be used. Values in the same bin receive the same
label.
Florin Radulescu, Note de curs

55 DMDW-2
Discretization (2)
 Popular methods to perform discretization - cont:
2. Histograms: like binning, histograms partition values for an
attribute in buckets. Each bucket has a different label and labels
replace values.
3. Entropy based intervals: each attribute value is considered a
potential split point (between two intervals) and an information
gain is computed for it (reduction of entropy by splitting at that
point). Then the value with the greatest information gain is
picked. In this way intervals may be constructed in a top-down
manner.
4. Cluster analysis: after clustering, all values in the same cluster
are replaced with the same label (the cluster-id for example)

Florin Radulescu, Note de curs

56 DMDW-2
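A sketch of discretization by binning using pandas.cut, with invented age
values and the Young / Middle-aged / Old labels from the earlier example.

```python
import pandas as pd

ages = pd.Series([15, 22, 37, 41, 58, 63, 79, 84])
labels = pd.cut(ages, bins=3, labels=["Young", "Middle-aged", "Old"])  # equal-width bins
print(list(labels))
# equal-frequency binning would use pd.qcut(ages, q=3, labels=...) instead
```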
Concept hierarchies

Usage of a concept hierarchy to perform


discretization means replacing low-level
concepts (or values) with higher level concepts.
Example: replace the numerical value for age
with young, middle-aged or old.
For numerical values, discretization and concept
hierarchies are the same.

Florin Radulescu, Note de curs

57 DMDW-2
Concept hierarchies
 For categorical data the goal is to replace a bigger set of
values with a smaller one (categorical data are discrete
by definition):
 Manually define a partial order for a set of attributes. For example
the set {Street, City, Department, Country} is partially ordered:
Street < City < Department < Country. In that case we can
construct an attribute ‘Localization’ at any level of this hierarchy, by
using the n rightmost attributes (n = 1 .. 4).
 Specify (manually) high-level concepts for the sets of low-level
attribute values associated with them. For example {Muntenia,
Oltenia, Dobrogea} → Tara_Romaneasca.
 Automatically identify a partial order between attributes, based on
the fact that high-level concepts are represented by attributes
containing a smaller number of values compared with low-level
ones.

Florin Radulescu, Note de curs

58 DMDW-2
Summary
This second course presented:
 Data types: categorical vs. numerical, the four scales (nominal,
ordinal, interval and ratio) and binary data.
 A short presentation of data preprocessing steps and some ways to
extract important characteristics of data: central tendency (mean,
mode, median, etc) and dispersion (range, IQR, five-number
summary, standard deviation and variance).
 A description of every preprocessing step:
 cleaning,
 integration,
 transformation,
 reduction and
 discretization

 Next week: Association rules and sequential patterns

Florin Radulescu, Note de curs

59 DMDW-2
References
[Han, Kamber 06] Jiawei Han, Micheline Kamber, Data Mining:
Concepts and Techniques, Second Edition, Morgan Kaufmann
Publishers, 2006, 47-101
[Stevens 46] Stevens, S.S, On the Theory of Scales of
Measurement. Science June 1946, 103 (2684): 677–680.
[Wikipedia] Wikipedia, the free encyclopedia, en.wikipedia.org
[Liu 11] Bing Liu, 2011. CS 583 Data mining and text mining course
notes, http://www.cs.uic.edu/~liub/teach/cs583-fall-11/cs583.html

Florin Radulescu, Note de curs

60 DMDW-2
Association Rules and
Sequential Patterns
Prof.dr.ing. Florin Radulescu
Universitatea Politehnica din Bucureşti
Road Map

Frequent itemsets and rules


Apriori algorithm
FP-Growth
Data formats
Class association rules
Sequential patterns. GSP algorithm

Florin Radulescu, Note de curs

2 DMDW-3
Objectives
 Association rule learning was introduced in the article “Mining
Association Rules Between Sets of Items in Large Databases” by
Agrawal, Imielinski and Swami (see references), an article
presented at the 1993 SIGMOD Conference (ACM SIGMOD means
ACM Special Interest Group on Management of Data).
 One of the best known applications is finding relationships (rules)
between products as recorded by POS systems in supermarkets.
 For example, the statement that 85% of the baskets that contain
bread also contain mineral water is a rule with bread as antecedent
and mineral water as consequent.
Florin Radulescu, Note de curs

3 DMDW-3
Objectives
The original article (Agrawal et al.) lists some
examples of the expected results:
Find all rules that have “Diet Coke” as
consequent,
Find all rules that have “bagels” in the
antecedent,
Find all rules that have “sausage” in the
antecedent and “mustard” in the consequent,
Find all the rules relating items located on
shelves A and B in the store,
Find the “best” k rules (considering rule support)
that have “bagels” in the consequent
Florin Radulescu, Note de curs

4 DMDW-3
Frequent itemsets and rules

Frequent itemsets and rules:


 Items and transactions
 Association rules
 Goals for mining transactions

Florin Radulescu, Note de curs

5 DMDW-3
Items and transactions
 Let I = {i1, i2, …, in} be a set of items. For example, items may be
all products sold in a supermarket or all words contained in some
documents.
 A transaction t is a set of items, with t ⊆ I. Examples of
transactions are market baskets containing products or documents
containing words.
 A transaction dataset (or database) T is a set of transactions,
T = {t1, t2, …, tm}. Each transaction may contain a different number
of items and the dataset may be stored in a DBMS-managed
database or in a text file.
 An itemset S is a subset of I. If v = |S| is the number of items in S
(the cardinality of S), then S is called a v-itemset.

Florin Radulescu, Note de curs

6 DMDW-3
Items and transactions
 The support of an itemset X, sup(X), is equal to the
number (or proportion) of transactions in T containing
X. The support may be given as an absolute value
(number of transactions) or proportion or percent
(proportion of transactions).
 In many cases we want to find itemsets with a support
greater or equal with a given value (percent) s. Such
an itemset is called frequent itemset and, in the
market basket example, it contains items that can be
found together in many baskets (where the measure
of ‘many’ is s).
 These frequent itemsets are the source of all ‘powerful’
association rules.
Florin Radulescu, Note de curs

7 DMDW-3
Example 1
Let us consider a dataset containing market basket
transactions:
 I = {laptop, mouse, tablet, hard-drive, monitor,
keyboard, DVD-drive, CD-drive, flash-memory, . . .}
 T = {t1, t2, …, tm}
 t1 = {laptop, mouse, tablet}
 t2 = {hard-drive, monitor, laptop, keyboard, DVD-drive}
. . .
 tm = {keyboard, mouse, tablet)

 S = {keyboard, mouse, monitor, laptop} is a 4-itemset.

Florin Radulescu, Note de curs

8 DMDW-3
Example 2
If items are words and transactions are documents,
where each document is considered a bag of words,
then we can have:
 T = {Doc1, Doc2, …, Doc6}
 Doc1 = {rule, tree, classification}
 Doc2 = {relation, tuple, join, algebra, recommendation}
 Doc3 = {variable, loop, procedure, rule}
 Doc4 = {clustering, rule, tree, recommendation}
 Doc5 = {join, relation, selection, projection,
classification}
 Doc6 = {rule, tree, recommendation}

Florin Radulescu, Note de curs

9 DMDW-3
Example 2
In that case:
 sup({rule, tree}) = 3 or 50% or 0.5
 sup({relation, join}) = 2 or 33.33% or 1/3
 If the threshold is s = 50% (or 0.5) then {rule, tree} is
frequent and {relation, join} is not.

Florin Radulescu, Note de curs

10 DMDW-3
Frequent itemsets and rules

Frequent itemsets and rules:


 Items and transactions
 Association rules
 Goals for mining transactions

Florin Radulescu, Note de curs

11 DMDW-3
Association rules

If X and Y are two itemsets, an association rule is an implication
of the form X → Y where X ∩ Y = ∅.
X is called the antecedent of the rule (RO: antecedent), Y is the
consequent (RO: consecinta).
For each association rule we can compute the support of the rule
and the confidence of the rule.
Florin Radulescu, Note de curs

12 DMDW-3
Association rules
 If m is the number of transactions in T then:
 sup(X → Y) = sup(X ∪ Y) - as absolute value, or
 sup(X → Y) = sup(X ∪ Y) / m - as proportion
 conf(X → Y) = sup(X ∪ Y) / sup(X)
where the support of an itemset is given as an absolute value
(number of transactions).

(Figure: nested sets - the transactions containing X ∪ Y, inside the
transactions containing X, inside the whole dataset T, with |T| = m.)
Florin Radulescu, Note de curs

13 DMDW-3
Association rules

The support of a rule X → Y is given by the proportion of
transactions in T containing both X and Y.
The confidence of a rule X → Y is given by the proportion of
transactions containing Y within the set of transactions containing X
(the set of transactions containing X ∪ Y is included in the set of
transactions containing X).
Florin Radulescu, Note de curs

14 DMDW-3
Association rules

If the support of a rule is high, it describes a relationship
between itemsets that are found together in many transactions
(itemsets in many baskets or word sets in many documents).
If the confidence of a rule X → Y is high, then whenever a
transaction contains X it also contains Y with a high probability
(equal to the confidence of the rule).
Florin Radulescu, Note de curs

15 DMDW-3
Finding association rules

The previous paragraph states that if the support


of a rule X → Y is high, then X and Y can be
found together in many transactions.
Consequently, both itemsets are part of a
frequent itemset (considering the same
minimum support), or in other terms, each such
rule can be determined starting from a frequent
itemset and dividing it in two disjoint parts: X and
Y.
Florin Radulescu, Note de curs

16 DMDW-3
Finding association rules
 That means that the process of finding all the rules, given the
minimum support and the minimum confidence, has three steps:
 Step 1. Find all frequent itemsets containing at least two items,
considering the given minimum support minsup.
 Step 2. For each frequent itemset U found in step 1, list all splits
(X, Y) with X ∩ Y = ∅ and X ∪ Y = U. Each split generates a rule
X → Y.
 Step 3. Compute the confidence of each rule. Keep only the rules
with confidence at least minconf.
Florin Radulescu, Note de curs

17 DMDW-3
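A sketch of steps 2 and 3 above: enumerate all splits of a frequent itemset
U and keep the rules whose confidence reaches minconf. The document set and
support count reuse Example 2; all function names are mine.

```python
from itertools import combinations

def rules_from_itemset(U, support, minconf):
    """Generate all rules X -> Y with X u Y = U whose confidence >= minconf."""
    U = frozenset(U)
    rules = []
    for r in range(1, len(U)):                 # size of the antecedent X
        for X in map(frozenset, combinations(U, r)):
            Y = U - X
            conf = support(U) / support(X)     # conf(X -> Y) = sup(X u Y) / sup(X)
            if conf >= minconf:
                rules.append((set(X), set(Y), conf))
    return rules

docs = [{"rule", "tree", "classification"},
        {"relation", "tuple", "join", "algebra", "recommendation"},
        {"variable", "loop", "procedure", "rule"},
        {"clustering", "rule", "tree", "recommendation"},
        {"join", "relation", "selection", "projection", "classification"},
        {"rule", "tree", "recommendation"}]
sup = lambda s: sum(s <= d for d in docs)      # absolute support over the documents

print(rules_from_itemset({"rule", "tree"}, sup, 0.8))
# only ({'tree'}, {'rule'}, 1.0) survives; rule -> tree has confidence 0.75
```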
Example 3
Consider the set of six transactions in Example 2:
 Doc1 = {rule, tree, classification}
 Doc2 = {relation, tuple, join, algebra,
recommendation}
 Doc3 = {variable, loop, procedure, rule}
 Doc4 = {clustering, rule, tree, recommendation}
 Doc5 = {join, relation, selection, projection,
classification}
 Doc6 = {rule, tree, recommendation}

Florin Radulescu, Note de curs

18 DMDW-3
Example 3
With a minimum support of 50% we find that
{rule, tree} is a frequent itemset. The two rules
derived from this itemset have the same
minimum support:
 rule → tree
with sup = 50% and conf = 3 / 4 = 75% and
 tree → rule
with sup = 50% and conf = 3 / 3 = 100%
If the minimum confidence required is 80% then
only the second rule is kept, the first being
considered not enough powerful.
Florin Radulescu, Note de curs

19 DMDW-3
Frequent itemsets and rules

Frequent itemsets and rules:


 Items and transactions
 Association rules
 Goals for mining transactions

Florin Radulescu, Note de curs

20 DMDW-3
Goals for mining transactions
 Goal 1: Find frequent itemsets. Frequent
itemsets can be used not only to find rules but also
for marketing purposes.
 As an example, in a supermarket, frequent itemsets help
marketers place items in an effort to control the way customers
walk through the store:
 Items that are sold together are placed for
example in distant corners of the store such that
customers must go from one product to another
possibly putting other products in the basket on
the way.
Florin Radulescu, Note de curs

21 DMDW-3
Goal 2

 Find association rules. Such a rule tells us that people who buy
some items X also buy some other items Y with a high probability.
Association rules may also be used for marketing purposes.
A well known example in the domain literature is
the rule:
Diapers → Beer

Florin Radulescu, Note de curs

22 DMDW-3
Diapers → Beer
In [Whitehorn 06] this example is described as follows:
 “Some time ago, Wal-Mart decided to combine the
data from its loyalty card system with that from its
point of sale systems.
 The former provided Wal-Mart with demographic data
about its customers, the latter told it where, when and
what those customers bought.
 Once combined, the data was mined extensively and
many correlations appeared.
 Some of these were obvious; people who buy gin are
also likely to buy tonic. They often also buy lemons.
 However, one correlation stood out like a sore thumb
because it was so unexpected.
Florin Radulescu, Note de curs

23 DMDW-3
Diapers → Beer
On Friday afternoons, young American males who
buy diapers (nappies) also have a predisposition
to buy beer.
 No one had predicted that result, so no one would
ever have even asked the question in the first place.
 Hence, this is an excellent example of the difference
between data mining and querying.”

 This example is only a Data Mining myth (as Daniel


Power describes at http://www.dssresources.com/newsletters/66.php), but for many years it was a widely used example because of its psychological impact.
Florin Radulescu, Note de curs

24 DMDW-3
Goal 3
A third goal for mining transactions is also listed in [Ullman 03-09]:
Goal 3: Find causalities. In the case of the rule
Diapers → Beer a natural question is if the left
part of the rule (buying diapers) causes the right
part (buy also beer).
Causal rules can be used in marketing: a low price of diapers will attract diaper buyers, and an increase in the beer price will then raise the overall sales figures.
Florin Radulescu, Note de curs

25 DMDW-3
Algorithms
 There are many algorithms for finding frequent
itemsets and consequently the association rules
in a dataset.
All these algorithms are developed for huge
volumes of data, meaning that the dataset is too
large to be loaded and processed in the main
memory.
For that reason, minimizing the number of times the data are read from disk becomes a key feature of each algorithm.
Florin Radulescu, Note de curs

26 DMDW-3
Road Map

Frequent itemsets and rules


Apriori algorithm
 Apriori principle
 Apriori algorithm
FP-Growth
Data formats
Class association rules
Sequential patterns. GSP algorithm
Florin Radulescu, Note de curs

27 DMDW-3
Apriori algorithm

 This algorithm is introduced in 1994 in [Agrawal,


Srikant 94], at the VLDB conference (VLDB =
International Conference on Very Large
Databases).
It is based on the Apriori principle (also called the monotonicity or downward closure property).

Florin Radulescu, Note de curs

28 DMDW-3
Apriori principle
The Apriori principle states that any subset of a
frequent itemset is also a frequent itemset.
Example 4: If {1, 2, 3, 4} is a frequent itemset then all
its four subsets with 3 values are also frequent: {1,
2, 3}, {1, 2, 4}, {1, 3, 4} and {2, 3, 4}.
 Consequently, each frequent v-itemset is the union of v frequent (v-1)-itemsets.
 That means we can determine the frequent itemsets
with dimension v examining only the set of all
frequent itemsets with dimension (v-1).
Florin Radulescu, Note de curs

29 DMDW-3
Apriori principle

It means that an algorithm for finding


frequent itemsets must:
1. find frequent 1-itemsets (frequent items)
2. find frequent 2-itemsets considering all pairs
of frequent items found in step 1
3. Find frequent 3-itemsets considering all
triplets with each subset in the frequent pairs
set found in step 2
4. . . . and so on.
Florin Radulescu, Note de curs

30 DMDW-3
Apriori principle
It is a level-wise approach where each step
requires a full scan of the dataset (residing on
disk).
A diagram is presented in the next slide where Ci
is the set of candidates for frequent i-itemsets
and Li is the actual set of frequent i-itemsets.
C1 is the set of all items found in transactions (a subset of I) and may be obtained either as the union of all transactions in T or by taking C1 = I (in that case some items may have zero support)
Florin Radulescu, Note de curs

31 DMDW-3
Using Apriori Principle
C1 → L1 → C2 → L2 → C3 → L3 → … → Ck → Lk

The process stops in two cases:


 No candidate from Ck has the support at least minsup (Lk
is empty)
 There is no (k+1)-itemset with all k-subsets in Lk
(meaning that Ck+1 is empty)
Florin Radulescu, Note de curs

32 DMDW-3
Apriori Algorithm
The algorithm is described in [Agrawal, Srikant 94] and
uses the level-wise approach described before:
 A first scan of the dataset leads to the L1 (the set of
frequent items). For each transaction t in T and for
each item a in t the count of a is increased
(a.count++). At the end of the scan L1 will contain all
items with a count at least minsup (given as absolute
value).
 For k=2, 3, … the process continues by generating the
set of candidates Ck and then counting the support of
each candidate by a full scan of the dataset.
 Process ends when Lk is empty.

Florin Radulescu, Note de curs

33 DMDW-3
Apriori Algorithm
Algorithm Apriori(T)
  L1 = scan(T);                              // frequent 1-itemsets
  for (k = 2; Lk-1 ≠ ∅; k++) do
      Ck = apriori-gen(Lk-1);
      for each transaction t ∈ T do
          for each candidate c ∈ Ck do
              if c is contained in t then
                  c.count++;
          end
      end
      Lk = {c ∈ Ck | c.count ≥ minsup}
  end
  return L = ∪k Lk;
Florin Radulescu, Note de curs

34 DMDW-3
Candidate generation
Candidate generation is also described in the
original algorithm as having two steps: the join step and the prune step. The first builds a larger set of candidates and the second removes those that cannot be frequent.
In the join step each candidate is obtained from two different frequent (k-1)-itemsets sharing (k-2) identical items:
Ck = { {i1, …, ik-1, i'k-1} | p = {i1, …, ik-1} ∈ Lk-1, q = {i1, …, i'k-1} ∈ Lk-1, ik-1 < i'k-1 }
Florin Radulescu, Note de curs

35 DMDW-3
Candidate generation

The prune step removes those candidates


containing a (k-1)-itemset that is not in Lk-1 (if a
subset of an itemset is not frequent the whole
itemset cannot be frequent, as a consequence of
the Apriori principle):
The join step is described in [Agrawal, Srikant
94] as a SQL query (see next slide). Each
candidate in Ck comes from two frequent
itemsets in Lk-1 that differ by a single item

Florin Radulescu, Note de curs

36 DMDW-3
Join

The join step


insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1 and ... and p.itemk-2 = q.itemk-2
  and p.itemk-1 < q.itemk-1;

Florin Radulescu, Note de curs

37 DMDW-3
Join and prune

The prune step


foreach c in Ck do
    foreach (k-1)-subset s of c do
        if (s ∉ Lk-1) then
            remove c from Ck;
    end
end
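Putting the level-wise loop together with the join and prune steps, a compact in-memory Python sketch (illustrative only; a real implementation scans the dataset from disk at each level) is:

from itertools import combinations

def apriori(transactions, minsup):
    # Return all frequent itemsets (as frozensets) with absolute support >= minsup.
    transactions = [frozenset(t) for t in transactions]

    def frequent(candidates):
        counts = {c: 0 for c in candidates}
        for t in transactions:                    # one full pass over the data
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c for c, n in counts.items() if n >= minsup}

    items = {i for t in transactions for i in t}
    Lk = frequent({frozenset([i]) for i in items})    # L1
    L = set(Lk)
    k = 2
    while Lk:
        # join step (unordered version): merge frequent (k-1)-itemsets sharing k-2 items
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # prune step: drop candidates having an infrequent (k-1)-subset
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        Lk = frequent(Ck)
        L |= Lk
        k += 1
    return L

# Example 6 below: three transactions and minsup = 2 (absolute value)
data = [{1, 2, 3, 5}, {2, 3, 4}, {3, 4, 5}]
print(sorted(tuple(sorted(s)) for s in apriori(data, minsup=2)))
# [(2,), (2, 3), (3,), (3, 4), (3, 5), (4,), (5,)]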

Florin Radulescu, Note de curs

38 DMDW-3
Example 5
Consider again the set of six transactions in
Example 2:
Doc1 = {rule, tree, classification}
Doc2 = {relation, tuple, join, algebra,
recommendation}
Doc3 = {variable, loop, procedure, rule}
Doc4 = {clustering, rule, tree, recommendation}
Doc5 = {join, relation, selection, projection,
classification}
Doc6 = {rule, tree, recommendation}
and a minimum support of 50% (minsup=3).
Florin Radulescu, Note de curs

39 DMDW-3
Step 1

 At the first scan of the transaction dataset T the


support for each item is computed:
rule 4             recommendation 3
tree 3             variable 1
classification 2   loop 1
relation 2         procedure 1
tuple 1            clustering 1
join 2             selection 1
algebra 1          projection 1

With minsup = 3, L1 = { {rule}, {tree}, {recommendation} }.

Florin Radulescu, Note de curs

40 DMDW-3
Step 2

Considering:
rule < tree < recommendation
From the join C2 = { {rule, tree}, {rule,
recommendation}, {tree, recommendation} }.
The prune step does not modify C2.
The second scan of the transaction dataset
leads to the following pair support values:

Florin Radulescu, Note de curs

41 DMDW-3
Step 2
 Step 2 {rule, tree} 3
{rule, recommendation} 2
{tree, recommendation} 2

 The only frequent pair is {rule, tree}. L2 = { {rule,


tree} }.
 Step 3. Because L2 has a single element, C3 = ∅, so L3 = ∅, and the process stops. L = L1 ∪ L2 = { {rule}, {tree}, {recommendation}, {rule, tree} }. If we consider only maximal itemsets, L = { {recommendation}, {rule, tree} }.
Florin Radulescu, Note de curs

42 DMDW-3
Example 6
Consider the transaction dataset {(1, 2, 3, 5), (2, 3,
4), (3, 4, 5)} and the minsup s = 50% (or s = 3/2;
because s must be an integer s = 2)
C1 = {1, 2, 3, 4, 5}
L1 = {2, 3, 4, 5}
C2 = {(2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5) }
L2 = {(2, 3), (3, 4), (3, 5)}
After the join step C3 = {(3, 4, 5)} - obtained by
joining (3, 4) and (3, 5).
In the prune step (3, 4, 5) is removed because
its subset (4, 5) is not in L2
Florin Radulescu, Note de curs

43 DMDW-3
Example 6
 After the prune step C3 = ∅, so L3 = ∅, and the process stops. L = L1 ∪ L2 = {(2), (3), (4), (5), (2, 3), (3, 4), (3, 5)}, or as maximal itemsets L = L2.
 The rules generated from these itemsets are:
2 → 3, 3 → 2, 3 → 4, 4 → 3, 3 → 5, 5 → 3.
 The support of these rules is at least 50%.
 Considering a minconf of 80%, only 3 rules have a confidence greater than or equal to minconf: 2 → 3, 4 → 3, 5 → 3 (with conf = 2/2 = 100%).
 The rules having 3 in the antecedent have a confidence of 67% (because sup(3) = 3 and sup(2, 3) = sup(3, 4) = sup(3, 5) = 2).
Florin Radulescu, Note de curs

44 DMDW-3
Apriori summary

 Apriori algorithm performs a level-wise search.


If the largest frequent itemset has k items, the
algorithm must perform k passes over the data
(most probably stored on disk) for finding all
frequent itemsets.
In many implementations the maximum number
of passes may be specified, leading to frequent
itemsets with the same maximum dimension.

Florin Radulescu, Note de curs

45 DMDW-3
Road Map

Frequent itemsets and rules


Apriori algorithm
FP-Growth
Data formats
Class association rules
Sequential patterns. GSP algorithm

Florin Radulescu, Note de curs

46 DMDW-3
FP-Growth

The FP-Growth (Frequent Pattern Growth) algorithm performs frequent itemset discovery without candidate generation. It has a two-step approach:
Step 1: Build the FP-tree. This requires only 2 passes over the dataset.
Step 2: Extract frequent itemsets from the FP-tree.

Florin Radulescu, Note de curs

47 DMDW-3
Build the FP-tree

 At the first pass over the data frequent items are


discovered.
All other items (infrequent items) are discarded:
transactions will contain only frequent items.
Also, these items are ordered by their support in
decreasing order.

Florin Radulescu, Note de curs

48 DMDW-3
Build the FP-tree

At the second pass over the data the FP-tree is


built:
FP-Growth considers an ordered set of items
(frequent items ordered by their support).
Each transaction is written with items in that
order.
The algorithm reads a transaction at a time and
adds a new branch to the tree, branch
containing as nodes the transaction items.
Florin Radulescu, Note de curs

49 DMDW-3
Build the FP-tree

Pass 2 - cont.:
Each node has a counter.
If two transactions have the same prefix, the two branches overlap on the nodes of the common prefix and the counters of those nodes are incremented.
Also, nodes holding the same item are linked together by node-links (paths orthogonal to the tree branches).
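A minimal Python sketch of the two passes (item counting, then tree insertion); node-links and the mining phase are omitted and the names are illustrative:

from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                    # item -> FPNode

def build_fp_tree(transactions, minsup):
    # Pass 1: count items, keep the frequent ones, order by decreasing support
    counts = Counter(i for t in transactions for i in t)
    frequent = {i: c for i, c in counts.items() if c >= minsup}
    rank = {i: r for r, i in enumerate(sorted(frequent, key=lambda x: (-frequent[x], x)))}

    # Pass 2: insert every transaction (filtered and reordered) as a branch
    root = FPNode(None)
    for t in transactions:
        node = root
        for i in sorted((i for i in t if i in frequent), key=rank.get):
            node = node.children.setdefault(i, FPNode(i, node))
            node.count += 1                   # shared prefixes only increment counters
    return root

def show(node, depth=0):
    if node.item is not None:
        print("  " * depth + f"{node.item}:{node.count}")
        depth += 1
    for child in node.children.values():
        show(child, depth)

# Example 7 below: five transactions, minsup = 3
data = [{'a','b','c'}, {'a','b','c','d'}, {'a','b','f'}, {'a','b','d'}, {'c','d','e'}]
show(build_fp_tree(data, minsup=3))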

Florin Radulescu, Note de curs

50 DMDW-3
Example 7

Let’s consider the following transaction set


and minsup=3

TID Items
1 a, b, c
2 a, b, c, d
3 a, b, f
4 a, b, d
5 c, d, e

Florin Radulescu, Note de curs

51 DMDW-3
Example 7

 Item counts. Only a, b, c, d are frequent:

Item Support
a 4
b 4
c 3
d 3
e 1
f 1

Florin Radulescu, Note de curs

52 DMDW-3
Example 7

 After discarding nonfrequent items and


ordering descending using the item
support the transaction dataset is the
following. Note that the order is a, b, c, d.
TID Items
1 a, b, c
2 a, b, c, d
3 a, b
4 a, b, d
5 c, d
Florin Radulescu, Note de curs

53 DMDW-3
Example 7
The FP-tree built from these transactions (header table: a:4, b:4, c:3, d:3), with shared prefixes merged and counters accumulated:

null
  a:4
    b:4
      c:2
        d:1
      d:1
  c:1
    d:1

Florin Radulescu, Note de curs

54 DMDW-3
Extract frequent itemsets
 After building the FP-tree the algorithm
starts to build partial trees (called
conditional FP-trees) ending with a given
item (a suffix).
The item is not present in the tree but all
frequent itemsets generated from that
conditional tree will contain that item.
In building the conditional FP-tree, non-
frequent items are skipped (but the branch
remains if there are still nodes on it).
Florin Radulescu, Note de curs

55 DMDW-3
Extract frequent itemsets

For the previous example trees ending


with d, c, b and a may be built.
For each tree the dotted path is used
for starting points and the algorithm
goes up propagating the counters (with
sums where two branches go to the
same node).

Florin Radulescu, Note de curs

56 DMDW-3
d conditional FP-tree
Following the d node-links upwards, the d-conditional pattern base is {a, b, c}:1, {a, b}:1 and {c}:1. The resulting d-conditional FP-tree (item supports: a:2, b:2, c:2) is:

null
  a:2
    b:2
      c:1
  c:1
Florin Radulescu, Note de curs

57 DMDW-3
c conditional FP-tree
 Because all items in the d-conditional FP-tree have a support below minsup, no itemset containing d is frequent.
 The situation is the same for the c-conditional FP-tree (item supports: a:2, b:2):

null
  a:2
    b:2
Florin Radulescu, Note de curs

58 DMDW-3
b conditional FP-tree
 The b-conditional FP-tree contains a single item:

null
  a:4

 Because the support of a is above minsup, {a, b} is a frequent itemset.
 In fact, {a, b} is the only frequent itemset with more than one item produced by the algorithm (there are no other frequent itemsets in the given dataset for minsup = 3).

Florin Radulescu, Note de curs

59 DMDW-3
Results
 If there are more than one item with support above or
equal minsup in a conditional FP-tree then the algorithm
is run again against the conditional FP-tree to find
itemsets with more than two items.
 For example, if the minsup=2 then from the c conditional
FP-tree the algorithm will produce {a, c} and {b, c}. Then
the same procedure may be run against this tree for
suffix bc, obtaining {a, b, c}. Also from the d conditional
FP-tree first {c, d}, {b, d} and {a, d} are obtained, and
then, for the suffix bd, {a, b, d} is obtained.
 Suffix cd leads to infrequent items in the conditional FP-tree, and suffix ad produces {a, d}, already obtained.

Florin Radulescu, Note de curs

60 DMDW-3
Road Map

Frequent itemsets and rules


Apriori algorithm
FP-Growth
Data formats
Class association rules
Sequential patterns. GSP algorithm

Florin Radulescu, Note de curs

61 DMDW-3
Data formats

Table format
In this case a dataset is stored in a two columns
table:
Transactions(Transaction-ID, Item) or
T(TID, Item)
where all the lines of a transaction have the same
TID and the primary key contains both columns
(so T does not contain duplicate rows).

Florin Radulescu, Note de curs

62 DMDW-3
Data formats
Text file format
 In that case the dataset is a textfile containing a
transaction per line. Each line may contain a
transaction ID (TID) as the first element or this TID
may be missing, the line number being a virtual TID.
Example 8:
10 12 34 67 78 45 89 23 67 90 line 1
789 12 45 678 34 56 32 line 2
........
 Also in this case any software package must either
have a native textfile input option or must contain a
conversion module from text to the needed format
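A small Python sketch for reading this text file format into a list of transactions (the file name and the presence of a TID as the first token are assumptions):

def read_transactions(path, has_tid=True):
    # One whitespace-separated transaction per line; the optional TID is dropped.
    transactions = []
    with open(path) as f:
        for line in f:
            items = line.split()
            if not items:
                continue                  # skip blank lines
            if has_tid:
                items = items[1:]         # first token is the transaction ID
            transactions.append(set(items))
    return transactions

# e.g. read_transactions("baskets.txt") -> [{'12', '34', '67', '78', ...}, ...]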
Florin Radulescu, Note de curs

63 DMDW-3
Data formats
Custom format
 Many data mining packages use a custom format for the input
data.
 An example is the ARFF format used by Weka, presented below. Weka (Waikato Environment for Knowledge Analysis) is a popular open source suite of machine learning software developed at the University of Waikato, New Zealand.
 ARFF stands for Attribute-Relation File Format. An .arff file is an ASCII file containing a table (also called a relation). The file has two parts:
 A Header part containing the relation name, the list of
attributes and their types.
 A Data part containing the row values of the relation,
comma separated.
Florin Radulescu, Note de curs

64 DMDW-3
ARFF example
% 1. Title: Iris Plants Database
%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
% (c) Date: July, 1988
%
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
Florin Radulescu, Note de curs
4.9,3.1,1.5,0.1,Iris-setosa
65 DMDW-3
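Weka reads .arff files natively; from Python, one way to load the file above (assuming SciPy and pandas are installed and the relation is saved as iris.arff) is:

from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff("iris.arff")   # parses the Header and Data parts
df = pd.DataFrame(data)                   # nominal values are returned as bytes, e.g. b'Iris-setosa'
print(meta.names())                       # ['sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'class']
print(df.head())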
Road Map

Frequent itemsets and rules


Apriori algorithm
FP-Growth
Data formats
Class association rules
Sequential patterns. GSP algorithm

Florin Radulescu, Note de curs

66 DMDW-3
Class association rules (CARs)
 As for classical association rules, the model for CARs considers a set of items I = {i1, i2, …, in} and a set of transactions T = {t1, t2, …, tm}. The difference is that each transaction is labeled with a class label c, where c ∈ C, with C containing all class labels and C ∩ I = ∅.
 A class association rule is a construction with the following syntax:
X → y
where X ⊆ I and y ∈ C.
 The definition of the support and confidence for a
class association rule is the same with the case of
association rules.

Florin Radulescu, Note de curs

67 DMDW-3
Example 10
 Consider the set of six transactions in Example 2, now
labeled with class labels from C = {database,
datamining, programming}:
Doc1 {rule, tree, classification} datamining
Doc2 {relation, tuple, join, algebra, recommendation} database
Doc3 {variable, loop, procedure, rule} programming
Doc4 {clustering, rule, tree, recommendation} datamining
Doc5 {join, relation, selection, projection, classification} database
Doc6 {rule, tree, recommendation} datamining

Florin Radulescu, Note de curs

68 DMDW-3
Example 10
Then the CARs:
rule → datamining;
recommendation → database
has:
sup(rule → datamining) = 3/6 = 50%,
conf(rule → datamining) = 3/4 = 75%.
sup(recommendation → database) = 1/6 ≈ 17%,
conf(recommendation → database) = 1/3 ≈ 33%
For a minsup=50% and a minconf=50% the first rule
stands and the second is rejected.
Florin Radulescu, Note de curs

69 DMDW-3
Mining CARs
 Algorithm for mining CARs using a modified Apriori
algorithm (see [Liu 11]):
 At the first pass over the data the algorithm computes F1, the set of CARs with a single item on the left side satisfying the given minsup and minconf.
 At step k, Ck is built from Fk-1 and then, passing
through the data and counting for each member of Ck
the support and the confidence, Fk is determined.
 Candidate generation is almost the same as for
association rules with the only difference that in the
join step only CARs with the same class in the right
side are joined.

Florin Radulescu, Note de curs

70 DMDW-3
Candidates generation

Ck = ∅                                   // starts with an empty set
forall f1, f2 ∈ Fk-1                     // for each pair of frequent CARs
     f1 = {i1, …, ik-2, ik-1} → y        // only the last item
     f2 = {i1, …, ik-2, i'k-1} → y       // is different
     ik-1 < i'k-1 do                     // and the class is the same
    c = {i1, …, ik-1, i'k-1} → y;        // join step
    Ck = Ck ∪ {c};                       // add the new candidate
    for each (k-1)-subset s of {i1, …, ik-1, i'k-1} do
        if (s → y ∉ Fk-1) then
            Ck = Ck - {c};               // prune step
    endfor
endfor

Florin Radulescu, Note de curs

71 DMDW-3
Road Map

Frequent itemsets and rules


Apriori algorithm
FP-Growth
Data formats
Class association rules
Sequential patterns. GSP algorithm

Florin Radulescu, Note de curs

72 DMDW-3
Sequential patterns model
Itemset: a set of n distinct items
I = {i1, i2, …, in }
Event: a non-empty collection of items; we can
assume that items are in a given (e.g.
lexicographic) order: (i1,i2 … ik)
Sequence : an ordered list of events: < e1 e2 …
em >
Length of a sequence: the number of items in
the sequence
Example: <AM, CDE, AE> has length 7
Florin Radulescu, Note de curs

73 DMDW-3
Sequential patterns model
 Size of a sequence: the number of itemsets in the
sequence
Example: <AM, CDE, AE> has size 3
 K-sequence : sequence with k items, or with
length k
Example: <B, AC> is a 3-sequence
 Subsequence and supersequence: <e1 e2 … eu> is a subsequence of (or included in) <f1 f2 … fv> (and the latter is a supersequence of, or contains, the former) if there are integers 1 ≤ j1 < j2 < … < ju-1 < ju ≤ v such that e1 ⊆ fj1, e2 ⊆ fj2, …, eu ⊆ fju.
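A small Python sketch of this containment test (the same check is used by GSP when counting candidate support); events are represented as sets:

def is_subsequence(sub, seq):
    # True if every event of sub is a subset of some event of seq,
    # with the events matched in increasing order of position.
    j = 0
    for event in sub:
        while j < len(seq) and not set(event) <= set(seq[j]):
            j += 1
        if j == len(seq):
            return False
        j += 1                            # the next event must match strictly later
    return True

print(is_subsequence([{'A'}, {'B', 'C'}], [{'A', 'B'}, {'E'}, {'A', 'B', 'C', 'D'}]))  # True
print(is_subsequence([{'A', 'B'}, {'C'}], [{'A', 'B', 'C'}]))                          # False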
Florin Radulescu, Note de curs

74 DMDW-3
Sequential patterns model
Sequence database X: a set of sequences
Frequent sequence (or sequential pattern):
a sequence included in more than s
members of the sequence database X;
 s is the user-specified minimum support.
The number of sequences from X containing
a given sequence is called the support of
that sequence.
 So, a frequent sequence is a sequence with
a support at least s where s is the minsup
specified by the user.
Florin Radulescu, Note de curs

75 DMDW-3
Example 11
 <A, BC> is a subsequence of <AB, E, ABCD>
 <AB, C> is not a subsequence of <ABC>
 Consider a minsup=50% and the following sequence
database:
Sequence ID Sequence

1 <A, B, C>

2 <AB, C, AD>

3 <ABC, BCE>

4 <AD, BC, AE>

5 <B, E>

Florin Radulescu, Note de curs

76 DMDW-3
Example 11

 The frequent sequences (support at least 50%


means 3 of 5) are:

1-sequences <A>, <B>, <C>, <E>

2-sequences <A, B>, <A, C>, <B, C>, <B, E>

There is no 3-sequence (or upper) with support


at least 50%.

Florin Radulescu, Note de curs

77 DMDW-3
Algorithms

 Apriori
GSP (Generalized Sequential Pattern)
FreeSpan (Frequent pattern-projected
Sequential pattern mining)
PrefixSpan (Prefix-projected Sequential
pattern mining)
SPADE (Sequential PAttern Discovery
using Equivalence classes)
Florin Radulescu, Note de curs

78 DMDW-3
GSP Algorithm

 Similar with Apriori:


Algorithm GSP(I, X, minsup)                  // n = |X| = number of sequences
  C1 = I                                     // initial candidates: the single items
  L1 = {<{f}> | f ∈ C1, f.count/n ≥ minsup}; // first pass over X
  for (k = 2; Lk-1 ≠ ∅; k++) do              // loop until Lk-1 is empty
      Ck = candidate-generation(Lk-1);
      foreach sequence s ∈ X do              // one pass over X per level
          foreach c ∈ Ck do
              if c is-contained-in s then
                  c.count++;
          endfor
      endfor
      Lk = {c ∈ Ck | c.count/n ≥ minsup}
  endfor
  return ∪k Lk;
Florin Radulescu, Note de curs

79 DMDW-3
GSP Algorithm

 Candidate generation is made in a join and


prune manner.
At the join step two sequences f1 and f2 from Lk-1
are joined if removing the first item from f1 and
the last item from f2 the result is the same.
The joined sequence is obtained by adding the
last item of f2 to f1, with the same status
(separate element or part of the last element of
f1).
Florin Radulescu, Note de curs

80 DMDW-3
Example 12

 <AB, CD, E> join with <B, CD, EF>:


<AB, CD, EF>
<AB, CD, E> join with <B, CD, E, F>:
<AB, CD, E, F>

Florin Radulescu, Note de curs

81 DMDW-3
Summary
This third course presented:
What are frequent itemsets and rules and their
relationship
Apriori and FP-growth algorithms for discovering
frequent itemsets.
Data formats for discovering frequent itemsets
What class association rules are and how they can be mined
An introduction to sequential patterns and the GSP
algorithm
 Next week: Supervised learning – part 1.
Florin Radulescu, Note de curs

82 DMDW-3
References
[Agrawal, Imielinski, Swami 93] R. Agrawal; T. Imielinski; A. Swami:
Mining Association Rules Between Sets of Items in Large Databases",
SIGMOD Conference 1993: 207-216, (http://rakesh.agrawal-
family.com/papers/sigmod93assoc.pdf)
[Agrawal, Srikant 94] Rakesh Agrawal and Ramakrishnan Srikant. Fast
algorithms for mining association rules in large databases. Proceedings
of the 20th International Conference on Very Large Data Bases, VLDB,
pages 487-499, Santiago, Chile, September 1994
(http://rakesh.agrawal-family.com/papers/vldb94apriori.pdf)
[Srikant, Agrawal 96] R. Srikant, R. Agrawal: "Mining Sequential
Patterns: Generalizations and Performance Improvements", to appear
in Proc. of the Fifth Int'l Conference on Extending Database Technology
(EDBT), Avignon, France, March 1996, (http://rakesh.agrawal-
family.com/papers/edbt96seq.pdf)
[Liu 11] Bing Liu, 2011. Web Data Mining, Exploring Hyperlinks,
Contents, and Usage Data, Second Edition, Springer, chapter 2.
[Ullman 03-09] Jeffrey Ullman, Data Mining Lecture Notes, 2003-2009,
web page: http://infolab.stanford.edu/~ullman/mining/mining.html
Florin Radulescu, Note de curs

83 DMDW-3
References
[Wikipedia] Wikipedia, the free encyclopedia, en.wikipedia.org
[Whitehorn 06] Mark Whitehorn, The parable of the beer and diapers,
web page: http://www.theregister.co.uk/2006/08/15/beer_diapers/
[Silverstein et al. 00] Silverstein, C., Brin, S., Motwani, R., Ullman, J. D.
2000. Scalable techniques for mining causal structures. Data Mining
Knowl. Discov. 4, 2–3, 163–192., www.vldb.org/conf/1998/p594.pdf
[Verhein 08] Florian Verhein, Frequent Pattern Growth (FP-Growth)
Algorithm, An Introduction, 2008,
http://www.florian.verhein.com/teaching/2008-01-09/fp-growth-
presentation_v1%20(handout).pdf
[Pietracaprina, Zandolin 03] Andrea Pietracaprina and Dario Zandolin:
Mining Frequent Itemsets using Patricia Tries,
[Zhao, Bhowmick 03] Qiankun Zhao, Sourav S. Bhowmick, Sequential
Pattern Mining: A Survey, Technical Report, CAIS, Nanyang
Technological University, Singapore, No. 2003118 , 2003,
(http://cs.nju.edu.cn/zhouzh/zhouzh.files/course/dm/reading/reading04/
zhao_techrep03.pdf)
Florin Radulescu, Note de curs

84 DMDW-3
Supervised Learning
- Part 1 -
Road Map

What is supervised learning


Evaluation of classifiers
Decision trees. ID3 and C4.5
Rule induction systems
Summary

Florin Radulescu, Note de curs

2 DMDW-4
Objectives

Supervised learning is one of the most


studied subdomains of Data Mining
It is also part of Machine learning, a
branch of Artificial Intelligence.
It means that a new model can be built starting from past experience (data)

Florin Radulescu, Note de curs

3 DMDW-4
Definitions

 Supervised learning includes:


 Classification: results are discrete
values (goal: identify group/class
membership).
 Regression: results are continuous
or ordered values (goal: estimate or
predict a response).

Florin Radulescu, Note de curs

4 DMDW-4
Regression
Regression comes from statistics.
Meaning: predicting a value of a given
continuous variable based on the values of other
variables, assuming a linear or nonlinear model
of dependency ([Tan, Steinbach, Kumar 06]).
Used in prediction and forecasting - its use
overlaps machine learning.
Regression analysis is also used to understand
the relationships between independent variables
and dependent variables and can be used to
infer causal relationships between them.
Florin Radulescu, Note de curs

5 DMDW-4
Example

Linear regression example


(from http://en.wikipedia.org/wiki/File:Linear_regression.svg)

In this example: for new values on Ox axis the Oy


value can be predicted using the regression
function (the red line)
Florin Radulescu, Note de curs

6 DMDW-4
Classification
Input:
 A set of k classes C = {c1, c2, …, ck}
 A set of n labeled items D = {(d1, ci1), (d2, ci2), …, (dn, cin)}. The items are d1, …, dn, each item dj being labeled with a class cij ∈ C. D is called the training set.
 For calibration of some algorithms, a validation set is
also required. This validation set contains also labeled
items not included in the training set.
Output:
 A model or method for classifying new items.
The set of new items that will be classified using this
model/method is called the test set
Florin Radulescu, Note de curs

7 DMDW-4
Example. Model: decision tree

The result for the example:


Felix Yes Yes No No No Yes ?????

will be C0 Florin Radulescu, Note de curs

8 DMDW-4
Input data format

 In most of the cases the training set D (as well


as the validation set and the test set) may be
represented by a table having a column for each
attribute of D and the last column contains the
class label.
A very used example is Play-tennis where
weather conditions are used to decide if players
may or may not start a new game.
This dataset will be used also in this course.
Florin Radulescu, Note de curs

9 DMDW-4
Play tennis dataset
Day Outlook Temperature Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Florin Radulescu, Note de curs

10 DMDW-4
Approaches
 The interest in supervised learning is shared
between:
statistics,
data mining and
artificial intelligence
 There are a wide range of problems solved
by supervised learning techniques, the
number of algorithms or methods in this
category is very large.
There are many approaches and in this
course (and the next) only some of them are
covered, as follows:
Florin Radulescu, Note de curs

11 DMDW-4
Decision trees
 Decision trees: an example is the UPU
decision tree as described in the previous
example.
 In a decision tree non-leaf nodes contain
decisions based on the attributes of the
examples (attributes of argument di) and
each leaf ci is a class from C.
 ID3 and C4.5 - two well known algorithms
for building decision trees - are presented
in this lesson .
Florin Radulescu, Note de curs

12 DMDW-4
Rule induction systems

 Rule induction systems: from every


decision tree a set of rules can be inferred,
rules that can replace the decision tree.
 Consider the following decision tree for
deciding if a student will be allowed or not
in the students residence:

Florin Radulescu, Note de curs

13 DMDW-4
Rule induction systems

Residence

Bucharest Other

Class = No Fails

>3 <=3

Class = No Class = Yes

The decision tree can be replaced by the


following set of rules (one for each path):
{ Residence = Bucharest → Class = No
Residence = Other, Fails > 3 → Class = No
Residence = Other, Fails <= 3 → Class = Yes }
Florin Radulescu, Note de curs

14 DMDW-4
Rule induction systems

 But rules can be obtained not only from


decision trees but also directly from the
training set.
 This lesson presents also some methods
for doing this.

Florin Radulescu, Note de curs

15 DMDW-4
Classification using association rules

Classification using association rules:


Class association rules can be used in
building classifiers.
Some methods for performing this task will
be presented next week.

Florin Radulescu, Note de curs

16 DMDW-4
Naïve Bayesian classification

 Naïve Bayesian classification: for every


example di the method computes the
probability for each class cj from C.
 Classification is made picking the most
probable class for each example.
 The word naïve is used because some
simplifying assumptions are made.

Florin Radulescu, Note de curs

17 DMDW-4
Support vector machines

 Support vector machines: this method is


used for binary classification.
 Examples are classified in only two
classes: either positive or negative.
 It is a very efficient method and may be
used recursively to build a classifier with
more than two classes.

Florin Radulescu, Note de curs

18 DMDW-4
KNN

 K-nearest neighbor: a very simple but


powerful method for classifying examples
based on the labels (classes) of their
neighbors.

Florin Radulescu, Note de curs

19 DMDW-4
Ensemble methods

 Ensemble methods: Random Forest,


Bagging and Boosting.
 In these cases more than one classifier is
built and the final classification is made by
aggregating their results.
 For example, a Random Forest consists
in many decision trees and the output
class is the mode of the classes computed
by individual trees.
Florin Radulescu, Note de curs

20 DMDW-4
Road Map

What is supervised learning


Evaluation of classifiers
Decision trees. ID3 and C4.5
Rule induction systems
Summary

Florin Radulescu, Note de curs

21 DMDW-4
Accuracy and error rate
 For estimating the efficiency of a classifier several
measures may be used:
 Accuracy (or predictive accuracy) is the proportion of correctly classified test examples:
Accuracy = (number of correctly classified test examples) / (total number of test examples)
 Error rate is the proportion of incorrectly classified test examples:
Error rate = (number of incorrectly classified test examples) / (total number of test examples) = 1 - Accuracy
Florin Radulescu, Note de curs

22 DMDW-4
Other measures
 In some cases where examples are classified in only
two classes (called Positive and Negative) other
measures can be also defined.
 Consider the confusion matrix containing the number of
correctly and incorrectly classified examples (Positive
examples as well as Negative examples):

Classified as Positive Classified as Negative


Actual positive TP = True Positive FN = False Negative
Actual negative FP = False Positive TN = True Negative

Florin Radulescu, Note de curs

23 DMDW-4
Other measures
 TP = the number of correct classifications for Positive examples.
 TN = the number of correct classifications for Negative
examples.
 FP = the number of incorrect classifications for Negative
examples.
 FN = the number of incorrect classifications for Positive
examples.
 Precision is the proportion of the correctly classified
Positive examples in the set of examples classified as
Positive:
Precision = TP / (TP + FP)

Florin Radulescu, Note de curs

24 DMDW-4
Other measures

 Recall (or sensitivity) is the proportion of


correctly classified Positive examples in the set
of all Positive examples, or the rate of
recognition for positive examples:
Recall = TP / (TP + FN)

Specificity is the rate of recognition of negative


examples:
Specificity = TN / (TN + FP)
Florin Radulescu, Note de curs

25 DMDW-4
Other measures
 The accuracy formula can be rewritten as:
Accuracy = (TP + TN) / (Pos + Neg)
where Pos and Neg are the total numbers of Positive and Negative examples.

Florin Radulescu, Note de curs

26 DMDW-4
Other measures

 Precision and recall are usually used together


because for some test examples using only one
of them may lead to incorrect judgment on the
performances of a classifier.
 If a set contains 100 Positive examples and 100
Negative examples and the classifier has the
following result:
Classified as Positive Classified as Negative
Actual positive 30 70
Actual negative 0 100

Florin Radulescu, Note de curs

27 DMDW-4
Other measures
 Then precision p = 100% but recall r = 30%.
 Combining precision with recall by their harmonic mean, the F1-score is obtained:
F1 = 2 * p * r / (p + r)
 For the above example F1-score = 46%; in general the F1-score is closer to the smaller of precision and recall.
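A small Python sketch computing these measures from the confusion matrix counts of the example above:

def binary_metrics(tp, fn, fp, tn):
    # Accuracy, precision, recall and F1-score from the confusion matrix counts.
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# 100 Positive and 100 Negative examples: TP = 30, FN = 70, FP = 0, TN = 100
acc, p, r, f1 = binary_metrics(tp=30, fn=70, fp=0, tn=100)
print(acc, p, r, f1)      # 0.65, 1.0, 0.3, ~0.46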
Florin Radulescu, Note de curs

28 DMDW-4
Evaluation methods

 Evaluation methods use a data set D with


labeled examples.
 This set is split in several subsets and
these subsets become training/ test/
validation sets.

 Note: Evaluation refers to the classifier


building method/algorithm.

Florin Radulescu, Note de curs

29 DMDW-4
The holdout method
In this case the data set D is split in two: a
training set and a test set.
The test set is also called holdout set (from
here the name of the method).
The classifier obtained using the training set
is used for classification of examples from the
test set.
Because these examples are also labeled, accuracy, precision, recall and other measures can then be computed and, based on them, the classifier is evaluated.
Florin Radulescu, Note de curs

30 DMDW-4
Cross validation method
There are several versions of cross validation:
1. k-fold cross validation. The data set D is split in
k disjoint subsets with the same size. For each
subset a classifier is built and run using that
subset as test set and the reunion of all k-1
remaining subsets as training set. In this way k
values for accuracy are obtained (one for each
classifier). The mean of these values is the final
accuracy. The usual value for k is 10.
2. 2-fold cross validation. For k=2 the above
method has the advantage of using large sets
both for training and testing.
Florin Radulescu, Note de curs

31 DMDW-4
Cross validation method

3. Stratified cross validation. Is a variation of k-fold


cross validation. Each fold has the same distribution of
labels.
 For example, for Positive and Negative examples each
fold contains roughly the same proportion of Positive and
Negative examples.
4. Leave one out cross validation. When D contains
only a small number of examples a special k-fold cross
validation may be used: each example becomes the
test set and all other examples the training set.
 Accuracy for each classifier is either 100% or 0%. The
mean of all these values is the final accuracy.
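As an illustration (scikit-learn is not part of the course material), a 10-fold stratified cross validation of a decision tree classifier can be run as follows:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# cv=10 -> 10-fold cross validation; for classifiers scikit-learn stratifies the
# folds by default, and the mean of the 10 accuracies evaluates the method.
scores = cross_val_score(clf, X, y, cv=10)
print(scores.mean(), scores.std())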

Florin Radulescu, Note de curs

32 DMDW-4
Bootstrap method

Is part of resampling methods and consists in


getting the training set from the data set by
sampling with replacement.
The instances which are not picked in the
training set are used as test set.
For example, if D has 1000 labeled examples,
by picking randomly an example 1000 times
gives us the training set. In this training set some
examples are picked more than once.
Florin Radulescu, Note de curs

33 DMDW-4
Bootstrap method

Statistically 63.2% of the examples in D are


picked from the training set and 36.8% are not.
These 36.8% becomes the test set.
After building a classifier and run this classifier
on the test set the accuracy is determined and
the classifier building method may be evaluated
based on its value.
More on this method and other evaluation
techniques can be found in [Sanderson 08].
Florin Radulescu, Note de curs

34 DMDW-4
Why 63.2?
 From Data Mining: Concepts and Techniques*: Suppose we
are given a data set of d tuples. “Where does the figure,
63.2%, come from?” Each tuple has a probability of 1/d of
being selected, so the probability of not being chosen is (1 - 1/d).
 We have to select d times, so the probability that a tuple will not be chosen during this whole time is (1 - 1/d)^d.
 If d is large, this probability approaches e^-1 ≈ 0.368. Thus, 36.8% of the tuples will not be selected for training and thereby end up in the test set, and the remaining 63.2% will form the training set.
* Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Second
Edition, 2006, Morgan Kaufman, page 365.
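The 63.2% figure can also be checked empirically with a few lines of Python (illustrative sketch):

import random

d = 10000
sample = [random.randrange(d) for _ in range(d)]   # bootstrap training set (sampling with replacement)
in_training = len(set(sample)) / d                 # fraction of distinct examples picked
print(in_training, 1 - in_training)                # typically about 0.632 and 0.368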
Florin Radulescu, Note de curs

35 DMDW-4
Scoring and ranking
 Sometimes the user is interested only in a single class
(called the Positive class for short), for example
buyers of a certain type of gadgets or players of a
certain game.
 If the classifier returns a probability estimate (PE) for
each example in the test case to belong to the Positive
class (indicating the likelihood to belong to that class)
we can score each example by the value of this PE.
 After that we can rank all examples based on their PE
and draw a lift curve.
 The classifier method is good if the lift curve is way
above the random line in the lift chart – see example.

Florin Radulescu, Note de curs

36 DMDW-4
Scoring and ranking
 The lift curve is drawn by dividing the ranked examples into several bins and counting the actual Positive examples in each bin.
This count gives the value plotted on the lift curve.
Remember that the evaluation of the
classification methods uses a test set with
labeled examples.
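A small Python sketch of this binning step (illustrative; pe_scores are the probability estimates and actual are the 0/1 labels of the test examples):

def lift_points(pe_scores, actual, n_bins=10):
    # Rank by PE, split into bins and report the cumulative fraction of
    # actual Positive examples reached after each bin.
    ranked = sorted(zip(pe_scores, actual), key=lambda x: x[0], reverse=True)
    total_pos = sum(actual)
    bin_size = len(ranked) // n_bins
    points, reached = [], 0
    for b in range(n_bins):
        chunk = ranked[b * bin_size:(b + 1) * bin_size]
        reached += sum(label for _, label in chunk)
        points.append((b + 1, reached / total_pos))
    return points

# plotting these points against the random line (b / n_bins) gives the lift chart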
Florin Radulescu, Note de curs

37 DMDW-4
Example (from [Microsoft])

“The marketing department at Adventure


Works Cycles wants to create a targeted
mailing campaign.
From past campaigns, they know that a
10 percent response rate is typical.
They have a list of 10,000 potential
customers stored in a table in the
database.
Florin Radulescu, Note de curs

38 DMDW-4
Example (from [Microsoft])

Therefore, based on the typical response


rate, they can expect 1,000 of the potential
customers to respond.
However, the money budgeted for the
project is not enough to reach all 10,000
customers in the database.
Based on the budget, they can afford to
mail an advertisement to only 5,000
customers.
Florin Radulescu, Note de curs

39 DMDW-4
Example (from [Microsoft])

So, the marketing department has two


choices:
A. Randomly select 5,000 customers to
target (red line in the next figure)
B. Use a mining model to target the 5,000
customers who are most likely to respond
(blue line in the next figure)

Florin Radulescu, Note de curs

40 DMDW-4
Lift curve example

Source: http://www.saedsayad.com/model_evaluation_c.htm

Florin Radulescu, Note de curs

41 DMDW-4
Example (from [Microsoft])
If the company randomly selects 5,000
customers, they can expect to receive only
500 responses, based on the typical
response rate. This scenario is what
the random line in the lift chart represents.
However, if the marketing department uses a
mining model to target their mailing, they can
expect a larger response rate because they
can target those customers who are most
likely to respond.
Florin Radulescu, Note de curs

42 DMDW-4
Example (from [Microsoft])

If the model is perfect, it means that the


model creates predictions that are never
wrong, and the company could expect to
receive 1,000 responses by mailing to the
1,000 potential customers recommended
by the model (green line in the previous
figure).

Florin Radulescu, Note de curs

43 DMDW-4
Example (from [Microsoft])

This scenario is what the ideal line in the lift


chart represents.
The reality is that the mining model most likely
falls between these two extremes; between a
random guess and a perfect prediction.
Any improvement from the random guess is
considered to be lift.”

Florin Radulescu, Note de curs

44 DMDW-4
Road Map

What is supervised learning


Evaluation of classifiers
Decision trees. ID3 and C4.5
Rule induction systems
Summary

Florin Radulescu, Note de curs

45 DMDW-4
What is a decision tree
 A very common way to represent a classification
model or algorithm is a decision tree. Having a
training set D and a set of n example attributes A,
each labeled example in D is like: (a1 = v1, a2 = v2,
…, an = vn). Based on these attributes a decision
tree can be built having:
a. Internal nodes are attributes (with no path
containing twice the same attribute).
b. Branches refer to discrete values (one or more) or
intervals for these attributes. Sometimes more
complex conditions may be used for branching.

Florin Radulescu, Note de curs

46 DMDW-4
What is a decision tree
c. Leafs are labeled with classes. For each leaf
a support and a confidence may be
computed: support is the proportion of
examples matching the path from root to that
leaf and confidence is the classification
accuracy for examples matching that path.
When passing from decision trees to rules,
each rule has the same support and
confidence as the leaf from where it comes.
d. Any example match a single path of the tree
(so a single leaf or class).
Florin Radulescu, Note de curs

47 DMDW-4
Example

Outlook
  Sunny → Humidity
            High   → Class = No   (3/14, 3/3)
            Normal → Class = Yes  (2/14, 2/2)
  Overcast → Class = Yes          (4/14, 4/4)
  Rain → Wind
            Strong → Class = No   (2/14, 2/2)
            Weak   → Class = Yes  (3/14, 3/3)

Florin Radulescu, Note de curs

48 DMDW-4
Decision trees

 Numbers on the last line are the support and


the confidence associated with each leaf.
 For the same data set more than one decision
tree may be built.
 For example another Play tennis decision tree is
in the next figure (with less confidence than
previous tree):

Florin Radulescu, Note de curs

49 DMDW-4
Decision trees

Wind
  Strong → Class = No   (6/14, 3/6)
  Weak   → Class = Yes  (8/14, 6/8)

Florin Radulescu, Note de curs

50 DMDW-4
ID3

ID3 stands for Iterative Dichotomiser 3 and is an


algorithm for building decision trees introduced
by Ross Quinlan in 1986 (see [Quinlan 86]).
The algorithm constructs the decision tree in a
top-down manner choosing at each node the
‘best’ attribute for branching:
First a root attribute is chosen, building a
separate branch for each different value of the
attribute.
Florin Radulescu, Note de curs

51 DMDW-4
ID3
 The training set is also divided, each branch inheriting
the examples matching the attribute value of the
branch.
 Process repeats for each descendant until all
examples have the same class (in that case the node
becomes a leaf labeled with that class) or all attributes
have been used (the node also become a leaf labeled
with the mode value – the majority class).
 An attribute cannot be chosen twice on the same path;
from the moment it was chosen for a node it will never
be tested again for the descendants of that node.
Florin Radulescu, Note de curs

52 DMDW-4
Best attribute
 The essence of ID3 is how the ‘best’ attribute is discovered. The algorithm uses information theory, trying to increase the purity of the datasets from the parent node to its descendants.
 Let us consider a dataset D = {e1, e2, …, em} with examples labeled with classes from C = {c1, c2, …, cn}. The example attributes are A1, A2, …, Ap. The entropy of D can be computed as:
entropy(D) = - Σ(i=1..n) Pr(ci) · log2 Pr(ci)
Florin Radulescu, Note de curs

53 DMDW-4
Entropy
 If attribute Ak, having r distinct values, is considered for branching, it will partition D into r disjoint subsets D1, D2, …, Dr.
 The combined entropy of these subsets, computed as a weighted average of their entropies, is:
entropy(D, Ak) = Σ(j=1..r) (|Dj| / |D|) · entropy(Dj)
 All probabilities - Pr(ci) - involved in the above equations are determined by counting!
Florin Radulescu, Note de curs

54 DMDW-4
Information gain
 Because the purity of the datasets increases, entropy(D) is at least as large as entropy(D, Ak). The difference between them is called the information gain:
gain(D, Ak) = entropy(D) - entropy(D, Ak)
The ‘best’ attribute is the one with the highest gain.
Florin Radulescu, Note de curs

55 DMDW-4
Example
 For Play tennis dataset there are four attributes
for the root of the decision tree: Outlook,
Temperature, Humidity and Wind.
The entropy of the whole dataset (9 Yes and 5 No examples) is:
entropy(D) = - (9/14) · log2(9/14) - (5/14) · log2(5/14) ≈ 0.94

56 DMDW-4
For each attribute
In the same way, the weighted entropy is computed for each of the four attributes; for example:
entropy(D, Outlook) = (5/14) · 0.971 + (4/14) · 0 + (5/14) · 0.971 ≈ 0.69
Florin Radulescu, Note de curs

57 DMDW-4
Best attribute: Outlook
 The next table contains the values for entropy
and gain.
 The best attribute for the root node is Outlook,
with a maximum gain of 0.25:

Attribute entropy gain


Humidity 0.79 0.15
Wind 0.89 0.05
Temperature 0.91 0.03
Outlook 0.69 0.25
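The values in this table can be reproduced with a short Python sketch (illustrative code) computing the entropy and the gain for each attribute of the Play-tennis dataset:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    # information gain of splitting the dataset on the attribute at index attr
    total, n, weighted = entropy(labels), len(rows), 0.0
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        weighted += len(subset) / n * entropy(subset)
    return total - weighted

# columns: Outlook, Temperature, Humidity, Wind
rows = [("Sunny","Hot","High","Weak"), ("Sunny","Hot","High","Strong"),
        ("Overcast","Hot","High","Weak"), ("Rain","Mild","High","Weak"),
        ("Rain","Cool","Normal","Weak"), ("Rain","Cool","Normal","Strong"),
        ("Overcast","Cool","Normal","Strong"), ("Sunny","Mild","High","Weak"),
        ("Sunny","Cool","Normal","Weak"), ("Rain","Mild","Normal","Weak"),
        ("Sunny","Mild","Normal","Strong"), ("Overcast","Mild","High","Strong"),
        ("Overcast","Hot","Normal","Weak"), ("Rain","Mild","High","Strong")]
labels = ["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No"]

for i, name in enumerate(["Outlook", "Temperature", "Humidity", "Wind"]):
    print(name, round(gain(rows, labels, i), 2))
# Outlook has the highest gain (0.25), so it is chosen for the root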

Florin Radulescu, Note de curs

58 DMDW-4
Notes on and extensions of ID3

1. Because is a greedy algorithm it leads to a


local optimum.
2. Attributes with many values leads to a higher
gain. For solving this problem the gain may be
replaced with the gain ratio:

Florin Radulescu, Note de curs

59 DMDW-4
Notes on and extensions of ID3
3. Sometimes (when only few examples are associated
with leaves) the tree overfits the training data and
does not work well on test examples.
 To avoid overfitting the tree may be simplified by
pruning:
 Pre-pruning: growing is stopped before normal end. The
leaves are not 100% pure and are labeled with the majority
class (the mode value).
 Post-pruning: after running the algorithm some sub-trees
are replaced by leaves. Also in this case the labels are
mode values for the matching training examples. Post-
pruning is better because in pre-pruning is hard to
estimate when to stop.

Florin Radulescu, Note de curs

60 DMDW-4
Notes on and extensions of ID3
4. Some attribute A may be continuous. Values for
A may be partitioned into two intervals:
A ≤ t and A > t.
The value of t may be selected as follows:
Sort examples upon A
Pick the average of two consecutive values
where the class changes as candidate.
For each candidates found in previous step
compute the gain if partitioning is made using
that value. The candidate with the maximum
gain is considered for partitioning.
Florin Radulescu, Note de curs

61 DMDW-4
Notes on and extensions of ID3

In this way the continuous attribute is replaced


with a discrete one (two values, one for each
interval).
This attribute competes with the remaining
attributes for ‘best’ attribute.
The process repeats for each node because the
partitioning value may change from a node to
another.

Florin Radulescu, Note de curs

62 DMDW-4
Notes on and extensions of ID3
5. Attribute cost: some attributes are more expensive
than others (measured not only in money).
 It is better that lower-cost attributes to be closer to
the root than other attributes.
 For example, for an emergency unit it is better to
test the pulse and temperature first and only when
necessary perform a biopsy.
 This may be done by weighting the gain by the cost:

Florin Radulescu, Note de curs

63 DMDW-4
C4.5
C4.5 is the improved version of ID3, and was
developed also by Ross Quinlan (as well as
C5.0). Some characteristics:
Numeric (continuous) attributes are allowed
Missing values are handled sensibly
(see https://www.quora.com/In-simple-language-how-does-C4-5-deal-with-missing-values)
Post-pruning is used to deal with noisy data


The most important improvements from ID3
are:
1. The attributes are chosen based on gain-ratio
and not simply gain.
Florin Radulescu, Note de curs

64 DMDW-4
C4.5
The most important improvements from ID3
are:
2. Post pruning is performed in order to reduce the
tree size. The pruning is made only if it reduces
the estimated error. There are two prune
methods:
Sub-tree replacement: A sub-tree is replaced with a leaf
but each sub-tree is considered only after all its sub-
trees. This is a bottom-up approach.
Sub-tree raising: A node is raised and replaces a higher
node. But in this case some examples must be
reassigned. This method is considered less important
and slower than the first.

Florin Radulescu, Note de curs

65 DMDW-4
Road Map

What is supervised learning


Evaluation of classifiers
Decision trees. ID3 and C4.5
Rule induction systems
Summary

Florin Radulescu, Note de curs

66 DMDW-4
Rules

 Rules can easily be extracted from a decision


tree: each path from the root to a leaf
corresponds to a rule.
From the decision tree in example 2 five IF
THEN rules can be extracted:
Outlook
  Sunny → Humidity
            High   → Class = No   (3/14, 3/3)
            Normal → Class = Yes  (2/14, 2/2)
  Overcast → Class = Yes          (4/14, 4/4)
  Rain → Wind
            Strong → Class = No   (2/14, 2/2)
            Weak   → Class = Yes  (3/14, 3/3)

Florin Radulescu, Note de curs

67 DMDW-4
Rules
The rules are (one for each path):

1. IF Outlook = Sunny AND Humidity = High


THEN Play Tennis = No;
2. IF Outlook = Sunny AND Humidity = Normal
THEN Play Tennis = Yes;
3. IF Outlook = Overcast
THEN Play Tennis = Yes;
4. IF Outlook = Rain AND Wind = Strong
THEN Play Tennis = No;
5. IF Outlook = Rain AND Wind = Weak
THEN Play Tennis = Yes;
Florin Radulescu, Note de curs

68 DMDW-4
Rule induction
 In the case of a set of rules extracted from a decision
tree, rules are mutually exclusive and exhaustive.
 But rules may be obtained directly from the training
data set by sequential covering.
 A classifier built by sequential covering consists in an
ordered or unordered list of rules (called also decision
list), obtained as follows:
 Rules are learned one at a time
 After a rule is learned, the tuples covered by that rule are
removed
 The process repeats on the remaining tuples until some
stopping criteria are met (no more training examples, the
quality of a rule returned is below a user-specified
threshold, …)
Florin Radulescu, Note de curs

69 DMDW-4
Finding rules

 There are many algorithms for rule


induction: FOIL, AQ, CN2, RIPPER, etc.
There are two approaches in sequential
covering:
1. Finding ordered rules, by first determining
the conditions and then the class.
2. Finding a set of unordered rules by first
determining the class and then the
associated condition.
Florin Radulescu, Note de curs

70 DMDW-4
Ordered rules

 The algorithm:
RuleList ← ∅
Rule ← learn-one-rule(D)
while Rule ≠ ∅ AND D ≠ ∅ do
    RuleList ← RuleList + Rule          // append Rule at the end of RuleList
    D ← D – {examples covered by Rule}
    Rule ← learn-one-rule(D)
endwhile
// append the majority class as the last/default rule:
RuleList ← RuleList + {c | c is the majority class}
return RuleList

Florin Radulescu, Note de curs

71 DMDW-4
Learn-one-rule
 Function learn-one-rule is built considering all
possible attribute-value pairs (Attribute op Value)
where Value may be also an interval.
The process tries to find the left side of a new
rule and this left side is a condition.
At the end the rule is constructed using as right
side the majority class of the examples covered
by the left side condition.

Florin Radulescu, Note de curs

72 DMDW-4
Learn-one-rule
1. Start with an empty Rule and a set of BestRules
containing this rule:
Rule ← ∅
BestRules ← {Rule}
2. For each member b of BestRules and for each
possible attribute-value pair p evaluate the combined
condition b ∧ p. If this condition is better than Rule
then it replaces the old value of Rule.
3. At the end of the process a best rule with an
incremented dimension is found. Also in BestRules
the best n combined conditions discovered at this
step are kept (implementing a beam search).

Florin Radulescu, Note de curs

73 DMDW-4
Learn-one-rule
4. The evaluation of a rule may be done using the
entropy of the set containing examples covered
by that rule.
5. Repeat steps 2 and 3 until no more conditions
are added to BestRules. Note that a condition must pass a given threshold at evaluation time, so BestRules may have fewer than n members.
6. If Rule is evaluated and found efficient enough (considering the given threshold) then Rule → c is returned; otherwise an empty rule is the result.
The class c is the majority class of the examples
covered by Rule.
Florin Radulescu, Note de curs

74 DMDW-4
Unordered rules

 The algorithm:
RuleList ← ∅
foreach class c ∈ C do
    D = Pos ∪ Neg                        // Pos = {examples of class c from D}
                                         // Neg = D - Pos
    while Pos ≠ ∅ do
        Rule ← learn-one-rule(Pos, Neg, c);
        if Rule = ∅ then
            quitloop
        else
            RuleList ← RuleList + Rule   // append Rule at the end of RuleList
            Pos = Pos – {examples covered by Rule}
            Neg = Neg – {examples covered by Rule}
        endif
    endwhile
endfor
Florin Radulescu, Note de curs
return RuleList
75 DMDW-4
learn-one-rule again
 For learning a rule two steps are performed: grow a
new rule and then prune it.
 Pos and Neg are split in two parts each: GrowPos,
GrowNeg, PrunePos and PruneNeg.
 The first part is used for growing a new rule and the
second for pruning.
 At the ‘grow’ step a new condition/rule is built, as in the previous algorithm.
 Only the best condition is kept at each step (not the best n).
 Evaluation for the new best condition C’ obtained by
adding an attribute-value pair to C is made using a
different gain:
Florin Radulescu, Note de curs

76 DMDW-4
learn-one-rule again
gain(C, C') = p1 · (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))
where:
p0, n0: the number of positive/negative examples covered by C in GrowPos/GrowNeg.
p1, n1: the number of positive/negative examples covered by C' in GrowPos/GrowNeg.
The rule maximizing this gain is returned by the
‘grow’ step.
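A small Python sketch of this gain (following the formula given above; it assumes the current condition C covers at least one positive example, i.e. p0 > 0):

from math import log2

def grow_gain(p0, n0, p1, n1):
    # gain of refining condition C (covering p0/n0 positives/negatives on the
    # grow sets) into condition C' (covering p1/n1); higher is better
    if p1 == 0:
        return float("-inf")          # a refinement covering no positive example is useless
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# e.g. C covers 50 pos / 50 neg and C' covers 40 pos / 10 neg:
print(grow_gain(50, 50, 40, 10))      # about 27.1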
Florin Radulescu, Note de curs

77 DMDW-4
learn-one-rule again
 At the ‘prune’ step sub-conditions are deleted
from the rule and the deletion that maximizes the function below is chosen:
v = (p - n) / (p + n)
where p, n are the numbers of examples in PrunePos/PruneNeg covered by the rule after the sub-condition deletion.
 Next slide: another example of building all rules for
a given class: the IREP algorithm (Incremental
Reduced Error Pruning) in [Cohen 95]
Florin Radulescu, Note de curs

78 DMDW-4
IREP
procedure IREP(Pos, Neg)
begin
Ruleset := ∅
while Pos ≠ ∅ do
// grow and prune a new rule
split (Pos, Neg) into (GrowPos, GrowNeg) and (PrunePos, PruneNeg)
Rule := GrowRule(GrowPos, GrowNeg)
Rule := PruneRule(Rule, PrunePos, PruneNeg)
if the error rate of Rule on (PrunePos, PruneNeg) exceeds 50%
then return Ruleset
else add Rule to Ruleset
remove examples covered by Rule from (Pos, Neg)
endif
endwhile
return Ruleset
end
Florin Radulescu, Note de curs

79 DMDW-4
Summary
This course presented:
What is supervised learning: definitions, data formats
and approaches.
Evaluation of classifiers: accuracy and other error
measures and evaluation methods: holdout set, cross
validation, bootstrap and scoring and ranking.
Decision trees building and two algorithms developed
by Ross Quinlan (ID3 and C4.5) .
Rule induction systems
 Next week: Supervised learning – part 2.

Florin Radulescu, Note de curs

80 DMDW-4
References
[Liu 11] Bing Liu, 2011. Web Data Mining, Exploring Hyperlinks,
Contents, and Usage Data, Second Edition, Springer, chapter 3.
[Han, Kamber 06] Jiawei Han and Micheline Kamber, Data
Mining: Concepts and Techniques, 2nd ed., The Morgan
Kaufmann Series in Data Management Systems, Jim Gray,
Series Editor Morgan Kaufmann Publishers, March 2006. ISBN
1-55860-901-6
[Sanderson 08] Robert Sanderson, Data mining course notes,
Dept. of Computer Science, University of Liverpool 2008,
Classification: Evaluation
http://www.csc.liv.ac.uk/~azaroth/courses/current/comp527/lectur
es/comp527-13.pdf

Florin Radulescu, Note de curs

81 DMDW-4
References
[Quinlan 86] Quinlan, J. R. 1986. Induction of Decision Trees.
Mach. Learn. 1, 1 (Mar. 1986), 81-106,
http://www.cs.nyu.edu/~roweis/csc2515-
2006/readings/quinlan.pdf
[Wikipedia] Wikipedia, the free encyclopedia, en.wikipedia.org
[Microsoft] Lift Chart (Analysis Services - Data Mining),
http://msdn.microsoft.com/en-us/library/ms175428.aspx
[Cohen 95] William W. Cohen, Fast Effective Rule Induction, in
“Machine Learning: Proceedings of the Twelfth International
Conference” (ML95),
http://sci2s.ugr.es/keel/pdf/algorithm/congreso/ml-95-ripper.pdf

Florin Radulescu, Note de curs

82 DMDW-4
Supervised Learning
- Part 2 -
Road Map

Classification using class association rules


Naïve Bayesian classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging, Boosting,
Random Forest
Summary
Florin Radulescu, Note de curs

2 DMDW-5
CAR definition
IF:
 I is a set of items, I = {i1, i2, …, in},
 C a set of classes (C ∩ I = ∅), and
 T a set of transactions, T = {t1, t2, …, tm} where
each transaction is labeled with a class label c ∈ C,
THEN:
 a class association rule (CAR) is a construction
with the following syntax:
X → y
where X ⊆ I and y ∈ C.
Florin Radulescu, Note de curs

3 DMDW-5
Example: Dataset

 A set of six transactions labeled with classes from
C = {database, datamining, programming}:
Doc1 {rule, tree, classification} datamining
Doc2 {relation, tuple, join, algebra, recommendation} database
Doc3 {variable, loop, procedure, rule} programming
Doc4 {clustering, rule, tree, recommendation} datamining
Doc5 {join, relation, selection, projection, classification} database
Doc6 {rule, tree, recommendation} datamining

Florin Radulescu, Note de curs

4 DMDW-5
Support and confidence

 The following constructions are valid CARs:


rule → datamining;
recommendation → database
For each CAR X → y the support and confidence may
be computed:
sup(X → y) = |{t ∈ T : X ⊆ t and class(t) = y}| / m
conf(X → y) = |{t ∈ T : X ⊆ t and class(t) = y}| / |{t ∈ T : X ⊆ t}|
Florin Radulescu, Note de curs

5 DMDW-5
Example: support and confidence

Doc1 {rule, tree, classification} datamining


Doc2 {relation, tuple, join, algebra, recommendation} database
Doc3 {variable, loop, procedure, rule} programming
Doc4 {clustering, rule, tree, recommendation} datamining
Doc5 {join, relation, selection, projection, classification} database
Doc6 {rule, tree, recommendation} datamining

Using these expressions:


sup(rule → datamining) = 3/6 = 50%, and
conf(rule → datamining) = 3/4 = 75%.
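The two values above can be checked with a small Python sketch (the dataset is the one from the slide; the helper name is illustrative):

docs = [
    ({"rule", "tree", "classification"}, "datamining"),
    ({"relation", "tuple", "join", "algebra", "recommendation"}, "database"),
    ({"variable", "loop", "procedure", "rule"}, "programming"),
    ({"clustering", "rule", "tree", "recommendation"}, "datamining"),
    ({"join", "relation", "selection", "projection", "classification"}, "database"),
    ({"rule", "tree", "recommendation"}, "datamining"),
]

def support_confidence(X, y, transactions):
    # support and confidence of the CAR X -> y over a labeled transaction set
    covering = [label for items, label in transactions if X <= items]
    both = sum(1 for label in covering if label == y)
    sup = both / len(transactions)
    conf = both / len(covering) if covering else 0.0
    return sup, conf

print(support_confidence({"rule"}, "datamining", docs))   # (0.5, 0.75)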

Florin Radulescu, Note de curs

6 DMDW-5
Using CARs

 There are two methods for using CARs in


classification presented in [Liu 11] :
Use CARs for building classifiers
Strongest rule
Subset of rules
Use CARs to build new attributes of the
dataset

Florin Radulescu, Note de curs

7 DMDW-5
Strongest rule

In this case after CARs are obtained by data


mining (as described in chapter 3) the CARs
set is used for classifying new examples:
For each new example the strongest rule
that covers that example is chosen for
classification (its class will be assigned to
the test example).
Strongest rule means rule with the highest
confidence and/or support. There are also
other measures for rule strength (chi-square
test from statistics for example).
Florin Radulescu, Note de curs

8 DMDW-5
Strongest rule

This is the simplest method to use CARs


for classifications:
Rules are ordered by their strength and
For each new test example the ordered rule
list is scanned and the first rule covering the
example is picked up.
A CAR covers an example if the example
contains the left side of the rule (the
transaction contains all the items in rule
left side).
Florin Radulescu, Note de curs

9 DMDW-5
Strongest rule: example

For example if we have an ordered rule list:


rule → datamining;
variable → programming
recommendation → database
Then the transaction
Doc-ex {rule, variable, loop, recommendation} ?

will be labeled with class ‘datamining’ because
the first rule is the strongest one that covers it.
Florin Radulescu, Note de curs

10 DMDW-5
Subset of rules
This method is used in Classification Based on
Associations (CBA). In this case, having a
training dataset D and a set of CARs R, the
objectives are:
A. to order R using their support and
confidence, R = {r1, r2, …, rn}:
1. First rules with highest confidence
2. For the same confidence use the support to
order the rules
3. For the same support and confidence order by
rule generation-time (rules generated first are
‘greater’ than rules generated later).
Florin Radulescu, Note de curs

11 DMDW-5
Subset of rules
B. to select a subset S of R covering D:
1. Start with an empty set S
2. Consider ordered rules from R in sequence: for
each rule r
If D ≠ ∅ and r correctly classifies at least one example
in D, then add r at the end of S and remove the covered
examples from D.
3. Stop when D is empty
4. Add the majority class as default classification.
The result is:
Classifier = <ri1, ri2, …, rik, majority-class>
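A simplified sketch of this covering step in Python (assuming the rules are already sorted as in step A; the names and the data layout are illustrative, not the exact CBA implementation):

def select_rules(sorted_rules, dataset):
    # sorted_rules: list of (X, y) pairs, X a set of items, y a class label
    # dataset: list of (item_set, class_label) training examples
    S, remaining = [], list(dataset)
    for X, y in sorted_rules:
        if not remaining:
            break
        covered = [ex for ex in remaining if X <= ex[0]]
        if any(label == y for _, label in covered):      # classifies >= 1 example correctly
            S.append((X, y))
            remaining = [ex for ex in remaining if ex not in covered]
    labels = [label for _, label in (remaining or dataset)]
    default = max(set(labels), key=labels.count)         # majority class as default
    return S, default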
Florin Radulescu, Note de curs

12 DMDW-5
Using CARs

 There are two methods for using CARs in


classification presented in [Liu 11]:
Use CARs for building classifiers
Strongest rule
Subset of rules
Use CARs to build new attributes of the
dataset

Florin Radulescu, Note de curs

13 DMDW-5
Build new attributes (features)
 In this approach the training dataset is enriched with
new attributes, one for each CAR:
FOREACH transaction
IF transaction is covered by the left part of the CAR
THEN the value of the attribute is 1 (or TRUE)
ELSE the value of the new attribute is 0 (or FALSE)
ENDIF
ENDFOR
 There are also other methods to build classifiers using
a set of CARs, for example grouping rules and
measuring the strength of each group, etc.
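Returning to the feature-construction idea above, a possible sketch (it reuses the docs list from the earlier snippet; the function name is hypothetical):

def add_car_features(transactions, cars):
    # one new binary attribute per CAR: 1 if the transaction contains all items
    # of the rule's left side, 0 otherwise; the class label is kept at the end
    rows = []
    for items, label in transactions:
        rows.append([1 if X <= items else 0 for X, _ in cars] + [label])
    return rows

cars = [({"rule"}, "datamining"), ({"recommendation"}, "database")]
for row in add_car_features(docs, cars):
    print(row)        # e.g. [1, 0, 'datamining'] for Doc1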

Florin Radulescu, Note de curs

14 DMDW-5
Use of association rules
 Association rules (not CARs) may be used in
recommendation systems:
The rules are ordered by their confidence and
support and then may be used, considering them
in this order, for labeling new examples
 Labels are not classes but other items
(recommendations).
For example, based on a set of association
rules containing books, the system may
recommend new books to customers based
on their previous orders.
Florin Radulescu, Note de curs

15 DMDW-5
Road Map

Classification using class association rules


Naïve Bayesian classification
Support vector machines
K-nearest neighbor
Ensemble methods: Bagging, Boosting,
Random Forest
Summary
Florin Radulescu, Note de curs

16 DMDW-5
Naïve Bayes: Overview
 This approach is a probabilistic one.
 The algorithms based on Bayes theorem compute
for each test example not a single class but a
probability of each class in C (the set of classes).
 If the dataset has k attributes, A1, A2, …, Ak, the
objective is to compute for each class c ∈ C = {c1,
c2, …, cn} the probability that the test example (a1,
a2, …, ak) belongs to the class c:
Pr(Class = c | A1 = a1, …, Ak = ak)
 If classification is needed, the class with the
highest probability may be assigned to that
example.
Florin Radulescu, Note de curs

17 DMDW-5
Bayes theorem

 Thomas Bayes (1701 – 1761) was a


Presbyterian minister and an English
mathematician.
The theorem named after him may be
expressed as follows:
P(A | B) = P(B | A) · P(A) / P(B)
where P(A|B) means probability of A given B.

Florin Radulescu, Note de curs

18 DMDW-5
Example
 “Students in PR106 are 60% from the AI
M.Sc. module and 40% from other
modules.
 20% of the students are placed in the first
2 rows of seats but for AI this percent is
30%.
 When the dean enters the class and sits
somewhere in the first 2 rows, near a
student, compute the probability that the
dean's neighbor is from AI?”
Florin Radulescu, Note de curs

19 DMDW-5
Example
“Students in PR106 are 60% from the AI M.Sc. module and 40% from other modules. 20% of the
students are placed in the first 2 rows of seats but for AI this percent is 30%. When the dean
enters the class and sits somewhere in the first 2 rows, near a student, compute the probability
that its neighbor is from AI?”

1. Pr(AI) = 0.6
2. Pr(2 rows | AI) = 0.3
3. Pr(2 rows) = 0.2
So:
Pr(AI | 2 rows) = Pr(2 rows | AI)*Pr(AI) /
Pr(2 rows) = 0.3*0.6/0.2 = 0.9 or 90%.
Florin Radulescu, Note de curs

20 DMDW-5
Building classifiers

 The objective is to compute


Pr(Class = c | A1 = a1, …, Ak = ak).
Applying Bayes theorem:
Pr(C = cj | A1 = a1, …, Ak = ak) =
Pr(A1 = a1, …, Ak = ak | C = cj) · Pr(C = cj) / Pr(A1 = a1, …, Ak = ak)
Florin Radulescu, Note de curs

21 DMDW-5
Building classifiers
 Making the following assumption: “all attributes
are conditionally independent given the class
C=cj”, then:
Pr(A1 = a1, …, Ak = ak | C = cj) = ∏ i=1..k Pr(Ai = ai | C = cj)
Because of this assumption the method is called


“naïve”.
Not in all situations the assumption is valid.
The practice shows that the results obtained
using this simplifying assumption are good
enough in most of the cases.
Florin Radulescu, Note de curs

22 DMDW-5
Building classifiers

 Finally, replacing in the above expression we obtain:
Pr(C = cj | A1 = a1, …, Ak = ak) =
[ Pr(C = cj) · ∏ i=1..k Pr(Ai = ai | C = cj) ] / ∑ r=1..n [ Pr(C = cr) · ∏ i=1..k Pr(Ai = ai | C = cr) ]
All probabilities in the above expression may be


obtained by counting!

Florin Radulescu, Note de curs

23 DMDW-5
Building classifiers

 When only classification is needed, the


denominator of the above expression may be
ignored (it is the same for all cj) and the labeling
class is obtained by maximizing the numerator:
c = argmax cj [ Pr(C = cj) · ∏ i=1..k Pr(Ai = ai | C = cj) ]
Florin Radulescu, Note de curs

24 DMDW-5
Example

 Consider a simplified version of the PlayTennis


table :
Outlook Wind Play Tennis
Overcast Weak Yes
Overcast Strong No
Overcast Absent No
Sunny Weak Yes
Sunny Strong No
Rain Strong No
Rain Weak No
Rain Absent No

Florin Radulescu, Note de curs

25 DMDW-5
Example
Pr(Yes) = 2/8 Pr(No) = 6/8
Pr(Overcast | C = Yes) = 1/2 Pr(Weak | C = Yes) = 2/2
Pr(Overcast | C = No) = 2/6 Pr(Weak | C = No) = 1/6
Pr(Sunny | C = Yes) = 1/2 Pr(Strong| C = Yes) = 0/2
Pr(Sunny | C = No) = 1/6 Pr(Strong| C = No) = 3/6
Pr(Rain | C = Yes) = 0/2 Pr(Absent| C = Yes) = 0/2
Pr(Rain | C = No) = 3/6 Pr(Absent| C = No) = 2/6

 If the test example is:


Sunny Absent ???

Florin Radulescu, Note de curs

26 DMDW-5
Example

For C = Yes:
Pr(Yes) · Pr(Sunny | Yes) · Pr(Absent | Yes) = (2/8) · (1/2) · (0/2) = 0
For C = No:
Pr(No) · Pr(Sunny | No) · Pr(Absent | No) = (6/8) · (1/6) · (2/6) ≈ 0.042
The result is No (not a very wise result!).

Florin Radulescu, Note de curs

27 DMDW-5
Special case: division by 0
 Sometimes a class does not occur with a specific
attribute value.
 In that case one term Pr(Ai = ai | C = cj) is zero, so the
above expression for probabilities of each class
evaluates to 0/0.
 For avoiding this situation, the expression:
Pr(Ai = ai | C = cj) = a / b
must be modified.
(a = number of training examples with Ai = ai and C = cj
and b = number of training examples with C = cj )
Florin Radulescu, Note de curs

28 DMDW-5
Special case: division by 0

 The modified expression is:
Pr(Ai = ai | C = cj) = (a + s) / (b + s · r)
where:
s = 1 / Number of examples in the training set
r = Number of distinct values for Ai

In this case all product terms are greater than


zero.
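A compact Python sketch of the whole procedure by counting, using the smoothed estimate given above (that formula is the one reconstructed on this slide, so treat it as an assumption; the function names are illustrative):

from collections import Counter, defaultdict

def train_nb(rows):
    # rows: list of (attribute_tuple, class_label)
    n = len(rows)
    class_count = Counter(label for _, label in rows)
    value_count = Counter()            # (attribute index, value, class) -> count
    domain = defaultdict(set)          # attribute index -> set of distinct values
    for attrs, label in rows:
        for i, v in enumerate(attrs):
            value_count[(i, v, label)] += 1
            domain[i].add(v)
    return n, class_count, value_count, domain

def predict_nb(example, model):
    n, class_count, value_count, domain = model
    s = 1.0 / n                        # s = 1 / number of training examples
    best, best_score = None, -1.0
    for c, nc in class_count.items():
        score = nc / n                 # Pr(C = c)
        for i, v in enumerate(example):
            a, b, r = value_count[(i, v, c)], nc, len(domain[i])
            score *= (a + s) / (b + s * r)     # smoothed Pr(Ai = v | C = c)
        if score > best_score:
            best, best_score = c, score
    return best

data = [(("Overcast", "Weak"), "Yes"), (("Overcast", "Strong"), "No"),
        (("Overcast", "Absent"), "No"), (("Sunny", "Weak"), "Yes"),
        (("Sunny", "Strong"), "No"), (("Rain", "Strong"), "No"),
        (("Rain", "Weak"), "No"), (("Rain", "Absent"), "No")]
print(predict_nb(("Sunny", "Absent"), train_nb(data)))    # -> No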
Florin Radulescu, Note de curs

29 DMDW-5
Example

For C = Yes and C = No the same products are computed,
this time with the smoothed conditional probabilities; the
value obtained for No is again the larger one.
The result is again No.

Florin Radulescu, Note de curs

30 DMDW-5
Special case: values

Non-categorical values or absent values:


All non-categorical attributes must be
discretized (replaced with categorical
ones).
Also, if some attributes have missing
values, these are ignored.

Florin Radulescu, Note de curs

31 DMDW-5
Road Map

Classification using class association rules


Naïve Bayesian classification
Support vector machines (SVMs)
K-nearest neighbor
Ensemble methods: Bagging, Boosting,
Random Forest
Summary
Florin Radulescu, Note de curs

32 DMDW-5
SVM: Overview
 In this course is presented only the general
idea of the Support Vector Machines (SVM)
classification method.
SVMs are described in detail in many
documentations and books, for example [Liu
11] or [Han, Kamber 06].
The method was discovered in Soviet Union
in '70 by Vladimir Vapnik and was developed
in USA after Vapnik joined AT&T Bell Labs in
early '90 (see [Cortes, Vapnik 95]).
Florin Radulescu, Note de curs

33 DMDW-5
SVM: Model

 Consider the training data set D = {(X1, y1),


(X2, y2), ..., (Xk, yk)} where:
Xi = (x1, x2, ..., xn) is a vector in Rn (all xi
components are real numbers)
yi is the class label, yi ∈ {−1, +1}. If Xi is
labeled with +1 it belongs to the positive
class, else to the negative class (-1).

Florin Radulescu, Note de curs

34 DMDW-5
SVM: Model
 A possible classifier is a linear function:
f(X) = <w X> + b
such that:
yi = +1 if <w Xi> + b ≥ 0 and yi = −1 if <w Xi> + b < 0
where:
 w is a weight vector,
 <w X> is the dot product of vectors w and X,
 b is a real number and
 w and b may be scaled up or down as shown below.
Florin Radulescu, Note de curs

35 DMDW-5
SVM: Model
The meaning of f is that the hyperplane
< w X> + b = 0
separates the points of the training set D in two:
one half of the space contains the positive
values and
the other half the negative values in D (like
hyperplanes H1 and H2 in the next figure).
All test examples can now be classified using f:
the value of f gives the label for the example.
Florin Radulescu, Note de curs

36 DMDW-5
Figure 1

 Source: Wikipedia

Florin Radulescu, Note de curs

37 DMDW-5
Best hyperplane
 SVM tries to find the ‘best’ hyperplane of that
form.
The theory shows that the best hyperplane is the
one maximizing the so-called margin (the
minimum orthogonal distance between a
positive and a negative point from the training set)
– see the next figure for an example.

Florin Radulescu, Note de curs

38 DMDW-5
Figure 2

 Source: Wikipedia

Florin Radulescu, Note de curs

39 DMDW-5
The model
 Consider X+ and X- the nearest positive and negative
points for the hyperplane
<w X> + b = 0
 Then there are two other parallel hyperplanes, H+ and
H- passing through X+ and X- and their expression is:
H+ : <w X> + b = 1
H- : <w X> + b = -1
 These two hyperplanes are with dotted lines in Figure
1. Note that w and b must be scaled such as:
<w Xi> + b ≥ 1 for yi = +1
<w Xi> + b ≤ −1 for yi = −1

Florin Radulescu, Note de curs

40 DMDW-5
The model
 The margin is the distance between these two
planes and may be computed using vector space
algebra, obtaining:
margin = 2 / ||w||
 Maximizing the margin means minimizing the value of:
(1)   ||w||² / 2 = <w w> / 2
 The points X+ and X- are called support vectors


and are the only important points from the dataset.
Florin Radulescu, Note de curs

41 DMDW-5
Definition: separable case
 When positive and negative points are linearly
separable, the SVM definition is the following:
 Having a training data set D = {(X1, y1), (X2, y2), ..., (Xk, yk)}
 Minimize the value of expression (1) above
 With restriction: yi (<w Xi> + b) ≥ 1, knowing the value of yi: +1
or -1
 This optimization problem is solvable by rewriting the
above inequality using a Lagrangian formulation and
then finding solution using Karush-Kuhn-Tucker (KKT)
conditions.
 This mathematical approach is beyond the scope of this
course.
Florin Radulescu, Note de curs

42 DMDW-5
Non-linear separation
 In many situations there is no hyperplane for
separation between the positive and negative
examples.
In such cases it is possible to map the
training data points (examples) into another
space, a higher-dimensional one.
Here data points may be linearly separable.
The mapping function φ takes examples (vectors)
from the input space X and maps them into the so-
called feature space F:
φ : X → F
Florin Radulescu, Note de curs

43 DMDW-5
Non-linear separation

Each point X is mapped to φ(X). So, after
mapping the whole D there is another training
set, containing vectors from F and not from X,
with dim(F) ≥ n = dim(X):
D = {(φ(X1), y1), (φ(X2), y2), ..., (φ(Xk), yk)}
For an appropriate φ, these points are linearly
separable.
An example is the next figure.
Florin Radulescu, Note de curs

44 DMDW-5
Figure 3

 Source: Wikipedia

Florin Radulescu, Note de curs

45 DMDW-5
Kernel functions
 But how can we find this mapping function?
 In solving the optimization problem for finding the
linear separation hyperplane in the new feature space
F, all terms containing training examples are only of
the form <φ(Xi) φ(Xj)>.
 By replacing this dot product with a function of both Xi
and Xj, the need for finding φ disappears. Such a
function is called a kernel function:
 K(Xi, Xj) = <φ(Xi) φ(Xj)>
 For finding the separation hyperplane in F we must
only replace all dot products with the chosen kernel
function and then proceed with the optimization
problem like in separable case.
Florin Radulescu, Note de curs

46 DMDW-5
Kernel functions

Some of the most used kernel functions are:


Linear kernel
K(X, Y) = <X Y> + b
Polynomial Kernel
K(X, Y) = (a * <X Y> + b)p
Sigmoid Kernel
K(X, Y) = tanh(a * <X Y> + b)
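The three kernels above can be written directly with NumPy; a, b and p are the kernel parameters from the formulas, and the example values are arbitrary (a small sketch only):

import numpy as np

def linear_kernel(x, y, b=0.0):
    return np.dot(x, y) + b

def polynomial_kernel(x, y, a=1.0, b=1.0, p=3):
    return (a * np.dot(x, y) + b) ** p

def sigmoid_kernel(x, y, a=1.0, b=0.0):
    return np.tanh(a * np.dot(x, y) + b)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, y), polynomial_kernel(x, y), sigmoid_kernel(x, y))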

Florin Radulescu, Note de curs

47 DMDW-5
Other aspects concerning SVMs
 SVM deals with continuous real values for
attributes.
When categorical attributes exist in the training
data, a conversion to real values is needed.
When more than two classes are needed,
SVM can be used recursively: the first run
separates one class, the second run separates
the second class and so on. For N classes,
N−1 runs are needed.
SVMs are a very good method for high-
dimensional data classification.
Florin Radulescu, Note de curs

48 DMDW-5
Road Map

Classification using class association rules


Naïve Bayesian classification
Support vector machines
K-nearest neighbor (kNN)
Ensemble methods: Bagging, Boosting,
Random Forest
Summary
Florin Radulescu, Note de curs

49 DMDW-5
kNN
 K-nearest neighbor (kNN) does not produce a
model but is a simple method for determining the
class of an example based on the labels of its
neighbors belonging to the training set.
 For running the algorithm a distance function is
needed for computing the distance from the test
example to the examples in the training set.
 A function f(x, y) may be used as distance function if
four conditions are met:
o f(x, y) ≥ 0
o f(x, x) = 0
o f(x, y) = f(y, x)
o f(x, y) ≤ f(x, z) + f(z, y).
Florin Radulescu, Note de curs

50 DMDW-5
Algorithm
Input:
 A dataset D containing labeled examples (the training set)
 A distance function f for measuring the dissimilarity between
two examples
 An integer k – parameter - telling how many neighbors are
considered
 A test example t
Output:
 The class label of t
Method:
 Use f to compute the distance between t and each point in D
 Select nearest k points
 Assign t the majority class from the set of k nearest
neighbors.
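The whole method fits in a few lines of Python (a sketch using the Euclidean distance as f; the training data below is made up for illustration):

import math
from collections import Counter

def knn_classify(train, t, k, dist=math.dist):
    # train: list of (point, label); t: the unlabeled test example
    neighbors = sorted(train, key=lambda ex: dist(ex[0], t))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]          # majority class of the k neighbors

train = [((1, 1), "red"), ((1, 2), "red"),
         ((5, 5), "blue"), ((6, 5), "blue"), ((5, 6), "blue")]
print(knn_classify(train, (2, 2), k=3))        # -> red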
Florin Radulescu, Note de curs

51 DMDW-5
Example
[Figure: with K = 3 the majority class among the neighbors is Red; with K = 5 it is Blue]

kNN is very sensitive to the value of


parameter k.
The best k may be found for example by
cross validation.
Florin Radulescu, Note de curs

52 DMDW-5
Road Map

Classification using class association rules


Naïve Bayesian classification
Support vector machines
K-nearest neighbor (kNN)
Ensemble methods: Bagging, Boosting,
Random Forest
Summary
Florin Radulescu, Note de curs

53 DMDW-5
Ensemble methods

 Ensemble methods combine multiple


classifiers to obtain a better one.
Combined classifiers are similar (use the
same learning method) but the training
datasets or the weights of the examples in
them are different.

Florin Radulescu, Note de curs

54 DMDW-5
Bagging

 The name Bagging comes from Bootstrap


Aggregating.
 As presented in the previous lesson
bootstrap method is part of resampling
methods and consists in getting a training
set from the initial labeled data by
sampling with replacement.

Florin Radulescu, Note de curs

55 DMDW-5
Example


Original dataset a b c d e f

Training set 1 b b b c e f

Training set 2 b b c c d e

Training set 3 a b c c d f

Florin Radulescu, Note de curs

56 DMDW-5
Bagging
Bagging consists in:
 Starting with the original dataset, build n training
datasets by sampling with replacement (bootstrap
samples)
 For each training dataset build a classifier using the
same learning algorithm (called weak classifiers).
 The final classifier is obtained by combining the results
of the weak classifiers (by voting for example).
 Bagging helps to improve the accuracy for unstable
learning algorithms: decision trees, neural networks.
 It does not help for kNN, Naïve Bayesian classification
or CARs.
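A minimal sketch of the two bagging ingredients, bootstrap sampling and voting; the weak learner itself is left abstract and the names are illustrative:

import random
from collections import Counter

def bootstrap_sample(dataset):
    # sample |dataset| examples with replacement
    return [random.choice(dataset) for _ in dataset]

def bagging_predict(weak_classifiers, x):
    # each weak classifier is a function x -> label; the final label is the majority vote
    votes = Counter(clf(x) for clf in weak_classifiers)
    return votes.most_common(1)[0][0]

# training would look like:  classifiers = [learn(bootstrap_sample(D)) for _ in range(T)]
# where 'learn' is the chosen weak learning algorithm (e.g. a decision tree builder)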
Florin Radulescu, Note de curs

57 DMDW-5
Boosting
Boosting consists in building a sequence of
weak classifiers and adding them in the
structure of the final strong classifier.
The weak classifiers are weighted based on
the weak learners' accuracy.
Also, data is reweighted after each weak
classifier is built, such that examples that are
incorrectly classified gain some extra weight.
The result is that the next weak classifiers in
the sequence focus more on the examples
that previous weak classifiers missed.
Florin Radulescu, Note de curs

58 DMDW-5
Random forest
 Random forest is an ensemble classifier consisting of a set
of decision trees. The final classifier outputs the modal value
(the most frequent one) of the classes output by the individual trees.
 The algorithm is the following:
1. Choose T - number of trees to grow (e.g. 10).
2. Choose m - number of variables used to split each node, m ≤ M,
where M is the number of input variables.
3. Grow T trees. When growing each tree do the following:
 Construct a bootstrap sample from training data with
replacement and grow a tree from this bootstrap sample.
 When growing a tree at each node select m variables at
random and use them to find the best split.
 Grow the tree to a maximal extent. There is no pruning.
4. Predict new data by aggregating the predictions of the trees (e.g.
majority votes for classification, average for regression).

Florin Radulescu, Note de curs

59 DMDW-5
Summary
This course presented:
Classification using class association rules: CARs for
building classifiers and using CARs for building new
attributes (features) of the training dataset.
Naïve Bayesian classification: Bayes theorem, Naïve
Bayesian algorithm for building classifiers.
An introduction to support vector machines (SVMs):
model, definition, kernel functions.
K-nearest neighbor method for classification
Ensemble methods: Bagging, Boosting, Random
Forest
 Next week: Unsupervised learning – part 1
Florin Radulescu, Note de curs

60 DMDW-5
References
[Liu 11] Bing Liu, 2011. Web Data Mining, Exploring Hyperlinks,
Contents, and Usage Data, Second Edition, Springer, chapter 3.
[Han, Kamber 06] Jiawei Han, Micheline Kamber, Data Mining:
Concepts and Techniques, Second Edition, Morgan Kaufmann
Publishers, 2006
[Cortes, Vapnik 95] Cortes, Corinna; and Vapnik, Vladimir N.;
"Support-Vector Networks", Machine Learning, 20, 1995.
http://www.springerlink.com/content/k238jx04hm87j80g/
[Wikipedia] Wikipedia, the free encyclopedia, en.wikipedia.org

Florin Radulescu, Note de curs

61 DMDW-5
Unsupervised Learning
- Part 1 -
Road Map

Supervised vs. unsupervised learning.


Clustering
Types of clustering
K-Means
Distance functions
Handling different types of attributes
Summary
Florin Radulescu, Note de curs

2 DMDW-6
Supervised vs. unsupervised
 In the previous chapter (supervised learning), data
points (examples) are of two types:
Labeled examples (by some experts); these
examples are used as training set and sometimes,
part of them as validation set.
Unlabeled examples; these examples, members of
the so-called test set, are new data and the objective
is to label them in the same way the training set
examples are labeled.
 Labeled examples are used to build a model or
method (called classifier) and this classifier is the
‘machine’ used to label further examples
(unlabeled examples from the test set).
Florin Radulescu, Note de curs

3 DMDW-6
Supervised learning
So the starting points of supervised learning are:
1. The set of classes (labels) is known. These classes
reflects the inner structure of the data, so this structure
is previously known in the case of supervised learning
2. Some labeled examples (at least few for each class)
are known. So supervised learning may be
characterized also as learning from examples. The
classifier is built entirely based on these labeled
examples.
3. A classifier is a model or method for expanding the
experience kept in the training set to all further new
examples.
4. Based on a validation set, the obtained classifier may
be evaluated (accuracy, etc).
Florin Radulescu, Note de curs

4 DMDW-6
Unsupervised learning
In unsupervised learning:
The number of classes (called clusters) is not
known. One of the objectives of clustering is
also to determine this number.
The characteristics of each cluster (e.g. its
center, number of points in cluster, etc) are
not known. All these characteristics will be
available only at the end of the process.
There are no examples or other knowledge
related to the inner structure of the data to
help in building the clusters
Florin Radulescu, Note de curs

5 DMDW-6
Unsupervised learning

The objective is not to build a model for


further data points but to discover the inner
structure of an existing dataset.
In unsupervised learning there is no target
attribute: data points are not labeled at the
end of the process but the obtained clusters
may be further used as the input of a
supervised learning algorithm.

Florin Radulescu, Note de curs

6 DMDW-6
Unsupervised learning
Because there are no labeled examples,
there is no possible evaluation of the result
based on previously known information.
Cluster evaluation is made using computed
characteristics of the resulting clusters.
Unsupervised learning is a class of Data
mining algorithms including clustering,
association rules (already presented), self
organizing maps, etc. This chapter focuses
on clustering.
Florin Radulescu, Note de curs

7 DMDW-6
Clustering

Any clustering algorithm has the following


generic structure:
Input:
1. A set of n objects D = {d1, d2, …, dn} (called
usually points). The objects are not labeled
and there is no set of class labels defined.

Florin Radulescu, Note de curs

8 DMDW-6
Clustering
Input:
2. A distance function (dissimilarity
measure) that can be used to compute
the distance between any two points.
 Low valued distance means ‘near’, high
valued distance means ‘far’.
 Note: If a distance function is not
available, the distance between any two
points in D must be provided as input.
Florin Radulescu, Note de curs

9 DMDW-6
Clustering

Input:
3. For the most part of the algorithms the items
are represented by their coordinates in a k
dimensional space, called attribute values,
as every dimension defines an attribute for
the set of points.
 In this case the distance function may be
the Euclidean distance or other attribute
based distance.
Florin Radulescu, Note de curs

10 DMDW-6
Clustering
Input:
4. Some algorithms also need a predefined
value for the number of clusters in the
produced result.
Output:
A set of object (point) groups called clusters
where points in the same cluster are near
one to another and points from different
clusters are far one from another, considering
the distance function.
Florin Radulescu, Note de curs

11 DMDW-6
Example

Three clusters in 2D

Florin Radulescu, Note de curs

12 DMDW-6
Features
Each cluster may be described by its:
 Centroid – is the Euclidean center of the cluster,
computed as the mass center of the (equally
weighted) points in the cluster.
 When the cluster is not in a Euclidean space, the
centroid cannot be determined – there are no
coordinates. In that case a clustroid (or medoid) is
used as the center of a cluster.
 The clustroid/medoid is a point in the cluster, the
one best approximating its center.
Florin Radulescu, Note de curs

13 DMDW-6
Features

Also each cluster may be characterized by its:


Radius – is the maximum distance from the
centroid to the cluster points
Diameter – is the maximum distance
between two points within a cluster. Note that
the diameter is not twice the radius.

Florin Radulescu, Note de curs

14 DMDW-6
Road Map

Supervised vs. unsupervised learning.


Clustering
Types of clustering
K-Means
Distance functions
Handling different types of attributes
Summary
Florin Radulescu, Note de curs

15 DMDW-6
Classification

Based on the method for discovering the


clusters, the most important categories
are:
Centroid based clustering
Hierarchical clustering
Distribution-based clustering
Density-Based clustering

Florin Radulescu, Note de curs

16 DMDW-6
Centroid-based

In this approach initial centroids are


determined in some way and then points are
added to the clusters.
This method makes directly a partitioning of
the dataset.
The best known algorithm in this class is k-
Means.

Florin Radulescu, Note de curs

17 DMDW-6
Example

K-Means:

Florin Radulescu, Note de curs

18 DMDW-6
Hierarchical clustering

The result of a hierarchical clustering


algorithm is a dendrogram – a tree having
clusters as nodes, leaf nodes containing
clusters with a single data point.
 Each node is the union (obtained by merging) of
its children.
A well known algorithm in this class is
BIRCH.

Florin Radulescu, Note de curs

19 DMDW-6
Example

Results of a hierarchical clustering


algorithm:

Florin Radulescu, Note de curs

20 DMDW-6
Distribution-based clustering

For these algorithms, clusters can be defined


as containing objects belonging most likely to
the same distribution.
Expectation-maximization algorithm is a
representative of this class.

Florin Radulescu, Note de curs

21 DMDW-6
Example

Florin Radulescu, Note de curs

22 DMDW-6
Density-based clustering

In this case, a cluster is defined as a region


with a higher density of points in the data
space.
Examples:
DBSCAN
OPTICS

Florin Radulescu, Note de curs

23 DMDW-6
Example

Florin Radulescu, Note de curs

24 DMDW-6
Hard vs. Soft clustering
Based on the number of clusters for each
point, clustering techniques may be classified
in:
1. Hard clustering. In that case each point
belongs to exactly one cluster.
2. Soft clustering. These techniques (called
also fuzzy clustering) compute for each data
point and each cluster a membership level
(the level or degree of membership of that
point to that cluster). FLAME algorithm is of
this type.
Florin Radulescu, Note de curs

25 DMDW-6
Hierarchical clustering
Hierarchical clustering algorithms can be
further classified in:
Agglomerative hierarchical clustering: starts with
a cluster for each point and merge the closest
clusters until a single cluster is obtained (bottom-
up).
Divisive hierarchical clustering: starts with a
cluster containing all points and split clusters in
two, based on density or other measure, until
single data point clusters are obtained (top-
down).
Florin Radulescu, Note de curs

26 DMDW-6
Dendrogram
 In both cases a dendrogram is obtained.
 The dendrogram is the tree resulting from the
merge or split action described above.
 For obtaining some clusters, the dendrogram may
be cut at some level.
 For the next example, cutting with the upper
horizontal line produces the clusters {(a), (bc),
(de), (f)}.
 The second cut produces {(a), (bc), (def)}. Based
on clusters’ characteristics (see cluster evaluation
next week) the best cut may be determined.
Florin Radulescu, Note de curs

27 DMDW-6
Example

Florin Radulescu, Note de curs

28 DMDW-6
Agglomerative hierarchical algorithm

The agglomerative approach is preferred in


hierarchical clustering.
The sketch of such an algorithm is the following:
Input:
A set of n points D = {d1, d2, …, dn}, a distance
function between them or the distance between
any two points.
Output:
The dendrogram resulting from the clustering
process above
Florin Radulescu, Note de curs

29 DMDW-6
Method
START with a cluster for each point of D.
COMPUTE the distance between any two clusters
WHILE the number of clusters is greater than 1
DO
DETERMINE the nearest two clusters
MERGE clusters in a new cluster c
COMPUTE the distances from c to the other
clusters
ENDWHILE
Florin Radulescu, Note de curs

30 DMDW-6
Distance between clusters
For determining the distance between two
clusters several methods can be used:
1. Single link method: the distance between
two clusters is the minimum distance
between a point in the first cluster and a
point in the second cluster.
2. Complete link method: the distance
between two clusters is the maximum
distance between a point in the first cluster
and a point in the second cluster.
Florin Radulescu, Note de curs

31 DMDW-6
Distance between clusters

For determining the distance between two


clusters several methods can be used:
3. Average link method: the distance
between two clusters is the average
distance between a point in the first cluster
and a point in the second cluster.
4. Centroid method: the distance between
two clusters is the distance between their
centroids.
Florin Radulescu, Note de curs

32 DMDW-6
Road Map

Supervised vs. unsupervised learning.


Clustering
Types of clustering
K-Means
Distance functions
Handling different types of attributes
Summary
Florin Radulescu, Note de curs

33 DMDW-6
Algorithm description

A centroid-based clustering algorithm. The


input includes the number of clusters to be
obtained (k from the algorithm name).
The algorithm structure is:
1. Start choosing randomly k initial cluster centers from
the dataset D to be processed.
2. Assign each point in the dataset to the nearest centroid.
3. Re-compute the centroids for each cluster found at step
2.
4. Go to step 2 until some stopping criteria are met.
Florin Radulescu, Note de curs

34 DMDW-6
Conditions

K-means assumes the existence of a


Euclidean space.
Points have coordinates and re-computation
of the centroids is made based on them.
If the first set of centroids (chosen at step 1)
is contained in the dataset D, after the first re-
computation the new centroids are not
necessarily points in D but some points in the
same space.
Florin Radulescu, Note de curs

35 DMDW-6
Conditions

New centroids are determined as the mass-


weight center of the points in each cluster,
assuming that each point has the same
weight.
In other words, if each cluster point is
represented as a vector, the new centroids
are the average values of these vectors.

Florin Radulescu, Note de curs

36 DMDW-6
K-means algorithm

Input:
A dataset D = {P1, P2, …, Pm} containing m
points in an n-dimensional Euclidian space
and a distance function.
k: the number of clusters to be obtained
Output:
The k clusters obtained

Florin Radulescu, Note de curs

37 DMDW-6
Method
1. Choose randomly k points in D as initial centroids
2. REPEAT
3.    FOR (i=1; i<=m; i++)
4.       using the distance function, assign Pi to the nearest centroid
5.    END FOR
6.    FOR (i=1; i<=k; i++)
7.       Consider the set of r points assigned to centroid i: {Pj1, …, Pjr}
8.       New centroid is (Pj1 + … + Pjr) / r   // each point is considered a vector
9.    END FOR
10. UNTIL stopping criteria are met
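A direct Python transcription of the method above (a sketch only; it uses "no change of the centroids" as the stopping criterion and assumes the points are tuples of coordinates):

import math, random

def kmeans(points, k, max_iter=100):
    centroids = random.sample(points, k)              # step 1: random initial centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                              # steps 3-5: assign to nearest centroid
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        new_centroids = [tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
                         for i, c in enumerate(clusters)]   # steps 6-9: recompute centroids
        if new_centroids == centroids:                # stopping criterion
            break
        centroids = new_centroids
    return clusters, centroids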
Florin Radulescu, Note de curs

38 DMDW-6
Stopping criteria
Stopping criteria may be:
1. Cluster are not changing from an iteration to
another.
2. Cluster changes are below a given threshold
(e.g. no more than p points are changing the
cluster between two successive iterations).
3. Cluster centroids movement is below a given
threshold (e.g. the sum of distances between
old and new positions for centroids is no more
than d between two successive iterations).
Florin Radulescu, Note de curs

39 DMDW-6
Stopping criteria

(stopping criteria continued):


4. The decrease of the SSD (sum of squared
distances) is under a given threshold.
 The SSD measures the compactness of the
whole set of clusters.
 It is the sum of the squared distances from
each point to its centroid:
SSD = ∑ j=1..k ∑ x∈Cj dist(x, cj)²   (cj = centroid of cluster Cj)
Florin Radulescu, Note de curs

40 DMDW-6
Weaknesses

1. The algorithm is sensitive to outliers.


 Outliers are mainly errors and are placed
far away from any other point.
 The algorithm is trying to include outliers
in some clusters and the new centroids,
computed at each iteration, are far from
their natural position (without outliers).

Florin Radulescu, Note de curs

41 DMDW-6
Weaknesses

2. The algorithm is sensitive to the initial


position of the centroids.
 Changing the initial centroids may lead to
other resulting clusters, as in the next
example.

Florin Radulescu, Note de curs

42 DMDW-6
Example: initial centroids

 a, b, c and d may be grouped in two ‘horizontal’


clusters for initial centroids a and c or in two
‘vertical’ clusters for initial centroids a and b.

a b

c d

Florin Radulescu, Note de curs

43 DMDW-6
Weaknesses

3. From weakness 2 it also follows that a global
optimum solution is not guaranteed - only a
local optimum is obtained.
4. The number of clusters, k, must be provided
from outside of the algorithm
5. K-means has good results on clusters with a
convex, spherical shape. For non-convex
shapes the results are not realistic.

Florin Radulescu, Note de curs

44 DMDW-6
Weaknesses

6. It is not so efficient if data are stored on


disks but works well when data may be
loaded into the main memory.
7. If the mean of the points cannot be
computed, the algorithm cannot be used.
8. For categorical data there is a variation of k-
means: k-mode.
9. If medoids are used, there is another
variation: k-medoids.
Florin Radulescu, Note de curs

45 DMDW-6
Strengths

1. It is very simple and easy to implement.


 There is a great number of packages and
individual implementations of k-means.
2. It is an efficient algorithm.
 Its complexity is linear in number of
clusters, number of iterations and number
of points.
 As the first two are small, k-means may
be considered a linear algorithm.
Florin Radulescu, Note de curs

46 DMDW-6
Road Map

Supervised vs. unsupervised learning.


Clustering
Types of clustering
K-Means
Distance functions
Handling different types of attributes
Summary
Florin Radulescu, Note de curs

47 DMDW-6
A distance function must be:

1. Non-negative: f(x, y) ≥ 0
2. Identity: f(x, x) = 0
3. Symmetry: f(x, y) = f(y, x)
4. Triangle inequality:
f(x, y) ≤ f(x, z) + f(z, y).

Florin Radulescu, Note de curs

48 DMDW-6
Distance function

A distance function is a measure of the


dissimilarity between its two arguments.
The distance between two points is based on
the values of the attributes of both
arguments.
If the points are associated with a Euclidean
space with k dimensions, then each point has
k coordinates and these values may be used
for computing the distance.
Florin Radulescu, Note de curs

49 DMDW-6
Euclidean distance

 Simple:
dist(x, y) = sqrt( ∑ i=1..k (xi − yi)² )
 Weighted: when some dimensions are more
important than others:
dist(x, y) = sqrt( ∑ i=1..k wi · (xi − yi)² )
Florin Radulescu, Note de curs

50 DMDW-6
Euclidean distance

Squared Euclidean distance:
dist(x, y) = ∑ i=1..k (xi − yi)²
This squared distance is used when distant
points are more important.

Florin Radulescu, Note de curs

51 DMDW-6
Other distance functions

 Manhattan distance (city block): the road
between the two points may be followed only
parallel with the axes:
dist(x, y) = ∑ i=1..k |xi − yi|
 Chebyshev distance: used in the case of
hyper-dimensionality:
dist(x, y) = max i=1..k |xi − yi|
Florin Radulescu, Note de curs

52 DMDW-6
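The distances above in plain Python (a small sketch; w is the optional weight vector for the weighted Euclidean case):

def euclidean(x, y, w=None):
    w = w or [1.0] * len(x)
    return sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)) ** 0.5

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (1, 2, 3), (4, 0, 3)
print(euclidean(x, y), manhattan(x, y), chebyshev(x, y))   # 3.605..., 5, 3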
Binary attributes
In some situations all attributes have only two
values: 0 or 1 (positive / negative, yes / no,
true / false, etc).
For these cases the distance function may be
defined based on the following confusion
matrix:
a = number of attributes having value 1 for both x and y
b = number of attributes having 1 for x and 0 for y
c = number of attributes having 0 for x and 1 for y
d = number of attributes having value 0 for both x and y
Florin Radulescu, Note de curs

53 DMDW-6
Confusion matrix

                      Data point y
                      1         0
Data point x    1     a         b        a+b
                0     c         d        c+d
                      a+c       b+d      a+b+c+d

Florin Radulescu, Note de curs

54 DMDW-6
Symmetric binary

When attribute values 0 and 1 have the same
weight, the distance can be computed using
the proportion of different values (Simple
Matching Coefficient):
dist(x, y) = (b + c) / (a + b + c + d)
Florin Radulescu, Note de curs

55 DMDW-6
Asymmetric binary

When only the value 1 is important,
attributes having a 0 value for both points
(their number is d) may be ignored in the
distance function (this is the Jaccard distance):
dist(x, y) = (b + c) / (a + b + c)
Florin Radulescu, Note de curs

56 DMDW-6
Nominal attributes

This is a generalized version of binary


attributes above.
The proportion of dissimilarities may also be
used as a distance function.
Suppose two points x and y having k nominal
attribute values each and s the number of
attributes where x and y have the same
value.

Florin Radulescu, Note de curs

57 DMDW-6
Nominal attributes

In this case the Simple Matching Coefficient
distance is written as:
dist(x, y) = (k − s) / k
Florin Radulescu, Note de curs

58 DMDW-6
Cosine distance
 Consider two points, x = (x1, x2, …, xk) and y = (y1,
y2, …, yk), in a space with k dimensions.
 In this case each point may be viewed as a vector
starting from the origin of axis and pointing to x or
y.
 The angle between these two vectors may be
used for measuring the similarity: if the angle is 0
or near this value then the points are similar.
 Because the distance is a measure of the
dissimilarity, the cosine of the angle – cos(θ) – may
be used in the distance function as follows:
Dist(x, y) = 1 − cos(θ)
Florin Radulescu, Note de curs

59 DMDW-6
Example
[Figure: two vectors drawn from the origin in a plane with axes Dimension 1 and Dimension 2; the angle θ between them measures their similarity]

Florin Radulescu, Note de curs

60 DMDW-6
Cosine distance

Why cos(θ)? Because the value of cos(θ)
may be obtained using the dot product of x
and y as follows:
cos(θ) = <x y> / (||x|| · ||y||)
The cosine similarity can be used for


example in finding the distance between
documents.
Florin Radulescu, Note de curs

61 DMDW-6
Cosine distance: Example
 If a document is considered a bag of words, each
word of the considered vocabulary becomes a
dimension. On a dimension, a document has the
coordinate:
 1 or 0 depending on the presence or absence of the
word from the document
or
 A natural number, equal with the number of
occurrences of the word in the document.
 Considering a document y containing two or more
copies of another document x, the angle between
x and y is zero so the cosine distance is also equal
to 0 (the documents are 100% similar).
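A sketch of the cosine distance for two such term-count vectors (the vocabulary and the counts below are made up; doc_y plays the role of the "two concatenated copies" example):

import math

def cosine_distance(x, y):
    # 1 - cos(theta) for two equal-length count vectors
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norms

doc_x = [1, 2, 0, 1]        # term counts of x over a 4-word vocabulary
doc_y = [2, 4, 0, 2]        # y = two copies of x, so the counts double
print(cosine_distance(doc_x, doc_y))   # ~0.0 (the documents are 100% similar)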
Florin Radulescu, Note de curs

62 DMDW-6
No Euclidean space case

There are cases when the members of the input


dataset D have no coordinates.
In that case the distance function is based on
other features.
An example is cited in [Ullman] and computes
the distance between two sequences – for
example two character strings or two genes from
the DNA.
The distance function is called the Edit Distance.
Florin Radulescu, Note de curs

63 DMDW-6
Edit distance

Considering two sequences x and y, the
distance between x and y may be defined as
the minimum number of deletions and
insertions of single sequence elements
needed for transforming x into y.

Florin Radulescu, Note de curs

64 DMDW-6
Edit distance example
 Consider strings x and y:
x = 'Mary had a little lamb'
y = 'Baby: had a little goat'
 Operations for transforming x in y:
 2 deletions and 3 insertions to transform
'Mary' in 'Baby:'
 3 deletions and 3 insertions to transform
'lamb' in 'goat'
 So the distance is 2+3+3+3 = 11.
Florin Radulescu, Note de curs

65 DMDW-6
Edit distance formula

If we consider LCS(x, y) = the longest
common subsequence of x and y, then the edit
distance may be written as follows:
Dist(x, y) = ||x|| + ||y|| − 2 · ||LCS(x, y)||
For the previous example:


||x|| = 22; ||y|| = 23;
LCS(x, y) = 'ay had a little a'; ||LCS(x, y)|| = 17
Dist(x, y) = 22 + 23 – 2*17 = 45 – 34 = 11 qed.
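The same value can be obtained programmatically; a sketch that computes the LCS length by dynamic programming and then applies the formula above:

def lcs_length(x, y):
    # classic dynamic-programming longest common subsequence
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, a in enumerate(x, 1):
        for j, b in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if a == b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def edit_distance(x, y):
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(edit_distance('Mary had a little lamb', 'Baby: had a little goat'))   # 11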
Florin Radulescu, Note de curs

66 DMDW-6
Road Map

Supervised vs. unsupervised learning.


Clustering
Types of clustering
K-Means
Distance functions
Handling different types of attributes
Summary
Florin Radulescu, Note de curs

67 DMDW-6
Data standardization

 Sometimes the importance of some


attribute is bigger that others only because
the value range of that attribute is bigger.
We can achieve normalization using some
transformations.
Presented also in second chapter!

Florin Radulescu, Note de curs

68 DMDW-6
Interval-scaled

Min-max normalization:
vnew = (v – vmin) / (vmax – vmin)
For positive values the formula is:
vnew = v / vmax
z-score normalization (σ is the standard
deviation):
vnew = (v – vmean) / σ

Florin Radulescu, Note de curs

69 DMDW-6
Interval-scaled

Decimal scaling:

vnew = v / 10^n

where n is the smallest integer such that all
numbers become (as absolute value) at most
the range r (for r = 1, all new values of v
are smaller than or equal to 1).
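The three interval-scaled normalizations as small Python helpers (a sketch over a made-up list of values):

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

def decimal_scaling(values, r=1.0):
    n = 0
    while max(abs(v) for v in values) / 10 ** n > r:   # smallest n bringing |v| within r
        n += 1
    return [v / 10 ** n for v in values]

v = [200, 300, 400, 600, 1000]
print(min_max(v))            # [0.0, 0.125, 0.25, 0.5, 1.0]
print(decimal_scaling(v))    # [0.2, 0.3, 0.4, 0.6, 1.0]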
Florin Radulescu, Note de curs

70 DMDW-6
Ratio-scaled

Log transform:

vnew = log(v)

This normalization may be used for ratio


scaled attributes with exponential growth.

Florin Radulescu, Note de curs

71 DMDW-6
Nominal, ordinal

Nominal attributes:
Use feature construction tricks presented in the
last chapter:
 If a nominal attributes has n values it is replaced
by n new attributes having a 1/0 value (the
attribute has/has not that particular value).
Ordinal attributes:
Values of an ordinal attribute are ordered, so it
can be treated as a numeric one, assigning some
numbers to its values.
Florin Radulescu, Note de curs

72 DMDW-6
Mixed attributes

In many cases the attributes of a dataset


have different types.
In this case there is no distance function that
may be applied to find the distance between
points.
Solutions:
Convert to a common type
Combine different distances
Florin Radulescu, Note de curs

73 DMDW-6
Convert to a common type
 If some attribute type is predominant, all other
attributes are converted to that type
Then use a distance function attached to that
type.
 Some conversions make no sense:
Converting a nominal attribute to an interval
scaled one is not obvious.
How can we convert values as {sunny, overcast,
rain} in numbers?
 Sometimes we can assign a value (for example the
average temperature of a sunny, overcast or rainy
day) but this association is not always productive.
Florin Radulescu, Note de curs

74 DMDW-6
Combine different distances
 A distance for each dimension is computed
using an appropriate distance function
Then these distances are combined in a
single one.
If:
 d(x, y, i) = the distance between x and y on
dimension i
 δ(x, y, i) = 0 or 1 depending on whether
the values of x and y on dimension i are
missing (even only one of them) or not.
Florin Radulescu, Note de curs

75 DMDW-6
Combine different distances
Then:
dist(x, y) = ∑ i=1..n δ(x, y, i) · d(x, y, i) / ∑ i=1..n δ(x, y, i)
So δ says if that dimension is considered
(value 1) or not (value 0) for the combined
distance between x and y.
The combined distance is the average value
of the distances on the considered
dimensions.
Florin Radulescu, Note de curs

76 DMDW-6
Summary
This course presented:
A parallel between supervised vs.
unsupervised learning, the definition of
clustering and classifications of clustering
algorithms
The description of the k-means algorithm, one
of the most popular clustering algorithms
A discussion about distance functions
How to handle different types of attributes
Next week: Unsupervised learning – part 2
Florin Radulescu, Note de curs

77 DMDW-6
References
[Liu 11] Bing Liu, 2011. Web Data Mining, Exploring Hyperlinks,
Contents, and Usage Data, Second Edition, Springer, chapter 3.
[Rajaraman, Ullman 10] Mining of Massive Datasets, Anand
Rajaraman, Jeffrey D. Ullman, 2010
[Ullman] Jeffrey Ullman, Data Mining Lecture Notes, 2003-2009,
web page: http://infolab.stanford.edu/~ullman/mining/mining.html

Florin Radulescu, Note de curs

78 DMDW-6
Unsupervised Learning
- Part 2 -
Road Map

K-Medoids, k-modes and k-means++


FastMap: multidimensional scaling
Cluster evaluation
Fuzzy clustering: fuzzy C-means
Clusters and holes
Summary

Florin Radulescu, Note de curs

2 DMDW-7
k-medoids
The algorithms in this category are similar to
k-means.
The main differences from k-means are:
K-medoids uses a data point as center of a
cluster (such a point is called a medoid). This is
the cluster member best approximating the
cluster center.
Stopping criterion is based not on SSD but on
sum of pairwise dissimilarities (distances).
The best known algorithm of this type is
Partitioning Around Medoids (PAM)
Florin Radulescu, Note de curs

3 DMDW-7
PAM

Input:
A dataset D = {P1, P2, …, Pm} containing m
points in an n-dimensional space and a
distance function between points in that
space.
k: the number of clusters to be obtained
Output:
The k clusters obtained
Florin Radulescu, Note de curs

4 DMDW-7
PAM - Method
1. Randomly choose k points in D as initial medoids: {m1, m2, …, mk}
2. REPEAT
3. FOR (i=1; i<=m; i++)
4. Assign Pi to the nearest medoid
5. END FOR
6. FOR (i=1; i<=k; i++)
7. FOR (j=1; j<=m; j++)
8. IF Pj is not a medoid THEN
9. Configuration(i, j) = swap Pj with mi
10. Compute the cost of the new configuration
11. Reverse the swap
12. END IF
13 END FOR
14 END FOR
15. Select the configuration with the best cost (lowest)
16. UNTIL New configuration = Old configuration
Florin Radulescu, Note de curs

5 DMDW-7
Configuration cost
The main idea is that each medoid may be
swapped with any non-medoid point.
If the new configuration is the best swap, a new
medoid is appointed replacing an old one.
The process continues until no better
configuration is possible.
The cost of a configuration is the sum of the
distances between points and their medoids:
cost = ∑ i=1..k ∑ P∈Ci dist(P, mi)
Florin Radulescu, Note de curs

6 DMDW-7
k-modes
k-modes is designed to be used for points
having categorical (nominal or ordinal)
attributes.
The mode of a dataset is the most frequent
value.
This refers to a dataset containing atomic
values.
In clustering a point is characterized by a set
of attributes, in some cases of different types,
each attribute having a value from its domain.
Florin Radulescu, Note de curs

7 DMDW-7
k-modes
In that case we must redefine the mode for
applying the notion to a set of points.
The definition starts with the expression
returning the number of dissimilarities (like in the
previous course) between two points X and Y in
an n-dimensional space:
d(X, Y) = ∑ i=1..n δ(xi, yi)
where X = (x1, …, xn), Y = (y1, …, yn) and
δ(xi, yi) = 0 if xi = yi and 1 otherwise
Florin Radulescu, Note de curs

8 DMDW-7
The mode
 If D = {P1, P2, …, Pm} is a set containing m points
with n attributes (categorical or not), the mode of D
may be defined as a vector (with the same number
of dimensions) Q = (q1, q2, …, qn) that minimizes:

D(Q, D) = ∑ i=1..m d(Pi, Q)

 Q is not necessarily a member of D. The mode of a


set is not unique. For example, the mode of [a, b],
[a, c], [c, b], and [b, c] is either [a, b] or [a, c].

Florin Radulescu, Note de curs

9 DMDW-7
k-means vs. k-modes

The differences between k-means and k-


modes are listed in the initial article ([Huang
98]):
1. Uses of a simple matching dissimilarity
measure for categorical objects,
2. Replaces means of clusters by modes, and
3. Uses a frequency-based method for finding
the modes.

Florin Radulescu, Note de curs

10 DMDW-7
Frequency-based method
 Let X be a set of categorical objects described by
categorical attributes A1 , A2 , …, Am
 Let nc(k , j) be the number of objects having category
c(k,j) in attribute Aj and
 Let fr(Aj = c(k,j) | X) = nc(k,j) / n the relative frequency of
category c(k,j) in X.
Then:
 Theorem: The function D(Q,X) is minimised iff
fr(Aj = qj | X) >= fr(Aj = c(k,j) | X) for qj ≠ c(k,j) for all j =
1..m.

The theorem defines a way to find Q from a given X


Florin Radulescu, Note de curs

11 DMDW-7
Frequency-based method

Example: the mode of [a, b], [a, c], [c, b],


and [b, c].
For the first attribute, the most frequent
value is a
For the second attribute, the most frequent
values are b and c (with the same
frequency)
So the modes are [a, b] and [a, c]
Florin Radulescu, Note de curs

12 DMDW-7
K-means++

One of the problems of k-means is that the


algorithm is sensitive to the initial centroids.
A bad choice may lead to bad clustering results,
as in figure 1: if a and c are chosen for initial
centroids the result is not the natural one:

a b

c d

Florin Radulescu, Note de curs

13 DMDW-7
k-means++
 K-means++ is not a new clustering algorithm but a
method to select initial centroids:
1. The first centroid is selected randomly from the data
points.
2. For each data point P, compute d = Dist(P, c), the
distance between P and the nearest centroid already
determined.
3. A new centroid is selected using a weighted
probability distribution: the point is chosen with a
probability proportional to d2.
4. Repeat steps 2 and 3 until k centroids are selected.
 After initial centroid selection, usual k-means algorithm
may be run for clustering the dataset.
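The selection procedure in Python (a sketch; random.choices does the weighted drawing with probability proportional to the squared distance):

import math, random

def kmeans_pp_init(points, k):
    centroids = [random.choice(points)]                                        # step 1
    while len(centroids) < k:
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]   # step 2
        centroids.append(random.choices(points, weights=d2, k=1)[0])          # step 3
    return centroids

# the usual k-means algorithm is then run starting from these centroids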
Florin Radulescu, Note de curs

14 DMDW-7
Road Map

K-Medoids, k-modes and k-means++


FastMap: multidimensional scaling
Cluster evaluation
Fuzzy clustering: fuzzy C-means
Clusters and holes
Summary

Florin Radulescu, Note de curs

15 DMDW-7
FastMap
There are cases when there is no Euclidean
space and only the distances between two
points are available (given as input or by a
distance function specific to the dataset).
 In that case all the algorithms assuming the
existence of coordinates and of a Euclidean
space cannot be used.
This paragraph presents a solution for solving
the above problem: associate a Euclidian
space with few dimensions with such a
dataset.
Florin Radulescu, Note de curs

16 DMDW-7
FastMap

If we have N points and the distances


between any two points, there is a solution
for the exact placing of these points in a
space with N-1 dimensions.
Example: Placing exactly 3 points in 2D:

A B
C
Florin Radulescu, Note de curs

17 DMDW-7
FastMap
If N is big, computations are slow. So we
need to place N points into a space with k
dimensions, where k << N.
This process of creating a Euclidian space
knowing only the distances between any two
points is called multidimensional scaling
There are many algorithms for this, the most
known being FastMap, MetricMap, and
Landmark MDS (LMDS).
These algorithms approximate classical MDS
using a subset of the data and fitting the
remainder to the solution. Florin Radulescu, Note de curs

18 DMDW-7
FastMap

FastMap is a recursive algorithm and at each


step the following operations are done:
Two distant points are selected as an axis. In
the next figure these points are a and b.
For every point c compute the coordinate x
for this axis using the generalized Pythagoras
theorem (also known as the law of cosines):
x = (D²(a, c) + D²(a, b) − D²(b, c)) / (2 · D(a, b))
Florin Radulescu, Note de curs

19 DMDW-7
x = (D²(a, c) + D²(a, b) − D²(b, c)) / (2 · D(a, b))
[Figure: triangle with vertices a, b and c; the sides have lengths D(a, c), D(b, c) and D(a, b), and x is the projection of c onto the segment (a, b)]

Florin Radulescu, Note de curs

20 DMDW-7
D′² = D² − (x − y)²
For the further axes use not the original distances D
between points but the remainder D′ obtained after
subtracting the component explained by the
coordinates already computed.
[Figure: points c and d with original distance D and residual distance D′; x and y are their coordinates on the axis (a, b)]
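The two formulas as Python helpers (a sketch; D is assumed to be a function returning the distance between two point identifiers, and the residual is clipped at 0, as discussed on the next slide):

def fastmap_coordinate(D, a, b, c):
    # coordinate of point c on the axis defined by the pivots a and b
    return (D(a, c) ** 2 + D(a, b) ** 2 - D(b, c) ** 2) / (2 * D(a, b))

def residual_distance(D, x_c, x_d, c, d):
    # distance used for the next axis, after removing the part already explained
    # by the coordinates x_c and x_d computed on the current axis
    rest = D(c, d) ** 2 - (x_c - x_d) ** 2
    return max(rest, 0.0) ** 0.5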

Florin Radulescu, Note de curs

21 DMDW-7
Weaknesses
This process stops after computing the
desired number of coordinates for every point
or no more axes can be found.
Weakness:
For real data the problem is that if the
distance matrix is not a Euclidean one, the
value for D′² may be negative!
In that case the only way to continue is to
assume D’ is 0.
But this assumption leads to propagated
errors.
Florin Radulescu, Note de curs

22 DMDW-7
Example

The algorithm was run with this assumption


on a 2000 nodes matrix.
 The optimal number of axes was 18 (a fairly
large value)
The average of the differences between real
distances and computed distances
(computed from the resulting coordinates)
was high: 0.78 after 18 steps, as shown in
next figure.
Florin Radulescu, Note de curs

23 DMDW-7
Example
[Chart: average of |Dreal – Dcomputed| over all pairs of nodes (vertical axis, 0.00–4.00) plotted against STEP (horizontal axis, 1–39)]

Florin Radulescu, Note de curs

24 DMDW-7
Road Map

K-Medoids, k-modes and k-means++


FastMap: multidimensional scaling
Cluster evaluation
Fuzzy clustering: fuzzy C-means
Clusters and holes
Summary

Florin Radulescu, Note de curs

25 DMDW-7
Cluster evaluation
After performing the clustering process, the
result must be evaluated in order to validate it
(or not).
Because real clusters are not known for a
test dataset, this is a hard problem.
Some methods were developed for this
purpose.
These methods are designed not for
evaluating the clustering results on a
particular dataset but for evaluating the
quality of the clustering algorithm.
Florin Radulescu, Note de curs

26 DMDW-7
Methods

Most used methods:


1. User inspection
2. Ground truth
3. Cohesion and separation
4. Silhouette
5. Indirect evaluation

Florin Radulescu, Note de curs

27 DMDW-7
1. User inspection
 In that case some experts are inspecting the results of the clustering
algorithm and rate it.
 User inspection may include:
 Evaluate cluster centroids
 Evaluate distribution of points in clusters
 Evaluate clusters by their representation (sometimes clusters
may be represented as a decision tree for example).
 Test some points to see if they really belong to the assigned
cluster. This can be made when clustering documents: after
clustering, some documents in each cluster are analyzed to see
if they are in the same category.
 This method is hard to use for numerical data and huge volumes of
information because the user inspection is based on the experience
and intuition of the experts.
 Also, this method is subjective and may lead sometimes to a wrong
verdict.
Florin Radulescu, Note de curs

28 DMDW-7
2. Ground truth
(comparison with the real situation)

In this case the input of the clustering


algorithm is a labeled dataset.
In this way we know in advance the cluster
for each point and after running the clustering
algorithm we can compare the real clusters
with the results obtained.
Evaluation may be made using measures such as entropy and purity (known from the previous chapter – supervised learning).
Florin Radulescu, Note de curs

29 DMDW-7
Entropy

Remember that if we have a dataset D = {e1, e2, …, em} with examples labeled with classes from C = {c1, c2, …, cn}, the entropy of D can be computed as:

entropy(D) = - Σj=1..n Pr(cj) · log2 Pr(cj)

After clustering, D is split in r disjoint subsets D1, D2, …, Dr. The combined entropy of these subsets is:

entropy(D1, …, Dr) = Σi=1..r (|Di| / |D|) · entropy(Di)
Florin Radulescu, Note de curs

30 DMDW-7
Purity

 For each cluster Di, the purity of the cluster is the probability of the most frequent class:

purity(Di) = maxj Pr(cj | Di)

We can compute a purity of the whole clustering process by combining the purities of the resulting clusters:

purity(D1, …, Dr) = Σi=1..r (|Di| / |D|) · purity(Di)
Florin Radulescu, Note de curs

31 DMDW-7
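As an illustration, here is a small Python sketch (an assumed helper, not part of the original material) that computes both measures from the true class labels and the cluster assignments:

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def clustering_entropy_and_purity(true_labels, cluster_ids):
    total = len(true_labels)
    clusters = {}
    for label, cid in zip(true_labels, cluster_ids):
        clusters.setdefault(cid, []).append(label)
    # combined entropy and purity, weighted by the relative cluster sizes
    ent = sum(len(m) / total * entropy(m) for m in clusters.values())
    pur = sum(len(m) / total * max(Counter(m).values()) / len(m)
              for m in clusters.values())
    return ent, pur

# Two clusters; the second one mixes two classes
print(clustering_entropy_and_purity(['a', 'a', 'b', 'b', 'b', 'a'],
                                    [1, 1, 2, 2, 2, 2]))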
Ground truth
 These measures are usually used when
comparing two clustering algorithms on the
same labeled dataset.
 Other measures that can be used are
precision, recall and F-score. The
expressions for these measures were also
presented in the previous chapter.
 The real problem is that an algorithm may perform well on one dataset and not so well on another dataset.
Florin Radulescu, Note de curs

32 DMDW-7
3. Cohesion and separation
 Other measures that can be used to evaluate the
clustering algorithm are based on internal
information:
1. Intra-cluster cohesion measures the
compactness of the clusters.
 Using the sum of squares of the distances (SSD) from each point to its cluster center we obtain a measure of this cohesion:

SSD = Σj Σx∈Cj d(x, cj)²

 A small value is better than a bigger one.


Florin Radulescu, Note de curs

33 DMDW-7
Cohesion and separation

2. Inter-cluster separation (or isolation) measures how far the clusters are from one another.
 The distance between clusters may be computed in the known ways (single link, complete link, etc.).

Florin Radulescu, Note de curs

34 DMDW-7
4. Silhouette
 For each point D in a cluster, a 'silhouette' value can be computed, and this value is at the same time:
 a measure of the similarity of D with the points of its cluster
 a measure of the dissimilarity of D with the points of other clusters.
 Values are between -1 and 1. Positive values denote that D is similar with the points in its cluster and negative ones that D is not well assigned (it would be better assigned to another cluster).
Florin Radulescu, Note de curs

35 DMDW-7
Silhouette
Silhouette was introduced by Peter J. Rousseeuw in a 1987 article, "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis", published in Computational and Applied Mathematics.
Computing s(D) - the silhouette value of D - implies:
Compute a(D) = the average distance from D to all other points in its cluster.
Compute b(D) = the lowest average distance from D to all points in any other cluster.
Florin Radulescu, Note de curs

36 DMDW-7
Silhouette

Then:
s(D) = (b(D) - a(D)) / max(a(D), b(D))
Or:
1-a(D)/b(D) if a(D) < b(D)
s(D) = 0 if a(D) = b(D)
b(D)/a(D) -1 if a(D) > b(D)
So: -1 <= s(D) <= 1
Florin Radulescu, Note de curs

37 DMDW-7
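A direct translation of these formulas into Python (a sketch assuming points are coordinate tuples and clusters are lists of points):

import math

def silhouette(point, own_cluster, other_clusters):
    # a(D): average distance from the point to the other points of its cluster
    others = [p for p in own_cluster if p != point]
    if not others:                       # convention: s = 0 for singleton clusters
        return 0.0
    a = sum(math.dist(point, p) for p in others) / len(others)
    # b(D): lowest average distance to the points of any other cluster
    b = min(sum(math.dist(point, p) for p in c) / len(c)
            for c in other_clusters)
    return (b - a) / max(a, b)

Averaging this value over the points of a cluster measures the cohesion of that cluster; averaging over the whole dataset gives a single score that can be compared for different values of k.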
Silhouette
The average value of s for the points of a
cluster is a measure of the cohesion of the
points in the cluster.
Also, the average value of s for all the points
of the dataset is a measure of the
performance of the clustering process.
For k-means, if k is too big or too small, some
of the clusters have narrower silhouettes than
the rest. Examining clusters' silhouettes we
can determine the best value for k.
Florin Radulescu, Note de curs

38 DMDW-7
5. Indirect evaluation

In many cases clustering is made in order to


perform another task.
Example: customers are grouped based on
their buying habits for email marketing.
If the primary task (email marketing in our example) does not produce good results, it may mean that the clustering was not good enough.
In this way a clustering algorithm can be
rated based on another task results.
Florin Radulescu, Note de curs

39 DMDW-7
Road Map

K-Medoids, k-modes and k-means++


FastMap: multidimensional scaling
Cluster evaluation
Fuzzy clustering: fuzzy C-means
Clusters and holes
Summary

Florin Radulescu, Note de curs

40 DMDW-7
Fuzzy clustering
 Fuzzy logic was first proposed by Lotfi A. Zadeh of the
University of California at Berkeley in a 1965 paper.

Prof. Lotfi A. Zadeh and Prof. Mircea Petrescu in Bucharest


Florin Radulescu, Note de curs

41 DMDW-7
Fuzzy clustering

In the case of soft clustering, any point


belongs to more than one cluster and for
each (point, cluster) pair there is a value of
the membership level of that point to that
cluster.
One of the best-known algorithms in this class is fuzzy C-means.

Florin Radulescu, Note de curs

42 DMDW-7
The model
Input:
A dataset containing n elements (points), D =
{e1, e2, …, en}.
The number of clusters C
A level of cluster fuzziness, m
Output:
A list of centroids {c1, c2, …, cC}
A matrix U = [uij], i = 1…n, j = 1…C, and uij =
the level/degree of membership of element ei
to the cluster cj.
Florin Radulescu, Note de curs

43 DMDW-7
The model
 The process tries to minimize the objective function:

J = Σi=1..n Σj=1..C (uij)^m · dij²

where:
 uij and cj are as described above.
 dij is the distance from the element ei to the centroid cj
 m is the fuzziness factor and in many cases the default value is 2.
If m is close or equal to 1, uij is close to 0 or 1 so a
non-fuzzy solution is obtained (as in k-means).
When m is increased from 2 to bigger values, uij
have lower values and the clusters are fuzzier.
Florin Radulescu, Note de curs

44 DMDW-7
The algorithm
1. Choose randomly the initial cluster centers
2. REPEAT
3. Compute all dij values
4. Compute new values for the membership levels uij:

   uij = 1 / Σk=1..C (dij / dik)^(2/(m-1))

5. Compute new cluster centers cj:

   cj = ( Σi=1..n (uij)^m · ei ) / ( Σi=1..n (uij)^m )

6. UNTIL (stopping criteria are met)


Florin Radulescu, Note de curs

45 DMDW-7
Stopping criteria

The stopping criteria may include:

1. The number of iterations reached a given value.
2. The cluster centers movement after iteration i is below a certain threshold t:

   maxj || cj(i) - cj(i-1) || < t
Florin Radulescu, Note de curs

46 DMDW-7
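The loop above can be sketched in a few lines of Python/NumPy (variable names and the stopping test are assumptions; the update rules are the standard fuzzy C-means formulas shown on the previous slides):

import numpy as np

def fuzzy_c_means(X, C, m=2.0, max_iter=100, tol=1e-4, seed=0):
    # X: (n, dim) NumPy array of points; C: number of clusters; m: fuzziness
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), C, replace=False)]          # step 1
    for _ in range(max_iter):                                   # REPEAT
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                                   # avoid division by 0
        # step 4: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1)),
                         axis=2)
        # step 5: c_j = sum_i u_ij^m x_i / sum_i u_ij^m
        um = u ** m
        new_centers = (um.T @ X) / um.sum(axis=0)[:, None]
        moved = np.max(np.linalg.norm(new_centers - centers, axis=1))
        centers = new_centers
        if moved < tol:                                          # stopping criterion 2
            break
    return centers, u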
Road Map

K-Medoids, k-modes and k-means++


FastMap: multidimensional scaling
Cluster evaluation
Fuzzy clustering: fuzzy C-means
Clusters and holes
Summary

Florin Radulescu, Note de curs

47 DMDW-7
Clusters and holes

The space between the clusters is empty.
This portion of the space, containing no or very few points, may be called a hole.
Discovering holes is sometimes very useful because a position in a hole indicates that a certain combination of attribute values is not possible (or the probability of having that combination is very low).

Florin Radulescu, Note de curs

48 DMDW-7
Clusters and holes

Example: in a 2D space with dimensions


blood pressure and cholesterol level a hole
defines a range of pairs (blood-pressure,
cholesterol-level) with a zero or low
probability of occurrence.
There are several techniques for discovering regions that may be considered holes.
This section presents two of them.

Florin Radulescu, Note de curs

49 DMDW-7
Decision tree clusters

 One of the possible representations of a set of clusters is a decision tree.
 Example:
[Figure: example dataset in a 2D space (x axis from 0 to 12, y axis from 0 to 6) containing three clusters.]

Florin Radulescu, Note de curs

50 DMDW-7
Decision tree clusters

 The decision tree is:

 x >= 6.5            -> BLACK
 x < 6.5 and y <= 3  -> BLUE
 x < 6.5 and y > 3   -> RED

Florin Radulescu, Note de curs

51 DMDW-7
Decision tree clusters

Because holes contain no points, the trick for using supervised learning to discover them is the following:
Consider all points as having the same class (called existing points, E)
Add, uniformly spread, another type of points (called non-existing points, N)
The next figure is an illustration of this method
Florin Radulescu, Note de curs

52 DMDW-7
E and N points

 Green: N points, Black: E points


[Figure: the same 2D space (x from 0 to 12, y from 0 to 6) with uniformly spread N points (green) added among the E points (black).]

Florin Radulescu, Note de curs

53 DMDW-7
Processing
 A supervised learning algorithm can be used for
building a decision tree for separating the two
types of points: existing and non-existing.
 The decision tree is built using the best cut for
each axis, and this best cut is based on the
information gain.
 Because computing the information gain only requires the probability of each type of points in a given region, the non-existing points need not be physically added: because of their uniform spread, their probability is proportional to the area of that region.
Florin Radulescu, Note de curs

54 DMDW-7
Processing
 For the existing points, the probability for each sub-
region is computed by counting, as usual.
 The algorithm assumes that all regions are rectangular and that the number of N points in each region is at least equal to the number of E points.
 After each split of a rectangle, if the inherited
number of N points is less than the number of E
points, their number is increased to the number of E
points.
 The result is a decision tree splitting the space in
rectangles, some of them being clusters and the
others holes.
Florin Radulescu, Note de curs

55 DMDW-7
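To make the trick concrete, here is a small Python sketch (an illustration with assumed function names) of how the information gain of a candidate split can be computed without materializing the N points: their count in a sub-rectangle is simply taken proportional to its area.

import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def split_info_gain(e_left, e_right, area_left, area_right, n_total):
    # e_left / e_right: numbers of existing (E) points in the two sub-rectangles
    # N points are never generated: their counts follow the areas
    n_left = n_total * area_left / (area_left + area_right)
    n_right = n_total - n_left
    e_total, total = e_left + e_right, e_left + e_right + n_total
    before = entropy([e_total, n_total])
    after = ((e_left + n_left) / total) * entropy([e_left, n_left]) \
          + ((e_right + n_right) / total) * entropy([e_right, n_right])
    return before - after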
Result

[Figure: the resulting partition of the 2D space into rectangles; some rectangles are clusters and the others are holes.]

Florin Radulescu, Note de curs

56 DMDW-7
Maximal hyper rectangles

This is the second approach.


The goal is to find the maximal hyper
rectangles containing no or few data
points.
Note that the clusters in the previous
example are contained in three rectangles:
Cluster 1 (red): x >= 2, x <=5 , y>=3.5, y<=5
Cluster 2 (black) : x>=8, x<=11, y>=2, y<=4.5
Cluster 3 (blue): x>=4, x<=5, y>=0.5, y<=2.5
Florin Radulescu, Note de curs

57 DMDW-7
FR and MHR
Such a rectangle is called a filled region
(FR).
A maximal hyper rectangle is defined as
follows:
Definition: Given a k-dimensional continuous
space S and n FRs in S, a maximal hyper-
rectangle (MHR) in S is an empty HR that
does not intersect (in a normal sense) with
any FR, and has at least one FR lying on
each of its 2k bounding surfaces. These FRs
are called the bounding FRs of the MHR.
Florin Radulescu, Note de curs

58 DMDW-7
Algorithm
1. Let S be a k-dimensional continuous space and a
set of n FRs (not always disjoint) in S,
2. Start with one MHR, occupying the entire space
S.
3. Each FR is incrementally added to S. For each
insertion, the set of MHRs is updated:
 All the existing MHRs that intersect with this FR must
be removed from the set.
 For each dimension two new hyper-rectangle bounds
(lower and upper) are identified. If the new hyper-
rectangles verify the MHR definition and are
sufficiently large, insert them into the MHRs list.
Florin Radulescu, Note de curs

59 DMDW-7
Example

 Addition of a second FR (H2 after H1):

[Figure: two filled regions H1 and H2; the empty space around them forms the updated set of MHRs.]

Florin Radulescu, Note de curs

60 DMDW-7
Summary
This course presented:
 K-Medoids, k-modes and k-means++ where k-medoids and k-
modes are clustering algorithms and k-means++ is a method for
determining a better than random set of initial cluster centers for
k-means.
 FastMap: a multidimensional scaling algorithm to build a
Euclidean space given the distances between any two points
 Cluster evaluation techniques, including the silhouette method (see [Rousseeuw 87])
 Clusters and holes: how to determine regions with no or few data
points
 Fuzzy clustering and fuzzy C-means for performing soft
clustering.
 Next week: Semi-supervised learning

Florin Radulescu, Note de curs

61 DMDW-7
References
[Liu et al. 98] Bing Liu, Ke Wang, Lai-Fun Mun and Xin-Zhi Qi, "Using
Decision Tree Induction for Discovering Holes in Data," Pacific Rim
International Conference on Artificial Intelligence (PRICAI-98), 1998
[Liu et al. 00] Bing Liu, Yiyuan Xia, Phlip S. Yu. "Clustering through decision
tree construction." Proceedings of 2000 ACM CIKM International
Conference on Information and Knowledge Management (ACM CIKM-
2000), Washington, DC, USA, November 6-11, 2000
[Liu 11] Bing Liu, 2011. Web Data Mining, Exploring Hyperlinks, Contents,
and Usage Data, Second Edition, Springer, chapter 4.
[Huang 97] Huang, Z: A fast clustering algorithm to cluster very large categorical
data sets in data mining. In: SIGMOD Workshop on Research Issues on Data
Mining and Knowledge Discovery, pp. 1-8, 1997
[Huang 98] Huang, Z: Extensions to the k-Means Algorithm for Clustering
Large Data Sets with Categorical Values, DMKD 2, 1998,
http://www.cs.ust.hk/~qyang/Teaching/537/Papers/huang98extensions.pdf
[Torgerson 52] Torgerson, W.S. (1952). Multidimensional Scaling: Theory
and Method, Psychometrika, vol 17, pp. 401-419.
[Faloutsos, Lin 95] Faloutsos, C., Lin K.I. (1995). FastMap: A Fast
Algorithm for Indexing, Data-Mining and Visualization of Traditional and
Multimedia Datasets. In: Proceedings of the 1995 ACM SIGMOD
International Conference on Management of Data.
Florin Radulescu, Note de curs

62 DMDW-7
References
[Wang et al, 99] Wang, J.T-L., Wang, X., Lin, K-I., Shasha, D., Shapiro,
B.A., Zhang, K. (1999). Evaluating a class of distance-mapping
algorithms for data mining and clustering, In: Proc of ACM KDD, pp.
307-311.
[de Silva, Tenenbaum 04]de Silva, V., Tenenbaum J.B. (2004). Sparse
multi-dimensional scaling using landmark points,
[Yang et al 06] Yang, T., Liu, J., McMillan, L., Wang, W., (2006). A Fast
Approximation to Multidimensional Scaling, In: Proceedings of the
ECCV Workshop on Computation Intensive Methods for Computer
Vision (CIMCV).
[Platt 05] Platt, J.C., (2005). FastMap, MetricMap, and Landmark MDS
are all Nyström Algorithms, In: 10th International Workshop on Artificial
Intelligence and Statistics, pp. 261-268.
[Bezdek 81] Bezdek, James C. (1981). Pattern Recognition with Fuzzy
Objective Function Algorithms. Kluwer Academic Publishers Norwell,
MA, USA, ISBN 0-306-40671-3
[Rousseeuw 87] Peter J. Rousseeuw (1987). "Silhouettes: a Graphical
Aid to the Interpretation and Validation of Cluster
Analysis". Computational and Applied Mathematics 20: 53–65

Florin Radulescu, Note de curs

63 DMDW-7
Partially Supervised Learning
Road Map

What is partially supervised learning


Learning from labeled and unlabeled data
Learning with positive and unlabeled data
Summary

Florin Radulescu, Note de curs

2 DMDW-8
Partially supervised learning
 In supervised learning the goal is to build a
classifier starting from a set of labeled examples.
 Unsupervised learning starts with a set of
unlabeled examples trying to discover the inner
structure of this set, like in clustering.
 Partially supervised learning (or semi-supervised learning) includes a series of algorithms and techniques using a (small) set of labeled examples and a (possibly large) set of unlabeled examples for performing classification or regression.
Florin Radulescu, Note de curs

3 DMDW-8
Partially supervised learning
 The need for such algorithms and techniques comes
from the cost of obtaining labeled examples.
 Labeling is in many cases done manually by experts, and the volume of these examples is sometimes small.
 When learning starts from a finite number of training examples in a high-dimensional space, and for each dimension the number of possible values is large, the amount of training data required to ensure that there are several samples for each combination of values is huge.

Florin Radulescu, Note de curs

4 DMDW-8
Hughes effect
 For a given number of training samples the
predictive power decreases as the dimensionality
increases.
 This phenomenon is called Hughes effect or
Hughes phenomenon, after Gordon F. Hughes
 He published in 1968 the paper "On the mean
accuracy of statistical pattern recognizers".
 Adding extra information to a small number of
labeled training examples will increase the
accuracy (by delaying the occurrence of the effect
described).
Florin Radulescu, Note de curs

5 DMDW-8
Effect of unlabeled examples

 Consider a 2D space containing only two labeled examples: a positive example and a negative example.

Florin Radulescu, Note de curs

6 DMDW-8
Effect of unlabeled examples

 Based on these labeled examples a


classifier may be built, represented by the
dotted line.
The points at the left of the line will be
classified as positive and the others as
negative.
This classifier is not very accurate
because the number of the labeled
examples is too small.
Florin Radulescu, Note de curs

7 DMDW-8
Effect of unlabeled examples

Suppose now that several unlabeled


examples are added, as in the next figure:

Florin Radulescu, Note de curs

8 DMDW-8
Effect of unlabeled examples

The unlabeled examples that are near or


linked in some way to the two labeled
examples may be considered having the
same label and the classifier changes.
Unlabeled examples form two clusters,
one containing the positive example and
the other the negative one.

Florin Radulescu, Note de curs

9 DMDW-8
Effect of unlabeled examples

Consequently, the border between the


positive area and negative area has an
irregular shape (the dotted line).
These two classifiers, in terms of
accuracy, are very different, the second
being much more accurate than the first.

Florin Radulescu, Note de curs

10 DMDW-8
Positive and unlabeled examples

Another illustration is for the case of a


training set containing only positive examples
(or examples belonging to the same class).
The next figure shows six points labeled as
positive placed in a 2D space.
Because there are no negative examples,
there is no way to determine a separation
between the positive area and negative area.

Florin Radulescu, Note de curs

11 DMDW-8
Positive and unlabeled examples

 All the lines in the figure may, at this point, be separation lines.

Florin Radulescu, Note de curs

12 DMDW-8
Positive and unlabeled examples

 In the following figure, several unlabeled examples are added.

Florin Radulescu, Note de curs

13 DMDW-8
Positive and unlabeled examples

Some unlabeled examples are placed


near the positive examples and some
other unlabeled examples are placed
separately.
The natural assumption is that examples
in the first category are positive examples
and the other examples are negative.

Florin Radulescu, Note de curs

14 DMDW-8
Distribution of unlabeled examples

The effect of using unlabeled examples is related to whether labeled and unlabeled examples come from the same distribution or from different distributions.
There are three cases, derived from the
treatment of missing data (here missing
data = label is missing)

Florin Radulescu, Note de curs

15 DMDW-8
MCAR

MCAR = Missing Completely At Random.


Consider an example with the vector of
attribute values x and a class y. For MCAR
we have:
P (labeled=1|x, y) = P (labeled=1),
Labeled and unlabeled examples come
from the same distribution (or the fact that
an example is labeled or not is not related
to the attribute values or the class of the
example).
Florin Radulescu, Note de curs

16 DMDW-8
MAR

 MAR = Missing At Random.


 In this case:
P (labeled=1| x, y) = P (labeled=1| x),
 The probability for an example to be labeled is not
related to the class.
 We have also:
P(y=1| x, labeled=1) = P(y=1| x, labeled=0) = P(y=1| x),
 So, for a fixed x, the class distribution is the same among the labeled and the unlabeled examples.
 But in this case the conditional distribution of x given
y is not the same in labeled and unlabeled data.
Florin Radulescu, Note de curs

17 DMDW-8
MNAR

MNAR = Missing Not At Random.
In this case:
P (labeled=1| x, y) ≠ P (labeled=1| x)
The probability of being labeled depends on the class, so labeled and unlabeled examples are not from the same distribution.

Florin Radulescu, Note de curs

18 DMDW-8
Road Map

What is partially supervised learning


Learning from labeled and unlabeled data
Learning with positive and unlabeled data
Summary

Florin Radulescu, Note de curs

19 DMDW-8
Learning from labeled and unlabeled data

The next slides present some techniques for using unlabeled data along with a training set containing labeled examples belonging to all classes, based on the [Chawla, Karakoulas 2005] study.
The study evaluates four learning techniques:
Co-training
ASSEMBLE
Re-weighting
Expectation-Maximization
Florin Radulescu, Note de curs

20 DMDW-8
Learning from labeled and unlabeled data

Co-training and ASSEMBLE assume a


MCAR distribution and the other two
techniques a MAR one.
The study used Naïve Bayes as
underlying supervised learner and, for co-
training (that requires two classifiers), the
second classifier was C4.5.

Florin Radulescu, Note de curs

21 DMDW-8
Co-training (Blum and Mitchell)

 Co-training was proposed by Blum and Mitchell in the


paper “Combining labeled and unlabeled data with co-
training” presented in 1998 at the Workshop on
Computational Learning Theory:
 The attributes x describing examples can be split in two disjoint subsets that are independent, or, in other words, the instance space X can be written as a Cartesian product:
X = X1 × X2
where X1 and X2 correspond to two different views of an example
 Alternate definition:
each example x is given as a pair: x = (x1, x2)
Florin Radulescu, Note de curs

22 DMDW-8
Co-training (Blum and Mitchell)

The main assumption is that X1 and X2 are each sufficient for learning a classifier.
The example presented in the original
article is a set of web pages.
Each page is described by
x1 = {words on the web page} and also by
x2 = {words on the links pointing to the web
page}.
Florin Radulescu, Note de curs

23 DMDW-8
Co-training algorithm (v1)
1. Initially LA = LB = L, UA = UB = U
2. Build two classifiers, A from LA and X1 and B from
LB and X2
3. Allow A to label the set UA, obtaining L1
4. Allow B to label the set UB, obtaining L2
5. Based on confidence, select C1 from L1 and C2 from
L2 (subsets containing a number of most confident
examples for each class)
6. Add C1 to LB and subtract it from UB
7. Add C2 to LA and subtract it from UA
8. Go to step 2 until stopping criteria are met
Florin Radulescu, Note de curs

24 DMDW-8
Co-training (Blum and Mitchell)
 The process ends when there are no more
unlabeled examples or C1 and C2 are empty
 In that case there are some unlabeled examples
but the confidence of their classifications –
probability of the assigned class for example - is
below a given threshold.
 In the end, the final classifier is obtained by
combining A and B (the final two classifiers
obtained at step 2).
 The experiments described in the original article
are made using a slightly different form of the
algorithm, presented on the next slide.
Florin Radulescu, Note de curs

25 DMDW-8
Co-training algorithm (v0)
1. Given:
• A set L of labeled examples
• A set U of unlabeled examples
2. Create a pool U’ of examples by choosing u examples at
random from U.
3. Loop for k iterations:
3.1. Use L to train a classifier h1 that considers only the x1 portion of
x
3.2. Use L to train a classifier h2 that considers only the x2 portion of
x
3.3. Allow h1 to label p positive and n negative examples from U’
3.4. Allow h2 to label p positive and n negative examples from U’
3.5. Add these self-labeled examples to L
3.6. Randomly choose 2p + 2n examples from U to replenish U’
Florin Radulescu, Note de curs

26 DMDW-8
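A compact Python sketch of the loop above (an assumed implementation: clf1 and clf2 are any two binary classifiers exposing fit(X, y) and predict_proba(X), and the bookkeeping is simplified):

import random
import numpy as np

def co_training(clf1, clf2, X1, X2, y, U1, U2, p=1, n=3, k=30, u=75, seed=0):
    # X1, X2: the two views of the labeled examples; y: their labels (0/1)
    # U1, U2: the two views of the unlabeled examples (same order)
    rnd = random.Random(seed)
    X1, X2, y = list(X1), list(X2), list(y)
    unl = list(range(len(U1)))
    rnd.shuffle(unl)
    pool = [unl.pop() for _ in range(min(u, len(unl)))]          # the pool U'
    for _ in range(k):
        clf1.fit(np.array(X1), np.array(y))
        clf2.fit(np.array(X2), np.array(y))
        for clf, view in ((clf1, U1), (clf2, U2)):
            if not pool:
                break
            proba = clf.predict_proba(np.array([view[i] for i in pool]))[:, 1]
            order = np.argsort(proba)
            chosen = {int(j): 1 for j in order[-p:]}             # p most confident positives
            chosen.update({int(j): 0 for j in order[:n]
                           if int(j) not in chosen})             # n most confident negatives
            for j in sorted(chosen, reverse=True):
                idx = pool[j]
                X1.append(U1[idx]); X2.append(U2[idx]); y.append(chosen[j])
                del pool[j]
        while unl and len(pool) < u:                             # replenish U' from U
            pool.append(unl.pop())
    return clf1, clf2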
Co-training algorithm (v0)

The experiments were made using Naive


Bayes as h1 and h2, p = 1, n = 3, k = 30
and u = 75.
Beginning with 12 labeled web pages and using 1000 additional unlabeled web pages, the results were:
average error for learning only from labeled
data 11.1%;
average error using co-training 5.0%
Florin Radulescu, Note de curs

27 DMDW-8
Co-training results

 Results described in [Blum, Mitchell, 98]

                      Page-based   Link-based   Combined
                      classifier   classifier   classifier
Supervised training      12.9         12.4         11.1
Co-training               6.2         11.6          5.0

(error rates, in %)

Florin Radulescu, Note de curs

28 DMDW-8
Co-training (Goldman and Zhou)

In [Goldman, Zhou, 2000] there is another


approach in co-training:
o because not always is possible to split the feature
space X in two disjoint and independent
subspaces X1 and X2, use two different
algorithms for building the two classifiers h1 and
h2, on the same feature space.
o Each algorithm labels some unlabeled examples
and these will be included by the other algorithm
in its training data:
o if the two classifiers are denoted with A and B, A
will produce LB and B will produce LA.
Florin Radulescu, Note de curs

29 DMDW-8
Algorithm
1. Repeat until LA and LB do not change during an iteration. For each algorithm do
2. Train algorithm A on L ∪ LA to obtain the hypothesis HA (a hypothesis defines a partition of the instance space). Similarly for B
3. Each algorithm considers each of its equivalence classes and decides which one to use to label data from U for the other algorithm, using two tests. For A the tests are (similarly for B):
o The class k used by A to label data for B has accuracy at least as good as the accuracy of B.
o The conservative estimate of the class k is bigger than the conservative estimate of B.
(The conservative estimate is an estimation of 1/ε² where ε is the hypothesis error. This prevents the degradation of B's performance due to noise.)
4. All examples in U passing these tests are placed in LB (similarly, B places examples in LA).
5. End Repeat
6. At the end, combine HA and HB
Florin Radulescu, Note de curs

30 DMDW-8
ASSEMBLE
 ASSEMBLE is an ensemble algorithm presented in [Bennet et al, 2002]
It won the NIPS* 2001 Unlabeled data competition.
It alternates between assigning “pseudo-classes” to the instances from the unlabeled data set and constructing the next base classifier using both the labeled examples and the unlabeled examples
For the unlabeled examples, the previously assigned pseudo-class is considered.
*NIPS = Neural Information Processing Systems Conference

Florin Radulescu, Note de curs

31 DMDW-8
ASSEMBLE - advantages
 Any weight-sensitive classification algorithm can be
boosted using labeled and unlabeled data.
 ASSEMBLE can exploit unlabeled data to reduce the number of classifiers needed in the ensemble, therefore speeding up learning.
 ASSEMBLE works well in practice.
 Computational results show the approach is
effective on a number of test problems, producing
more accurate ensembles than AdaBoost using the
same number of base learners.

Florin Radulescu, Note de curs

32 DMDW-8
Re-weighting
 Re-weighting is a technique for reject-inferencing in
credit scoring presented in [Crook, Banasik, 2002].
 The main idea is to extrapolate information on the
examples from approved credit applications to the
unlabeled data.
 The re-weighting may be used if data is of the MAR
type, so the provided population model for all
applicants is the same as that for accepts only:
P(y=1| x, labeled=1) = P(y=1| x, labeled=0) = P(y=1| x),
 So: for a given x, the class distribution is the same in the labeled and in the unlabeled set.
Florin Radulescu, Note de curs

33 DMDW-8
Re-weighting
 All credit institutions have an archive of approved
applications and for each of these applications
there is also a Good/Bad performance label.
 Based on the classification variables used to
accept/reject an application, applications (past-
labeled but also those unlabeled) can be scored
and partitioned in score groups.
For every score group the distribution of classes in the labeled examples is then extrapolated to the unlabeled examples of the same score group, picking examples at random from it.
Florin Radulescu, Note de curs

34 DMDW-8
Re-weighting example

Suppose we have a set of previously accepted applications (Labeled column in the table below) and a set of unlabeled applications. Each application has a score and can be included in a score group (there are 5 score groups below):

Score group  Unlabeled  Labeled  Class0/Bad  Class1/Good  Group weight  Re-w. Class0  Re-w. Class1
0.0-0.2         10         10         6            4           2             12             8
0.2-0.4         10         20        10           10           1.5           15            15
0.4-0.6         20         60        20           40           1.33          27            53
0.6-0.8         20        100        10           90           1.2           12           108
0.8-1.0         20        200        10          190           1.1           11           209

Florin Radulescu, Note de curs

35 DMDW-8
Re-weighting example
 The group weight is computed as (XL + XU) / XL.
 For every score group the weight is used to compute
the number of examples of class0 and class1 from
the whole score group examples (labeled and
unlabeled).
 Example: for score group 0.8-1.0, the weight is 1.1
so re-weighting class0 and class1 we obtain 10*1.1
= 11 for class0 and 190*1.1 = 209 for class1.
 It means that we pick at random 11-10=1 example
from the unlabeled set and label it as class0 and
209-190 = 19 examples (the rest of them, 20-1) and
label them as class1.
Florin Radulescu, Note de curs

36 DMDW-8
Re-weighting example
 This procedure is run for every score group.
 At the end, all unlabeled examples have a
class0/class1 label.
 Note that class0/class1 is not the same as
rejected/accepted; the initial set of labeled
examples contains only accepted applications!
 Using the whole set of examples (L+U), all now having a class0/class1 label, we can learn a new classifier that incorporates not only the data from the labeled examples but also information from the unlabeled ones.
Florin Radulescu, Note de curs

37 DMDW-8
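The computation illustrated above can be written as a short Python sketch (data structures and names are assumptions made for this illustration):

import random

def reweight_group(labeled_class0, labeled_class1, unlabeled, seed=0):
    # labeled_class0 / labeled_class1: counts of Bad / Good labeled examples
    # unlabeled: the unlabeled examples of the same score group
    n_labeled = labeled_class0 + labeled_class1
    weight = (n_labeled + len(unlabeled)) / n_labeled            # (XL + XU) / XL
    extra_class0 = round(labeled_class0 * weight) - labeled_class0
    picked = list(unlabeled)
    random.Random(seed).shuffle(picked)
    # the first extra_class0 unlabeled examples become class0, the rest class1
    return ([(e, 'class0') for e in picked[:extra_class0]] +
            [(e, 'class1') for e in picked[extra_class0:]])

# Score group 0.8-1.0 from the table: 10 Bad, 190 Good labeled, 20 unlabeled
newly = reweight_group(10, 190, list(range(20)))
print(sum(1 for _, c in newly if c == 'class0'))                 # prints 1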
Expectation-Maximization
 Expectation-maximization is an iterative method
for finding maximum likelihood or maximum a
posteriori (MAP) estimates of parameters in
statistical models, where the model depends on
unobserved latent variables (see Wikipedia).
 It consists of an iterative process having two steps:
1. Expectation step: using the current estimates of the parameters, guess a probability distribution over completions of the missing data
2. Maximization step: compute new estimates of the parameters using these completions
Florin Radulescu, Note de curs

38 DMDW-8
Expectation-Maximization
In [Liu 11] and [Nigam et al, 98] the process is
described as follows:
Initial: Train a classifier using only the set of
labeled documents.
Loop:
Use this classifier to label (probabilistically) the
unlabeled documents (E step)
Use all the documents to train a new classifier (M
step)
Until convergence.
Florin Radulescu, Note de curs

39 DMDW-8
Expectation-Maximization
 For Naïve Bayes, the expectation step means computing for every class cj and every unlabeled document di the probability Pr(cj | di; Θ).
 Notations are:
 ci – class ci
 D – the set of documents
 di – a document di in D
 V – the word vocabulary (set of significant words)
 wdi,k – the word in position k in document di
 Nti – the number of times that word wt occurs in document di
 Θ – the set of parameters of all components: for each mixture component j, its mixture weight (or mixture probability) and its parameters θj. K is the number of mixture components.

Florin Radulescu, Note de curs

40 DMDW-8
Expectation step

Expectation step: compute class labels (probabilities):

Pr(cj | di; Θ) = Pr(cj | Θ) · Pr(di | cj; Θ) / Pr(di | Θ)
              = Pr(cj | Θ) · Πk Pr(wdi,k | cj; Θ) / Σr [ Pr(cr | Θ) · Πk Pr(wdi,k | cr; Θ) ]

Florin Radulescu, Note de curs

41 DMDW-8
Maximization step

Maximization step: re-compute the parameters:

Pr(wt | cj; Θ) = ( 1 + Σi Nti · Pr(cj | di) ) / ( |V| + Σs Σi Nsi · Pr(cj | di) )

Pr(cj | Θ) = Σi Pr(cj | di) / |D|

Florin Radulescu, Note de curs

42 DMDW-8
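A compact NumPy sketch of the two steps for text data (an assumed layout: counts is the |D| x |V| matrix of word counts Nti and resp is the |D| x K matrix of current Pr(cj|di) values; Laplace smoothing is used for the word probabilities as in [Nigam et al, 98]):

import numpy as np

def em_iteration(counts, resp):
    # M step: re-estimate the parameters from the current Pr(cj|di) values
    word_given_class = 1.0 + counts.T @ resp            # |V| x K, Laplace smoothed
    word_given_class /= word_given_class.sum(axis=0, keepdims=True)
    class_prior = resp.sum(axis=0) / resp.shape[0]       # Pr(cj)
    # E step: re-compute Pr(cj|di) with the naive Bayes independence assumption
    log_post = np.log(class_prior) + counts @ np.log(word_given_class)
    log_post -= log_post.max(axis=1, keepdims=True)      # numerical stability
    resp = np.exp(log_post)
    resp /= resp.sum(axis=1, keepdims=True)
    return class_prior, word_given_class, resp

In the semi-supervised setting the rows of resp corresponding to labeled documents are reset to their known 0/1 labels after every E step, and the iteration is repeated until the parameters stop changing.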
Expectation-Maximization
 EM algorithm works well if the two mixture model
assumptions for a particular data set are true:
o The data (or the text documents) are generated
by a mixture model,
o There is one-to-one correspondence between
mixture components and document classes.
 In many real-life situations these two assumptions
are not met.
 For example, the class Sports may contain
documents about different sub-classes such as
Football, Tennis, and Handball.
Florin Radulescu, Note de curs

43 DMDW-8
Road Map

What is partially supervised learning


Learning from labeled and unlabeled data
Learning with positive and unlabeled data
Summary

Florin Radulescu, Note de curs

44 DMDW-8
Positive and unlabeled data
 Sometimes all labeled examples are only from the
positive class. Examples (see [Liu 11]):
 Given a collection of papers on semi-supervised learning, find all semi-supervised learning papers in a proceedings volume or another collection of documents.
 Given the browser bookmarks of a person, find other
documents that may be interesting for that person.
 Given the list of customers of a direct marketing company,
identify other persons (from a person database) that may
be also interested in those products.
 Given the approved and good (as performance)
applications from a credit company, identify other persons
that may be interested in getting a credit.

Florin Radulescu, Note de curs

45 DMDW-8
Theoretical foundation
 Suppose we have a classification function f and an input vector X labeled with class Y, where Y ∈ {1, -1}. We rewrite the probability of error:
Pr[f(X) ≠ Y] = Pr[f(X) = 1 and Y = -1] + Pr[f(X) = -1 and Y = 1]    (1)
 Because:
Pr[f(X) = 1 and Y = -1] = Pr[f(X) = 1] – Pr[f(X) = 1 and Y = 1] = Pr[f(X) = 1] – (Pr[Y = 1] – Pr[f(X) = -1 and Y = 1])
 Replacing in (1) we obtain:
Pr[f(X) ≠ Y] = Pr[f(X) = 1] – Pr[Y = 1] + 2·Pr[f(X) = -1 | Y = 1]·Pr[Y = 1]    (2)
 Pr[Y = 1] is constant.
 If Pr[f(X) = -1 | Y = 1] is small, minimizing the error is approximately the same as minimizing Pr[f(X) = 1].
Florin Radulescu, Note de curs

46 DMDW-8
Theoretical foundation
 If the sets of positive examples P and unlabeled
examples U are large, holding Pr[f(X) = -1|Y = 1]
small while minimizing Pr[f(X) = 1] is
approximately the same as:
o minimizing PrU[f(X) = 1]
o while holding PrP[f(X) = 1] ≥ r (where r is recall
Pr[f(X)=1| Y=1]) which is the same as (PrP[f(X) = -1] ≤
1 – r)
 In other words:
o The algorithm tries to minimize the number of
unlabeled examples labeled as positive
o Subject to the constraint that the fraction of errors
on the positive examples is no more than 1-r.
Florin Radulescu, Note de curs

47 DMDW-8
2-step strategy
 For implementing the theory above there is a 2-step strategy
(presented in [Liu 11]):
 Step 1: Identify in the unlabeled examples a subset called
“reliable negatives” (RN).
 These examples will be used as negative labeled examples in
the next step.
 We start with only positive examples but must build a negative
labeled set in order to use a supervised learning algorithm for
building the model (classifier)
 Step 2: Build a sequence of classifiers by iteratively applying
a classification algorithm and then selecting a good classifier.
 In this step we can use Expectation Maximization or SVM for
example.

Florin Radulescu, Note de curs

48 DMDW-8
Obtaining reliable negatives (RN)

 Building the reliable negative set is really


the key in this case.
There are several methods ([Zhang, Zuo
2009] ):
Spy technique
1-DNF algorithm
Naïve Bayes
Rocchio
(see https://www.comp.nus.edu.sg/~leews/publications/ICDM-03.pdf)
Florin Radulescu, Note de curs

49 DMDW-8
Spy technique
 In this technique, first randomly select a set S of positive documents from P and put them in U.
 These examples are the spies.
 They behave identically to the unknown positive documents in U.
 Then, using the I-EM algorithm with (P - S) as positive and U ∪ S as negative, a classifier is obtained.
 The probabilities assigned to the documents in S are used to decide a probability threshold th that identifies possible negative documents in U:
 all documents with a probability lower than that of any spy will be assigned to RN
Florin Radulescu, Note de curs

50 DMDW-8
Spy algorithm
1. RN = {};
2. S = Sample(P, s%);
3. US = U ∪ S;
4. PS = P - S;
5. Assign each document in PS the class label 1;
6. Assign each document in US the class label -1;
7. I-EM(US, PS); // This produces a Naïve Bayes classifier.
8. Classify each document in US using the NB classifier;
9. Determine a probability threshold th using S;
10. For each document d ∈ US
11. If its probability Pr(1|d) < th
12. Then RN = RN ∪ {d};
13. End If
14. End For
Florin Radulescu, Note de curs

51 DMDW-8
1-DNF algorithm
The algorithm builds a so-called positive
feature set (PF) containing words that occur
in the positive examples set of documents P
more frequently than in the unlabeled
examples set U.
Then using PF it tries to identify (for filtering
out) possible positive documents from U.
A document in U that does not have any
positive feature in PF is regarded as a strong
negative document.
Florin Radulescu, Note de curs

52 DMDW-8
Algorithm
1. PF = {}
2. For i = 1 to n
3. If (freq(wi, P)/|P| > freq(wi, U)/|U|)
4. Then PF = PF ∪ {wi}
5. End if
6. End for
7. RN = U;
8. For each document d ∈ U
9. If (∃ wi with freq(wi, d) > 0 and wi ∈ PF)
10. Then RN = RN - {d}
11. End if
12. End for

Florin Radulescu, Note de curs

53 DMDW-8
Naïve Bayes
 In this case, a classifier is built considering all
unlabeled examples as negative. Then the
classifier is used to classify U and the negative
labeled examples form the reliable negative set.
 The algorithm is :
1. Assign label 1 to each document in P;
2. Assign label –1 to each document in U;
3. Build a NB classifier using P and U;
4. Use the classifier to classify U. Those documents in
U that are classified as negative form the reliable
negative set RN.
Florin Radulescu, Note de curs

54 DMDW-8
Rocchio
The algorithm of building RN is the same as
for Naïve Bayes with the difference that at
step 3 a Rocchio classifier is built instead of a
Naïve Bayes one.
Rocchio builds a prototype vector for each
class (a vector describing all documents in
the class) and then using the cosine similarity
finds the class for test examples: the class of
the prototype most similar with the given
example.
Florin Radulescu, Note de curs

55 DMDW-8
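A minimal sketch of this prototype-and-cosine idea (simplified: the prototype here is just the mean vector of the class; the full Rocchio formula also subtracts a weighted mean of the other classes):

import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def rocchio_prototypes(doc_vectors, labels):
    # doc_vectors: (n, |V|) tf-idf or count matrix; labels: class of each row
    labels = np.array(labels)
    return {c: doc_vectors[labels == c].mean(axis=0) for c in set(labels)}

def rocchio_classify(doc_vector, prototypes):
    # assign the class of the most similar prototype
    return max(prototypes, key=lambda c: cosine(doc_vector, prototypes[c]))

For building RN, the prototypes are fitted on P (label 1) and U (label -1), and the documents of U classified as -1 form the reliable negative set.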
Summary

This course presented:


What is partially supervised learning, with
illustrations of the impact of unlabeled data
Learning from labeled and unlabeled data,
where were presented co-training,
ASSEMBLE, re-weighting and EM
Learning with positive and unlabeled
examples
Next week: web usage mining
Florin Radulescu, Note de curs

56 DMDW-8
References
[Liu 11] Bing Liu, 2011. Web Data Mining, Exploring Hyperlinks, Contents, and Usage
Data, Second Edition, Springer.
[Chawla, Karakoulas 2005] Nitesh V. Chawla, Grigoris Karakoulas, Learning From
Labeled And Unlabeled Data: An Empirical Study Across Techniques And Domains,
Journal of Artificial Intelligence Research, volume 23, 2005, pages 331-366.
[Nigam et al, 98] Kamal Nigam, Andrew McCallum, Sebastian Thrun, Tom Mitchell,
Using EM to Classify Text from Labeled and Unlabeled Documents, Technical Report
CMU-CS-98-120. Carnegie Mellon University. 1998
[Blum, Mitchell, 98] Blum, A., Mitcell, T. Combining labeled and unlabeled data with co-
training, Procs. Of Workshop on Computational Learning Theory, 1998.
[Goldman, Zhou, 2000] Sally Goldman, Yan Zhou, Enhancing Supervised Learning with
Unlabeled Data, Proceedings of the Seventeenth International Conference on Machine
Learning (ICML), 2000, pages 327 – 334
[Bennet et al, 2002] Bennet, K., Demiriz, A., Maclin, R., Exploiting unlabeled data in
ensemble methods, Procs. Of the 6th Intl. Conf. on Knowledge Discovery and
Databases, 2002, pages 289-296.
[Crook, Banasik, 2002] Sample selection bias in credit scoring models, Intl. Conf.on
Credit Risk Modeling and Decisioning, 2002.
[Zhang, Zuo 2009] Bangzuo Zhang, Wanli Zuo, Reliable Negative Extracting Based on
kNN for Learning from Positive and Unlabeled Examples, Journal of Computers, vol. 4,
no. 1, 2009
[Wikipedia] Wikipedia, the free encyclopedia, en.wikipedia.org
Florin Radulescu, Note de curs

57 DMDW-8


Web usage mining

Prof.dr.ing. Florin Radulescu


Universitatea Politehnica din Bucureşti
Road Map

Objectives and approaches in weblog


mining
Web log formats
Statistic approaches
Data mining approaches
Summary

Florin Radulescu, Note de curs

2 DMDW-9
Objectives
Weblog mining methods, techniques and
algorithms are intended to discover patterns
in clickstreams recorded by the web servers
and also profiles of the users interacting with
them.
The input data are:
1. Web server logs, and particularly the access
logs. A web server maintains other logs also (for
example error logs) that are not discussed in
this lesson
Florin Radulescu, Note de curs

3 DMDW-9
Objectives
The input data are – cont.:
2. Site structure. The link structure of the site is used to perform path completion. This means that pages seen in the browser window but not requested from the web server due to caching (proxy or local) are determined using this structure
3. Site content. The content for each page can be
used to attach different event labels (product
view, buy/bid, etc) to different pages for better
understanding of surfer behavior.
Florin Radulescu, Note de curs

4 DMDW-9
Objectives

The input data are – cont.:


4. Information on visitors – not always
available. If the users of a website are
authenticated and their account contains
other profile information (age, gender,
annual revenue, etc), a data mining
application can use this information for
knowledge extraction.
5. Application data, specific to the particular
website.
Florin Radulescu, Note de curs

5 DMDW-9
Tasks in web mining
 There are four types of tasks in web mining (see
[Kosala, Blockeel, 2000]):
1. Resource finding: the task of retrieving intended
Web documents.
2. Information selection and pre-processing:
automatically selecting and pre-processing specific
information from retrieved Web resources.
3. Generalization: automatically discovers general
patterns at individual Web sites as well as across
multiple sites.
4. Analysis: validation and/or interpretation of the
mined patterns.
Florin Radulescu, Note de curs

6 DMDW-9
Categories of tasks

 Three categories of tasks in web mining:


 Web content mining
 Web structure mining
 Web usage mining

Florin Radulescu, Note de curs

7 DMDW-9
Web content mining
Web content mining is dedicated to the
extraction and integration of data, information
and knowledge from Web page contents, no
matter the structure of the website.
The hyperlinks contained in each page or the
hyperlinks pointing to them are not relevant in
that case, only the information content.
In [Cooley et al, 97] web content mining is
also split in two approaches:
the agent-based approach and
the database approach.
Florin Radulescu, Note de curs

8 DMDW-9
Agent based approach
 The objective is to build intelligent tools for information
retrieval:
 Intelligent Search Agents. In this case, intelligent Web
agents are developed. These agents search for relevant
information using domain characteristics and user profiles,
then organize and interpret the discovered information.
 Information Filtering/Categorization. In this case, the
agents use information retrieval techniques and
characteristics of open hypertext Web documents to
automatically retrieve, filter, and categorize them.
 Personalized Web Agents. In the third case, the agents
learn about user preferences and discover Web
information based on them (also preferences of similar
users may be used).

Florin Radulescu, Note de curs

9 DMDW-9
Database approach
The objectives involve improvements of the
management for semi-structured data
available on the Web.
Multilevel Databases. At the lowest level of the database there is semi-structured information stored in Web repositories (hypertext documents), and at the higher levels meta data or generalizations are extracted and organized using a relational or object-oriented model
Web Query Systems. In this case, specialized
query languages are used for querying the Web.
Examples are W3QL, WebLog, Lorel, UnQL, etc.
Florin Radulescu, Note de curs

10 DMDW-9
Web structure mining
 Web structure mining uses graph theory to
analyze the node and connection structure of a
web site (see also [Wikipedia]). The new research
area emerged in the domain is called Link Mining.
 The following summarization of link mining is from
[da Costa, Gong 2005]:
1. Link-based Classification. In this case the task is
to focus on the prediction of the category of a web
page, based on words that occur on the page, links
between pages, anchor text, html tags and other
possible attributes found on the web page.

Florin Radulescu, Note de curs

11 DMDW-9
Web structure mining
 Summarization of link mining – cont.:
2. Link-based Cluster Analysis. Cluster analysis finds
naturally occurring sub-classes. In that case the data
is clustered with similar objects in the same cluster,
and dissimilar objects in different clusters. Link-
based cluster analysis is unsupervised so it can be
used to discover hidden patterns in data.
3. Link Type. The goal is to predict the existence of
links, the type of link, or the purpose of a link.
4. Link Strength. In this approach links are weighted
(importance, etc).
5. Link Cardinality. The goal is to compute a
prediction for the number of links between objects.
Florin Radulescu, Note de curs

12 DMDW-9
Applications

The most known practical applications in this


area are Page Rank (used by Google) and Hubs
and Authorities.
In the first case, the importance of a page is
computed based on the importance of its
ancestors (an ancestor is a page containing a
link to that page).
In the second case, each page has a measure
for being a hub (or an index) and another
measure of being an authority.
Florin Radulescu, Note de curs

13 DMDW-9
Applications
Authorities are pages containing information
about a topic, and hubs are pages not
containing actual information, but links to
pages containing topic information.
The measure of being hub or authority are
computed recursively: the authority measure
is the sum of hub measures for the hubs
pointing at it and the hub measure is the sum
of the authority measures for the pages
referred by that page.
Florin Radulescu, Note de curs

14 DMDW-9
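The mutual recursion between hub and authority scores is easy to sketch as a power iteration in Python (a simplified illustration of the idea, not the original algorithm's implementation):

def hits(links, n_iter=50):
    # links: dict page -> list of pages it points to
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(n_iter):
        # authority(p) = sum of the hub scores of the pages pointing to p
        auth = {p: sum(h for q, h in hub.items() if p in links.get(q, ()))
                for p in pages}
        # hub(p) = sum of the authority scores of the pages p points to
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth

print(hits({'A': ['C'], 'B': ['C'], 'C': []}, n_iter=10))
# C gets the highest authority score, A and B the highest hub scores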
Web usage mining
 Web usage mining tries to predict user behavior when interacting
with the Web. This is the main topic to be discussed in detail in this
lesson.
 Data involved in web usage mining may be classified in four
categories:
1. Usage data. Here we have server, client and proxy logs. There
are several problems encountered here in identifying users and
sessions based on their IP address (see [Srivastava et al., 2000]):
o Single IP address / Multiple Server Sessions: because
several users access the web server via an ISP provider
and the provider allow the access using some proxies,
many users have the same IP address in the web server
access log in the same period.

Florin Radulescu, Note de curs

15 DMDW-9
Web usage mining
 Problems encountered - cont.:
o Multiple IP address / Single Server Session: also
because of the ISP policy, accesses of the same user
session can be assigned to different proxies, so having
different IP addresses in the web server access log.
o Multiple IP address / Single User: the same user
accessing the web from different computers will be
recorded with different IP addresses for different
sessions.
o Multiple agent / Single User: The same user may use
several browsers, even on the same computer, so will
be recorded in the log files with different user agents.

Florin Radulescu, Note de curs

16 DMDW-9
Web usage mining
2. Content data. The website contains
documents in HTML or other format or
dynamic pages generated from scripts and
related databases.
 The content of a page can be used for
associating events or other semantic that
can be used in the process of web usage
mining.
 Web pages also contain meta data such as descriptive keywords, document attributes, semantic tags, etc.
Florin Radulescu, Note de curs

17 DMDW-9
Web usage mining

3. Structure data. This data captures the link structure of the website.
 Links are between pages but also intra
page links (from a position in a document
to another position in the same or other
document), even for dynamically
generated pages.
 This structure is used for example in path completion (see the related paragraph).
Florin Radulescu, Note de curs

18 DMDW-9
Web usage mining

4. User data. In some cases additional


information on users is available:
personal information about the user
(gender, age, revenue), domain of
interests, past activity (bids, purchases),
past visits history, and so on.
 This information can also be used for the
web usage mining process.
Florin Radulescu, Note de curs

19 DMDW-9
Road Map

Objectives and approaches in weblog


mining
Web log formats
Statistic approaches
Data mining approaches
Summary

Florin Radulescu, Note de curs

20 DMDW-9
Web log formats

There are several log file formats described in


the domain literature. The best known is the
Common Logfile Format, described in [W3.org
1] as follows:
remotehost rfc931 authuser [date] "request"
status bytes

Example:
127.0.0.1 - frank [10/Oct/2015:13:55:36 -0700] "GET
/apache_pb.gif HTTP/1.0" 200 2326

Florin Radulescu, Note de curs

21 DMDW-9
What is everything
 Elements from the previous definition:
Field: Remote host address – The IP address of the client that made the request.

Field: Remote log name – Usually not used. It was provided for the case of a client machine running an ident protocol server (identd) - see RFC 1413.

Field: User name – The name of the authenticated user that accessed the server. Anonymous users are indicated by a hyphen. The best practice is for the application always to provide the user name.

Field: Date, time, and Greenwich mean time (GMT) offset – The local date and time at which the activity occurred. The offset from Greenwich mean time is also indicated.

Field: Request and protocol version – The HTTP protocol version that the client used.

Field: Service status code – The HTTP status code. (A value of 200 indicates that the request completed successfully.)

Field: Bytes sent – The number of bytes sent by the server.


(see https://msdn.microsoft.com/en-us/library/windows/desktop/aa814379(v=vs.85).aspx)

Florin Radulescu, Note de curs

22 DMDW-9
What is everything
 For the previous example:

127.0.0.1 - frank [10/Oct/2015:13:55:36 -0700] "GET


/apache_pb.gif HTTP/1.0" 200 2326

 The remote host is 127.0.0.1, remote hostname is


unavailable (a hyphen indicates such a case),
authuser is frank, the date is October 10, 2015,
with the time indicated, the request is a GET for a
gif file placed in Document Root, the status code is
200 – success and the document length (transfer
length) is 2326.
Florin Radulescu, Note de curs

23 DMDW-9
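A line in this format can be parsed with a simple regular expression, for example (a sketch; the field names in the resulting dictionary are chosen for this illustration):

import re

CLF = re.compile(r'(\S+) (\S+) (\S+) \[(.*?)\] "(.*?)" (\d{3}) (\S+)')

def parse_clf(line):
    m = CLF.match(line)
    if m is None:
        return None
    host, rfc931, user, when, request, status, size = m.groups()
    return {'host': host, 'user': user, 'time': when, 'request': request,
            'status': int(status), 'bytes': None if size == '-' else int(size)}

print(parse_clf('127.0.0.1 - frank [10/Oct/2015:13:55:36 -0700] '
                '"GET /apache_pb.gif HTTP/1.0" 200 2326'))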
Combined Log File Format

 The Combined Log File Format adds


some other information, most important
being:
referrer This gives the site that the client (user agent) reports
having been referred from.

user-agent This is the identifying information that the client


browser reports about itself

Florin Radulescu, Note de curs

24 DMDW-9
Example
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET
/apache_pb.gif HTTP/1.0" 200 2326
"http://www.example.com/start.html" "Mozilla/4.08 [en]
(Win98; I ;Nav)"
 The first seven fields are the same.
 The last two fields indicated the referrer as start.html
from www.example.com and the user agent as
Netscape.
 Mozilla was originally the codename for the defunct
Netscape Navigator software project, along with
Netscape's mascot, a cartoon reptile inspired by
Godzilla, - see [Wikipedia]. Now: Firefox browser.
 There is also an Extended Log File Format.
Florin Radulescu, Note de curs

25 DMDW-9
Road Map

Objectives and approaches in weblog


mining
Web log formats
Statistic approaches
Data mining approaches
Summary

Florin Radulescu, Note de curs

26 DMDW-9
Statistic approaches
For obtaining statistics about a website there
are two possibilities:
1. Local statistics. There are several
packages that analyze the log file of the
webserver and present detailed statistics
about the accesses recorded in them. Some
examples are: Analog, W3Perl, AWStats,
Webalizer, etc.
2. External statistics. In this case behavioral
information cannot be obtained, only
statistics about visitors.
Florin Radulescu, Note de curs

27 DMDW-9
Examples

 External statistics from www.trafic.ro for


portal.edu.ro website (on July 7 2012,
13:00):

Florin Radulescu, Note de curs

28 DMDW-9
Examples – cont.

Florin Radulescu, Note de curs

29 DMDW-9
Examples – cont.

Florin Radulescu, Note de curs

30 DMDW-9
Road Map

Objectives and approaches in weblog


mining
Web log formats
Statistic approaches
Data mining approaches
Summary

Florin Radulescu, Note de curs

31 DMDW-9
Data mining approaches

 [Srivastava et al., 2000] describes the


web usage mining process as having the
following structure:
Data preprocessing
Pattern discovery
Pattern analysis

Florin Radulescu, Note de curs

32 DMDW-9
The web usage mining process

 Structure:
[Diagram: Log files and site files (site content) enter the Preprocessing stage, producing preprocessed clickstream data; Pattern discovery turns this data into rules, patterns and statistics; Pattern analysis then filters out the interesting rules, patterns and statistics.]

Florin Radulescu, Note de curs

33 DMDW-9
Data preprocessing

This step includes:


Data cleaning
Pageview identification
User identification
Session identification (sessionization) and
episode identification
Path completion
Data integration, including event identification
Florin Radulescu, Note de curs

34 DMDW-9
Data cleaning

Data cleaning tasks includes:


removal of unnecessary fields (the server
access logs contains fields that can be
removed in some cases being irrelevant for
the intended analysis),
removal of log entries coming from robots
(spider navigation) and
removal of erroneous entries (status not
success).
Florin Radulescu, Note de curs

35 DMDW-9
Pageview identification

 Pageview identification: A pageview is a


collection of Web objects or resources
representing a specific “user event,” e.g.,
clicking on a link, viewing a product page,
adding a product to the shopping cart.
In usual terms, a pageview is the
collection of objects that can be
materialized in what the browser window
shows at a given moment.
Florin Radulescu, Note de curs

36 DMDW-9
Pageview identification

For a usual site, with static pages and no


frames, each HTML file with all embedded
objects (music, images, etc) is a pageview.
Each pageview corresponds to several
entries in the web log. These entries must
be identified and treated as a single
access to the web server.

Florin Radulescu, Note de curs

37 DMDW-9
Example
• 78.96.80.227 - - [20/Nov/2017:00:25:14 +0200] "GET /scoaladevara/ HTTP/1.1" 200 765 "-" "Mozilla/5.0
(Linux; Android 7.0; SM-G930F Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/62.0.3202.84 Mobile Safari/537.36"
• 78.96.80.227 - - [20/Nov/2017:00:25:14 +0200] "GET /scoaladevara/mit.css HTTP/1.1" 200 855
"http://info.cs.pub.ro/scoaladevara/" "Mozilla/5.0 (Linux; Android 7.0; SM-G930F Build/NRD90M)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36"
• 78.96.80.227 - - [20/Nov/2017:00:25:14 +0200] "GET /scoaladevara/stanga.html HTTP/1.1" 200 810
"http://info.cs.pub.ro/scoaladevara/" "Mozilla/5.0 (Linux; Android 7.0; SM-G930F Build/NRD90M)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36"
• 78.96.80.227 - - [20/Nov/2017:00:25:14 +0200] "GET /scoaladevara/sus.html HTTP/1.1" 200 597
"http://info.cs.pub.ro/scoaladevara/" "Mozilla/5.0 (Linux; Android 7.0; SM-G930F Build/NRD90M)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36"
• 78.96.80.227 - - [20/Nov/2017:00:25:14 +0200] "GET /scoaladevara/intre.html HTTP/1.1" 200 428
"http://info.cs.pub.ro/scoaladevara/" "Mozilla/5.0 (Linux; Android 7.0; SM-G930F Build/NRD90M)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36"
• 78.96.80.227 - - [20/Nov/2017:00:25:14 +0200] "GET /scoaladevara/program.html HTTP/1.1" 200 1357
"http://info.cs.pub.ro/scoaladevara/stanga.html" "Mozilla/5.0 (Linux; Android 7.0; SM-G930F
Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36"
• 78.96.80.227 - - [20/Nov/2017:00:25:14 +0200] "GET /scoaladevara/logo.png HTTP/1.1" 200 16696
"http://info.cs.pub.ro/scoaladevara/sus.html" "Mozilla/5.0 (Linux; Android 7.0; SM-G930F Build/NRD90M)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36"
• etc

Florin Radulescu, Note de curs

38 DMDW-9
User identification

User identification. In a web server


environment users can be identified by:
a) Authentication: some sites may be
accessed only by authenticated users and
so the user is known from the beginning to
the end of its activity on that site.

Florin Radulescu, Note de curs

39 DMDW-9
User identification
b) Cookies: In the absence of authentication
facilities, client side cookies may be used.
o A cookie is a unique piece of information
(like a passport) issued by the web server
and sent to the browser and subsequently
used by the browser to access pages on that
web server.
o In that way each cookie identifies a user
session but the cookie can live beyond the
session and be recognized in subsequent
sessions of the same user.
Florin Radulescu, Note de curs

40 DMDW-9
User identification

c) If cookie mechanism is not available, the


users can be identified by their IP address
and user-agent.
o Some problems in this case were listed at
slides 15-16.

Florin Radulescu, Note de curs

41 DMDW-9
Example

 An example is presented in the next


figure. Based on the IP and user agent,
three users can be distinguished:
1. User1 with Chrome/Win7
2. User2 with FireFox/Win7
3. User3 with IE9/WinXP SP1

Florin Radulescu, Note de curs

42 DMDW-9
Web server log

Tim e Client IP Req. URL Ref. URL User Agent


12:55 1.2.3.4 A - Chrome20;Win7
12:59 1.2.3.4 B A Chrome20;Win7
13:04 1.2.3.4 D B Chrome20;Win7
13:10 2.3.4.5 C - IE9;WinXP;SP1
13:13 1.2.3.4 E B Chrome20;Win7
13:14 1.2.3.4 B - FireFox9;Win7
13:15 2.3.4.5 F C IE9;WinXP;SP1
13:16 1.2.3.4 D B FireFox9;Win7
13:17 1.2.3.4 C A Chrome20;Win7
13:18 1.2.3.4 A - Chrome20;Win7
13:19 1.2.3.4 E B FireFox9;Win7
13:20 2.3.4.5 A C IE9;WinXP;SP1
13:21 1.2.3.4 C A Chrome20;Win7
13:22 1.2.3.4 A B FireFox9;Win7
13:23 2.3.4.5 B A IE9;WinXP;SP1
13:24 1.2.3.4 G C Chrome20;Win7
13:25 1.2.3.4 C A FireFox9;Win7
13:26 1.2.3.4 B A Chrome20;Win7
13:28 1.2.3.4 G C FireFox9;Win7
13:31 1.2.3.4 E B Chrome20;Win7

Florin Radulescu, Note de curs

43 DMDW-9
User 1

User1:
Tim e Client IP Req. URL Ref. URL User Agent
12:55 1.2.3.4 A - Chrome20;Win7
12:59 1.2.3.4 B A Chrome20;Win7
13:04 1.2.3.4 D B Chrome20;Win7
13:13 1.2.3.4 E B Chrome20;Win7
13:17 1.2.3.4 C A Chrome20;Win7
13:18 1.2.3.4 A - Chrome20;Win7
13:21 1.2.3.4 C A Chrome20;Win7
13:24 1.2.3.4 G C Chrome20;Win7
13:26 1.2.3.4 B A Chrome20;Win7
13:31 1.2.3.4 E B Chrome20;Win7

Florin Radulescu, Note de curs

44 DMDW-9
User2 and User3

 User2 Tim e
13:14
Client IP
1.2.3.4
Req. URL
B
Ref. URL
-
User Agent
FireFox9;Win7
13:16 1.2.3.4 D B FireFox9;Win7
13:19 1.2.3.4 E B FireFox9;Win7
13:22 1.2.3.4 A B FireFox9;Win7
13:25 1.2.3.4 C A FireFox9;Win7
13:28 1.2.3.4 G C FireFox9;Win7

 User3
Tim e Client IP Req. URL Ref. URL User Agent
13:10 2.3.4.5 C - IE9;WinXP;SP1
13:15 2.3.4.5 F C IE9;WinXP;SP1
13:20 2.3.4.5 A C IE9;WinXP;SP1
13:23 2.3.4.5 B A IE9;WinXP;SP1

Florin Radulescu, Note de curs

45 DMDW-9
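A minimal Python sketch of this heuristic follows (it is not from the original slides and the tuple layout of a hit is an assumption): hits are simply grouped by the pair (IP address, user agent).

from collections import defaultdict

# Simplified hits: (time, client_ip, requested_url, referrer_url, user_agent),
# mirroring the web server log table above (only a few rows shown).
hits = [
    ("12:55", "1.2.3.4", "A", None, "Chrome20;Win7"),
    ("13:10", "2.3.4.5", "C", None, "IE9;WinXP;SP1"),
    ("13:14", "1.2.3.4", "B", None, "FireFox9;Win7"),
    ("13:16", "1.2.3.4", "D", "B",  "FireFox9;Win7"),
]

def identify_users(all_hits):
    """Heuristic user identification: group hits by (IP address, user agent)."""
    users = defaultdict(list)
    for hit in all_hits:
        _time, ip, _url, _ref, agent = hit
        users[(ip, agent)].append(hit)
    return users

for user_key, user_hits in identify_users(hits).items():
    print(user_key, [h[2] for h in user_hits])   # pages requested by each "user"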
Sessionization
 Session identification (sessionization): The
web activity of a user is segmented into sessions.
 As a general idea, a user session begins when the user opens the browser window and ends when that window is closed.
 A user session may span several websites; on each website it is recorded as a session in that server's web log.
 From the point of view of a single web server, only the pageviews from that server are known, and these represent the user session.
Florin Radulescu, Note de curs

46 DMDW-9
Sessionization
 There are several methods to identify user
sessions:
a) Authentication and cookies, discussed earlier.
b) Embedded session IDs: at the beginning of a new
session the server generates a unique session ID.
Web pages are dynamically generated and the ID
is contained in every link, so subsequent hits are
recognized.
c) Software agents: programs loaded into the
browsers that send back usage data.
d) Heuristics: when the above methods are not
available, several heuristics may be used to split the
activity of a user into sessions.
Florin Radulescu, Note de curs

47 DMDW-9
Heuristics
 Some known heuristics for sessionization are:
1. Duration of a session is limited at a given amount
of time (for example 20 minutes)
2. Session ends when the time of stay on a webpage
is above a given amount of time (for example, if
between two successive hits there is more than 20
minutes, a new session begins there)
3. Pageviews in a session are linked. If a pageview is not accessible from an open session, it starts a new session. Note that the same user may have several open sessions at the same time (several different browser windows pointing to the same web server).
Florin Radulescu, Note de curs

48 DMDW-9
Example

Tim e Client IP Req. URL Ref. URL User Agent


12:55 1.2.3.4 A - Chrome20;Win7
12:59 1.2.3.4 B A Chrome20;Win7 Session 1
13:04 1.2.3.4 D B Chrome20;Win7
13:13 1.2.3.4 E B Chrome20;Win7
13:17 1.2.3.4 C A Chrome20;Win7

13:18 1.2.3.4 A - Chrome20;Win7


13:21 1.2.3.4 C A Chrome20;Win7 Session 2
13:24 1.2.3.4 G C Chrome20;Win7
13:26 1.2.3.4 B A Chrome20;Win7
13:31 1.2.3.4 E B Chrome20;Win7

Florin Radulescu, Note de curs

49 DMDW-9
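As an illustration of heuristic 2 above (a new session starts when the idle time between two successive hits of a user exceeds a threshold), here is a minimal Python sketch; the timestamps and the 20-minute threshold are illustrative assumptions, not the data of the example above.

from datetime import datetime, timedelta

# Hypothetical hits of one user (already separated by IP / user agent).
user_hits = [("12:55", "A"), ("12:59", "B"), ("13:04", "D"), ("13:13", "E"),
             ("13:17", "C"), ("13:40", "A"), ("13:43", "C")]

def sessionize(hits, gap_minutes=20):
    """Heuristic 2: start a new session whenever the time between two
    successive hits of the same user exceeds gap_minutes."""
    sessions, current, previous = [], [], None
    for time_str, page in hits:
        t = datetime.strptime(time_str, "%H:%M")
        if previous is not None and t - previous > timedelta(minutes=gap_minutes):
            sessions.append(current)     # close the current session
            current = []
        current.append(page)
        previous = t
    if current:
        sessions.append(current)
    return sessions

print(sessionize(user_hits))   # -> [['A', 'B', 'D', 'E', 'C'], ['A', 'C']]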
Episode identification

 Episode identification: An episode is a


sequence of pageviews in a session that
are related (semantically or functionally).

Florin Radulescu, Note de curs

50 DMDW-9
Path completion
Path completion: Because of the cache implemented by browsers and proxies, some pageviews are not requested from the web server but are served directly from the proxy or browser cache.
In that case the web server log does not contain entries for those pageviews.
The obvious example of this situation is pressing the “Back” button of the browser: in most cases the cached version of the previous page is displayed.
Florin Radulescu, Note de curs

51 DMDW-9
Example
 For the web site structure:
A

B C

D E F G

• For User 3 the real user navigation path was: C F C A B
• The web server log records only pages C F A B, the return from F to C being omitted
Florin Radulescu, Note de curs

52 DMDW-9
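A minimal Python sketch of path completion is shown below (not from the original slides). It assumes that when the referrer of a hit is not the previously viewed page, the user navigated back through cached pages, and it re-inserts the pages between the current position and the referrer; the function name and data layout are assumptions.

def complete_path(hits):
    """hits: list of (page, referrer) pairs in time order, e.g. User 3 above:
    [('C', None), ('F', 'C'), ('A', 'C'), ('B', 'A')].
    If the referrer is not the last viewed page, assume "Back" navigation
    through cached pages and re-insert the skipped pages."""
    path = []
    for page, ref in hits:
        if path and ref is not None and ref != path[-1]:
            backtrack = []
            # walk back through the pages already visited until the referrer
            for visited in reversed(path[:-1]):
                backtrack.append(visited)
                if visited == ref:
                    break
            path.extend(backtrack)       # e.g. re-insert 'C' between 'F' and 'A'
        path.append(page)
    return path

print(complete_path([('C', None), ('F', 'C'), ('A', 'C'), ('B', 'A')]))
# -> ['C', 'F', 'C', 'A', 'B']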
Data integration and event identification

 Data integration and event identification: now


users are identified, sessions are also identified,
but for mining the data some other information
must be integrated from various sources.
 For example in an e-commerce application other
data may be:
User data, if available: age, gender, revenue,
previous products bought, domains of interest,
products visualized in the past, and so on.
Product information: category, price, fabric (for textile
products), fat and sugar (for food products), etc.
Florin Radulescu, Note de curs

53 DMDW-9
Events
 At this moment some pageviews or some successions
of pageviews can be associated with specific events.
 Identifying events adds more semantic to the user
sessions, semantic that may be used in further
analysis process.
 Examples:
o Product view: a pageview where a product is displayed
o Product click-through: when the user clicks on a product to
display more data about it
o Shopping cart change: when a user adds or removes a product in the shopping cart
o Buy: when the shopping cart is validated and the customer finalizes the purchase
Florin Radulescu, Note de curs

54 DMDW-9
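As an illustration, the sketch below maps pageview URLs to such events using URL patterns. It is not from the original slides; the URL patterns are pure assumptions, since they depend entirely on how a concrete e-commerce site builds its URLs.

import re

# Hypothetical URL patterns -> semantic events (assumptions).
EVENT_PATTERNS = [
    (re.compile(r"^/product/\d+$"),         "product view"),
    (re.compile(r"^/product/\d+/details"),  "product click-through"),
    (re.compile(r"^/cart/(add|remove)"),    "shopping cart change"),
    (re.compile(r"^/checkout/confirm"),     "buy"),
]

def label_events(session_pages):
    """Attach an event label to every pageview of a session (None if no match)."""
    labelled = []
    for page in session_pages:
        event = next((name for rx, name in EVENT_PATTERNS if rx.search(page)), None)
        labelled.append((page, event))
    return labelled

print(label_events(["/product/42", "/cart/add?id=42", "/checkout/confirm"]))
# -> [('/product/42', 'product view'), ('/cart/add?id=42', 'shopping cart change'),
#     ('/checkout/confirm', 'buy')]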
Pattern discovery

Summary list of types of algorithms and


methods ([Srivastava et al., 2000]):
Statistical Analysis
Association rules
Clustering
Classification
Sequential patterns discovery
Dependency modeling
Florin Radulescu, Note de curs

55 DMDW-9
Statistical Analysis

Statistical Analysis. This is the simplest and most common way to extract information and knowledge about the visitors of a website.
Statistical methods include computing measures like frequency, mean and median, generating reports with statistical information, and so on.
Some examples were already presented in this lesson (the trafic.ro examples).
Florin Radulescu, Note de curs

56 DMDW-9
Association rules

Association rules. Mining association rules allows relating pages accessed together in the same session.
This is important for marketing purposes,
for future site restructuring and for
generating recommendations.

Florin Radulescu, Note de curs

57 DMDW-9
Clustering
 Clustering. Clustering algorithms can be used for
discovering usage clusters and page clusters.
 In the first case users with similar surfing behavior
are discovered (each cluster contains similar
users).
 This may be used for market segmentation and
personalization.
 In the second case, clusters contain similar web
pages or related based on their content.
 These clusters can be used by the search engines
for better results and also for recommendation
purposes.
Florin Radulescu, Note de curs

58 DMDW-9
Classification

 Classification. Classification algorithms


can be used for segmenting users into several classes or categories.
Suitable types of classification algorithms are decision tree classifiers, Naïve Bayes classifiers, KNN and SVM.

Florin Radulescu, Note de curs

59 DMDW-9
Sequential patterns discovery

 Sequential patterns discovery. The goal


is to find frequent sequential patterns in sessions, so that the presence of some pageviews in a particular order can be used to predict that another pageview will follow.

Florin Radulescu, Note de curs

60 DMDW-9
The GSP algorithm

An example of sequence mining algorithm


is GSP.
GSP stands for Generalized Sequential Pattern algorithm and is inspired by the Apriori algorithm.
The sketch of the algorithm is the following
(see [Wikipedia]):

Florin Radulescu, Note de curs

61 DMDW-9
The GSP algorithm

F1 = the set of frequent 1-sequences;
k = 2;
do while F(k-1) != Null
    Generate the candidate set Ck (the set of candidate k-sequences);
    For all input sequences s in the database D
        Increment the count of every a in Ck such that s supports a
    End For
    Fk = {a in Ck such that its frequency exceeds the threshold}
    k = k + 1;
End do
Result = the union of all Fk (the set of all frequent sequences)

Florin Radulescu, Note de curs

62 DMDW-9
GSP vs. Apriori

 The main difference between GSP and


Apriori is at candidate generation step.
If, for example, there are two frequent 2-sequences, A→B and A→C, the Apriori-style generation produces A→B→C.
In GSP three different candidate sequences are generated: A→B→C, A→C→B and A→(BC).
Florin Radulescu, Note de curs

63 DMDW-9
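The sketch below is a very small Python illustration of the GSP loop above, restricted to sequences of single items (so it does not generate itemset candidates such as A→(BC), and it ignores the time constraints and sliding windows of the full algorithm). Function names and the sample database are assumptions, not course material.

def is_subsequence(pattern, sequence):
    """True if pattern (e.g. ['A', 'B']) occurs in order inside sequence."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def gsp(database, min_support):
    """Level-wise GSP-style search for frequent sequences of single items."""
    items = sorted({item for seq in database for item in seq})
    frequent = {}
    level = [[i] for i in items]                       # candidate 1-sequences
    while level:
        counts = {tuple(c): sum(is_subsequence(c, s) for s in database)
                  for c in level}
        survivors = [list(c) for c, n in counts.items() if n >= min_support]
        for c in survivors:
            frequent[tuple(c)] = counts[tuple(c)]
        # join step: two surviving k-sequences overlapping on k-1 items
        # produce a (k+1)-candidate (simplified single-item join)
        level = [a + [b[-1]] for a in survivors for b in survivors
                 if a[1:] == b[:-1]]
    return frequent

sessions = [['A', 'B', 'D', 'E', 'C'], ['A', 'C', 'B'], ['C', 'F', 'A', 'B']]
print(gsp(sessions, min_support=2))
# e.g. ('A', 'B') and ('A', 'C') come out as frequent 2-sequences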
Dependency modeling

 Dependency modeling. The goal of this


method is to obtain a model for a particular
object (for example a model for
customers) including all significant
dependencies among variable measures
involved.
Cited methods in this area are Hidden
Markov Models and Bayesian Belief
Models.
Florin Radulescu, Note de curs

64 DMDW-9
Summary

This course presented:


Objectives and approaches in weblog mining
Web log formats
Statistical approaches
Data mining approaches
Next week: Data warehousing - introduction

Florin Radulescu, Note de curs

65 DMDW-9
References
[Liu 11] Bing Liu, Web Data Mining, Exploring Hyperlinks,
Contents, and Usage Data, Second Edition, Springer, 2011,
chapter 12
[Wikipedia] Wikipedia, the free encyclopedia, en.wikipedia.org
[W3.org 1] Logging Control In W3C httpd, page visited June 1,
2012: http://www.w3.org/Daemon/User/Config/Logging.html
[W3.org 2] Extended Log File Format, page visited June 1,
2012: http://www.w3.org/TR/WD-logfile.html
[Apache.org 1] Apache HTTP Server Version 2.4, Log files,
page visited June 1, 2012:
http://httpd.apache.org/docs/2.4/logs.html

Florin Radulescu, Note de curs

66 DMDW-9
References
[Kosala, Blockeel, 2000] Raymond Kosala, Hendrik Blockeel, Web
Mining Research: A Survey, ACM SIGKDD Explorations Newsletter,
June 2000, Volume 2 Issue 1.
[Cooley et al, 97] Cooley, R.; Mobasher, B.; Srivastava, J.; Web
mining: information and pattern discovery on the World Wide Web.
Tools with Artificial Intelligence, 1997, Ninth IEEE International
Conference.
[da Costa, Gong 2005] Miguel Gomes da Costa Júnior Zhiguo
Gong, Web Structure Mining: An Introduction, Proceedings of the
2005 IEEE International Conference on Information Acquisition June
27 - July 3, 2005, Hong Kong and Macau, China
[Srivastava et al., 2000] J. Srivastava, R. Cooley, M.Deshpande,
P.Tan, Web usage mining: discovery and applications of web usage
patterns from web data, SIGKDD Explorations, Volume 1(2), 2000,
available at http://www.sigkdd.org/explorations/
Florin Radulescu, Note de curs

67 DMDW-9
Data warehousing - introduction

Prof.dr.ing. Florin Radulescu


Universitatea Politehnica din Bucureşti
Road Map

What is a data warehouse


Operational data stores
Data Warehouse Architecture
Summary

Florin Radulescu, Note de curs

2 DMDW-10
Foreword
 The goal of this lesson is to present a
comprehensive introduction to Data warehousing,
with definitions of the main terms used.
 The lesson is a summary of the scientific literature
of the domain, based mainly on the books
published by two authors:
W.H. Inmon, the originator of the term Data
Warehousing
R. Kimball, who developed the dimensional
methodology (known also as Kimball methodology)
which has become a standard in the area of decision
support.
Florin Radulescu, Note de curs

3 DMDW-10
Definitions
Wikipedia:
Data warehouse is a repository of an
organization's electronically stored data.
Data warehouses are designed to facilitate
reporting and analysis.
A data warehouse houses a standardized,
consistent, clean and integrated form of data
sourced from various operational systems in
use in the organization, structured in a way to
specifically address the reporting and analytic
requirements.
Florin Radulescu, Note de curs

4 DMDW-10
Definitions
R. Kimball (see [Kimball, Ross, 2002]):
A data warehouse is a copy of transactional
data specifically structured for querying and
analysis.
According to this definition:
The form of the stored data (RDBMS, flat file) is
not linked with the definition of a data warehouse.
Data warehousing is not linked exclusively with
"decision makers" or used in the process of
decision making.
Florin Radulescu, Note de curs

5 DMDW-10
Definitions
W.H. Inmon (see [Inmon 2002]):
A data warehouse is a:
subject-oriented,
integrated,
nonvolatile,
time-variant
collection of data in support of management’s
decisions.
The data warehouse contains granular
corporate data.
Florin Radulescu, Note de curs

6 DMDW-10
Defintion explained

 The definition provided by W.H. Inmon is


the accepted definition of a data
warehouse: a subject-oriented, integrated,
non-volatile, time-variant collection of data
for supporting management decisions in a
company.
The significance of each component of
this definition is the following:
Florin Radulescu, Note de curs

7 DMDW-10
Subject-oriented

Operational data systems of a company


are organized considering the main
activities, so they are activity-oriented and
not subject oriented.
A classical example in the literature is an
insurance company where the main
activities are auto insurances, health
insurances, life insurances and casualty
insurances.
Florin Radulescu, Note de curs

8 DMDW-10
Subject-oriented
 For each activity there is possibly a separate software system managing data on the main subject areas (policies, customers, claims and premiums), so there may be four separate databases, one for each activity, with similar but not identical structures.
 When uploading data in the company data
warehouse, the data must first be restructured on
these major subject areas, integrating data on
customers, policies, claims and premiums from
each activity (as in the next slide).
Florin Radulescu, Note de curs

9 DMDW-10
Subject-oriented

Florin Radulescu, Note de curs

10 DMDW-10
Subject-oriented

Other examples of major subject areas:


in a production company: product, order, vendor, bill of material, and raw goods.
a retail company: product, stock keeping unit (SKU, i.e. bar code), sale, vendor, etc.

Florin Radulescu, Note de curs

11 DMDW-10
Integrated
 When preparing data for uploading in the data
warehouse, one of the most important activities is
the integration. Data is loaded from operational
sources and must be converted, summarized, re-
keyed, etc., before loading it in the data
warehouse.
 The next slide illustrates some of the most known
actions performed for data integration:
Combine multiple encodings in a single one. For
example, the gender may be encoded as (0, 1), (m, f),
(male, female) in separate operational systems. If (m,
f) is chosen as the data warehouse encoding, all data
encoded using other convention must be converted.
Florin Radulescu, Note de curs

12 DMDW-10
Integrated

Florin Radulescu, Note de curs

13 DMDW-10
Integrated
 Actions performed for data integration – cont.:
Choose a single unit of measure for each piece of information. For example, if length is measured in cm,
inches, yards and meters in different operational
systems, one unit must be chosen for the data
warehouse and all other values must be converted.
If the same object has in some data sources different
values for the same attribute (e.g. description, name,
features, etc), these must be combined in a single
one.
If the same object has different keys in the source
systems it must be re-keyed to have a single key in
the data warehouse.
Florin Radulescu, Note de curs

14 DMDW-10
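A tiny Python sketch of such integration steps is shown below (not from the original slides). The source records, the mapping tables and the re-keying scheme are all illustrative assumptions.

# Hypothetical source records with inconsistent encodings, units and keys.
source_a = [{"cust_id": 17,     "gender": 0,      "height": 70,  "height_unit": "inch"}]
source_b = [{"cust_id": "A-17", "gender": "male", "height": 178, "height_unit": "cm"}]

GENDER_MAP = {0: "m", 1: "f", "male": "m", "female": "f", "m": "m", "f": "f"}
TO_CM = {"cm": 1.0, "m": 100.0, "inch": 2.54, "yard": 91.44}

def integrate(record, source_name):
    """Unify encodings, units of measure and keys before loading into the DW."""
    return {
        # re-keying: qualify the source key so that keys cannot collide
        "customer_key": f"{source_name}:{record['cust_id']}",
        "gender": GENDER_MAP[record["gender"]],        # single (m, f) encoding
        "height_cm": round(record["height"] * TO_CM[record["height_unit"]], 1),
    }

staged = [integrate(r, "A") for r in source_a] + [integrate(r, "B") for r in source_b]
print(staged)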
Non-volatile
In usual operational systems data is updated or deleted to reflect the current values. In a data warehouse data is never updated or deleted: after data is loaded, it stays there for future reporting, like a snapshot reflecting the situation at a certain moment.
Subsequent load operations, instead of changing the old snapshots, are added as new snapshots, so the data warehouse is a sequence of coexisting snapshots.
Florin Radulescu, Note de curs

15 DMDW-10
Non-volatile

Florin Radulescu, Note de curs

16 DMDW-10
Non-volatile

In this way the data warehouse contains not


the operational data at a given moment but
all the history of operational data.
Because of this lack of change, once loaded,
data in a data warehouse may be considered
as read-only.

Florin Radulescu, Note de curs

17 DMDW-10
Time variant
As described above, a data warehouse
contains a sequence of snapshots, each
snapshot being actual at a given moment of
time.
Because a DW contains the whole history of
a company, it is possible to retrieve
information in a time horizon of 5-10 years or
even more.
Each unit of information is stamped or linked
with the moment during which that
information was accurate.
Florin Radulescu, Note de curs

18 DMDW-10
Time variant

Florin Radulescu, Note de curs

19 DMDW-10
Time variant
In an operational system only the current
data is kept. For example, if a customer
changes address, in the operational system
old address is replaced (update) with the new
one.
In the data warehouse all successive
addresses of a customer are kept.
Because date and time are very important in
analyzing data and reporting, the key
structure contains usually the date and
sometimes the time.
Florin Radulescu, Note de curs

20 DMDW-10
Why building a DW?
In [Kimball, Ross, 2002] there is a list of reasons for a
company to build its own data warehouse:
 “We have mountains of data in this company, but we
can’t access it.”
 “We need to slice and dice the data every which way.”
 “You’ve got to make it easy for business people to get
at the data directly.”
 “Just show me what is important.”
 “It drives me crazy to have two people present the
same business metrics at a meeting, but with different
numbers.”
 “We want people to use information to support more
fact-based decision making.”
Florin Radulescu, Note de curs

21 DMDW-10
Requirements for a DW

Also [Kimball & Ross 2002] lists the


demands that must be met by a data
warehouse in order to be productive and
to return the investment.
(Building a DW in a company is not always
a cheap operation)

Florin Radulescu, Note de curs

22 DMDW-10
Information must be easy accessible

 DW content must be understandable.


 DW content must be intuitive or obvious to the non-
database specialists, because they are the key users
of the system.
 Names must be meaningful (for data categories, features, attributes and so on), so the structure of the DW must be understandable for a non-specialist user.
 The DW must provide options for combining data in
the DW, the process being known and referred to as
slicing and dicing.
 The methods and tools for accessing data in the data
warehouse must be simple, easy to use, and the
answer must be returned in a short time.
Florin Radulescu, Note de curs

23 DMDW-10
Information must be consistent
 The process of fueling a data warehouse with data contains a preprocessing step, where data is assembled from many sources, cleansed and quality assured. Data is released (published) to the users only when it is fit for usage.
 As described earlier, an integration step is performed when data is loaded from operational sources, unifying encodings, units of measure, keys, names, common values/features, etc.
 Common definitions for the contents of the data
warehouse must be available for DW users.
Florin Radulescu, Note de curs

24 DMDW-10
Flexibility

 A data warehouse must be designed to be


flexible considering the inevitable changes
in computer science and engineering. Its
content must be structured in such a way
that changes in the software and hardware
platform must be possible.
Adding new data, reports, queries must be
possible and must not interfere with
existing ones.
Florin Radulescu, Note de curs

25 DMDW-10
Security

Because of its confidential content, the


data warehouse must have the means for
rejecting unauthorized access.
Potential leaks of content may be harmful
for the company if competitors have
access to the data in the DW.

Florin Radulescu, Note de curs

26 DMDW-10
Decision support
The primary goal of implementing a data
warehouse in an organization is the decision
support
The ultimate output from a DW is the set of
decisions based on its content, analyzed and
presented in different ways to the decision
makers.
The original label for a data warehouse and
the tools around it was ‘decision support
system’.
Florin Radulescu, Note de curs

27 DMDW-10
Acceptance
 The ultimate test for the success in implementing
a data warehouse is the acceptance test.
 If the business community does not continue to use it in the first six months after training, then the system has failed the acceptance test, no matter how good the technical solution is.
 Ignoring the system is always possible, because decisions can also be made without a decision support system.
 Key point in user acceptance is simplicity and user
friendliness.
Florin Radulescu, Note de curs

28 DMDW-10
Road Map

What is a data warehouse


Operational data stores
Data Warehouse Architecture
Summary

Florin Radulescu, Note de curs

29 DMDW-10
ODS
The concept of Operational Data Store (ODS)
was also introduced by W.H. Inmon and its
definition, found in [Inmon 98] is the following:
An ODS is an integrated, subject-oriented,
volatile (including update), current-valued
structure designed to serve operational users
as they do high performance integrated
processing.
We can compare an ODS with a database
integrating data from multiple sources. Its
goal is to help analysis and reporting.
Florin Radulescu, Note de curs

30 DMDW-10
ODS vs. DW

Source: [Inmon 98]


Florin Radulescu, Note de curs

31 DMDW-10
ODS features
According to Inmon, the main features of an ODS
are:
 enablement of integrated, collective on-line
processing.
 delivers consistent high transaction performance--
two to three seconds.
 supports on-line update.
 is integrated across many applications.
 provides a foundation for collective, up-to-the-second views of the enterprise.
 the ODS supports decision support processing.
Florin Radulescu, Note de curs

32 DMDW-10
Similarities DW - ODS
Subject-oriented data:
 Before data is loaded in the ODS, it must first be
restructured on major subject areas (as in the
case of insurance company: integrating data on
customers, policies, claims and premiums from
each activity).
Integrated content:
 Data is sourced from multiple operational systems
(sources), and the integration step includes, like in
DW case, cleaning, unifying encodings, re-keying,
removing redundancies, preserving integrity, etc.
Florin Radulescu, Note de curs

33 DMDW-10
Dissimilarities DW - ODS

Its content is volatile (or updateable):


In an ODS data is updated; it behaves like a transaction processing system. Limited or no history is maintained.
Its content is not time-variant (or
current):
An ODS is designed to contain limited
history, containing ‘real time’ or ‘near real
time’ data.
Florin Radulescu, Note de curs

34 DMDW-10
Road Map

What is a data warehouse


Operational data stores
Data Warehouse Architecture
Summary

Florin Radulescu, Note de curs

35 DMDW-10
DW architecture

Florin Radulescu, Note de curs

36 DMDW-10
DW architecture
The basic elements of a Data Warehouse environment
are:
 Operational Source Systems. These are the source of
the data in the DW, and are placed outside of the data
warehouse
 Data Staging Area. Here data is prepared
(transformed) for loading in the presentation area. This
area is not accessible to the regular user.
 Data Presentation. This part is what regular users see
and consider to be a DW.
 Data Access Tools. These tools are used for analyzing
and reporting. They provide the interface between the
user and the DW.
Florin Radulescu, Note de curs

37 DMDW-10
Data staging area
 The data staging area (DSA) of a data
warehouse is compared in [Kimball, Ross, 2002]
with the kitchen of a restaurant. It is:
 A storage area and
 A set of processes performing the so-called
Extract-Transform-Load (ETL) operation:
Extract – Extracting data from Operational Source
Systems
Transform – Integrating data from all sources, as
described below
Load – Publishing data for users, meaning loading
data in the Data presentation area
Florin Radulescu, Note de curs

38 DMDW-10
Integration tasks
 Dealing with synonyms: same data with different
name in different operational systems
 Dealing with homonyms: same name for different data
 Unifying keys from different sources
 Unifying encodings
 Unifying unit measures and levels of detail
 Dealing with different software platforms
 Dealing with missing data
 Dealing with different value ranges, etc.
Florin Radulescu, Note de curs

39 DMDW-10
Data staging area

DSA contains everything between the


operational source systems and the data
presentation area.
As we said earlier, this area is not
accessible to the regular users of the data
warehouse.

Florin Radulescu, Note de curs

40 DMDW-10
Main approaches
 Storing data in a DW (so also in DSA) may be
done following two main approaches:
1. The normalized approach (supported by the work
of W.H. Inmon – see [Inmon 2002]
2. The dimensional approach (supported by the work
of Ralph Kimball – see [Kimball, Ross, 2002])
 These approaches are not mutually exclusive, and
there are other approaches.
 Dimensional approaches can involve normalizing
data to a degree.
 This lesson is based on the dimensional approach
Florin Radulescu, Note de curs

41 DMDW-10
Normalized approach
 In the normalized approach, data are
stored following database normalization
rules.
Tables are grouped by subject areas (data on
customers, policies, claims and premiums for
example).
The main advantage of this approach is that
loading data is straightforward because the
philosophy of structuring data is the same for
operational source systems and the data
warehouse.
Florin Radulescu, Note de curs

42 DMDW-10
Normalized approach
 The main disadvantage of this approach is the
number of joins needed to obtain meaningful
information.
 A regular user also needs a good knowledge of the data in the DW, plus a training period for obtaining de-normalized tables from normalized ones.
 Missing a join condition when performing a query may lead to Cartesian products instead of joins. In other words, a regular user may need assistance from a database specialist to perform usual operations.
Florin Radulescu, Note de curs

43 DMDW-10
Dimensional approach

In a dimensional approach, data are


partitioned in two main categories:
Facts [fapte, masuri] (numeric transaction
data). In a retail example, the fact table
contains quantity sold, total price, total cost,
total gross profit.
Dimensions [dimensiuni] (standardized
contexts for facts). In a retail example,
dimensions may be: product, date, time,
location, customer, salesperson, etc.
Florin Radulescu, Note de curs

44 DMDW-10
Dimensional approach
• Advantages of the dimensional approach
are:
– Data is easy to understand, easy to use, no need
for assistance from a database specialist, speed
in solving queries.
– Because data is de-normalized (or partially de-normalized), the number of joins needed for performing a query is lower than in the normalized approach.
– Joins between the fact table and its dimensions are easy to perform because the fact table contains the surrogate keys of all involved dimension tables.
Florin Radulescu, Note de curs

45 DMDW-10
Dimensional approach

Disadvantages of dimensional approach:


The ETL process is harder to perform because of the different philosophies of structuring data in the operational systems and the data warehouse: the transform and load steps are more complicated than in the normalized approach.
A second disadvantage is that it is more difficult to modify the data warehouse scheme when the company changes its way of doing business.
Florin Radulescu, Note de curs

46 DMDW-10
Data presentation area
 At the end of the ETL process prepared data is
loaded in the Data Presentation Area (DPA).
 After that moment, data is available to users for querying, reporting and other analytical applications.
 Because regular users have access only to that
area, they may consider the presentation area as
being the data warehouse.
 This area is structured as a series of integrated
data marts, each presenting the data from a
single business process.
Florin Radulescu, Note de curs

47 DMDW-10
Data presentation area

In the DPA data is stored, presented, and


accessed in dimensional schemas.
We can imagine a hypercube with edges
labeled with the dimensions, e.g.
customer, product and time.

Florin Radulescu, Note de curs

48 DMDW-10
Hypercube

Source: [Rainardi 2008]


Florin Radulescu, Note de curs

49 DMDW-10
Data marts – Definition 1
[SQLServer 2005]:
 A data mart is defined as a repository of data
gathered from operational data and other sources
that is designed to serve a particular community of
knowledge workers.
 Data may derive from an enterprise-wide database
or data warehouse or be more specialized.
 The emphasis of a data mart is on meeting the
specific demands of a particular group of
knowledge users in terms of analysis, content,
presentation, and ease-of-use.
Florin Radulescu, Note de curs

50 DMDW-10
Data marts – Definition 2
[Wikipedia] defines a data mart as a
structure / access pattern specific to data
warehouse environments, used to retrieve
client-facing data.
The data mart is a subset of the data
warehouse and is usually oriented to a
specific business line or team.
Whereas data warehouses have an
enterprise-wide depth, the information in data
marts pertains to a single department.
Florin Radulescu, Note de curs

51 DMDW-10
Data marts – Definition 2
In some deployments, each department or
business unit is considered the owner of its
data mart, including all the hardware, software and data. This enables each department to isolate the use, manipulation and development of their data.
In other deployments where conformed
dimensions are used, this business unit
ownership will not hold true for shared
dimensions like customer, product, etc.
Florin Radulescu, Note de curs

52 DMDW-10
DW vs. Data marts

Data warehouse:
Holds multiple subject areas
Holds very detailed information
Works to integrate all data sources
Does not necessarily use a dimensional
model but feeds dimensional models.

Florin Radulescu, Note de curs

53 DMDW-10
DW vs. Data marts
Data mart:
Often holds only one subject area- for
example, Finance, or Sales
May hold more summarized data (although
may hold full detail)
Concentrates on integrating information from
a given subject area or set of source systems
Is built focused on a dimensional model using
a star schema.
Florin Radulescu, Note de curs

54 DMDW-10
Other data mart features

 Data contained is detailed, atomic data.


This is necessary for evaluating ad hoc
user queries, not covered by the pre-
defined queries or other options of the
tools used in accessing data.
Data marts also contain summary data, obtained via aggregation, used for performance (speed) enhancement.
Florin Radulescu, Note de curs

55 DMDW-10
Other data mart features
Data marts use common dimensions and
facts.
Kimball refers to them as ‘conformed’.
This means for example that the same date
dimension is used in all data marts, and in all
star schemes of the DW, if the significance is
the same for all cases.
Because data marts use conformed
dimensions and facts, they can be combined
and used together
Florin Radulescu, Note de curs

56 DMDW-10
So

We can say that a data warehouse is


created from the union of its
organizational data marts (Kimball)

Florin Radulescu, Note de curs

57 DMDW-10
Examples

A large enterprise data warehouse will


consist of 20 or more very similar-looking
data marts, with similar dimensional
models.
Each data mart may contain several fact
tables, each with 5 to 15 dimension tables.
Many of these dimension tables will be
shared between several fact tables.
Florin Radulescu, Note de curs

58 DMDW-10
Data access tools

Almost all DW regular users (80% to 90%)


will access the data via some prebuilt
parameter-driven analytic applications.
Generally a user has four channels to
interact with a DW:
Ad-hoc query tools.
Report writers.
Analytic applications.
Modeling tools.
Florin Radulescu, Note de curs

59 DMDW-10
Ad-hoc query tools
Through this channel the user obtains the raw data satisfying the conditions specified in the ad-hoc query.
To use this channel the user must have a good knowledge of the DW structure and of the query language used.
This channel is for specialists and
experienced users.
Sometimes there are some pre-built queries
that may be used.
Florin Radulescu, Note de curs

60 DMDW-10
Report writers

This channel is at the same level as the


first one.
Raw data is presented as a report.
Usually there are several pre-built reports that users may run without knowledge of the DW structure and query language.
Building new reports may require extra skills.
Florin Radulescu, Note de curs

61 DMDW-10
Analytic applications

In this category there are:


interactive reports,
dashboards,
scorecards, and
other reporting tools allowing users to access and analyze data in the DW.

Florin Radulescu, Note de curs

62 DMDW-10
Example: Interactive report
[screenshot]

Florin Radulescu, Note de curs

63 DMDW-10
Example: Dashboard
[screenshot]

Florin Radulescu, Note de curs

64 DMDW-10
Example: Scorecard
[screenshot]

Florin Radulescu, Note de curs

65 DMDW-10
Example: Other tools
[screenshot]

Florin Radulescu, Note de curs

66 DMDW-10
Modeling tools

In this category can be mentioned data


mining products, forecasting and scoring
tools.
At this level the result is not only a
sophisticated report on existing data but
also extracted new knowledge, models for
forecasting and other outputs providing
new knowledge to the user.
Florin Radulescu, Note de curs

67 DMDW-10
Summary
 This course presented:
 Some definitions of a with data warehouse and a detailed
discussion based on Inmon definition, explaining what
means the four features of a DW: subject-oriented,
integrated, non-volatile and time-variant. Some reasons for
building a data warehouse are also discussed.
 A definition of the concept of Operational data store with a
parallel between ODS and DW
 A discussion about the architecture of a DW presenting the
Data Stage Area, Data presentation Area and Data Access
Tools, the main parts of such a construction.
 Next week: Dimensional modeling

Florin Radulescu, Note de curs

68 DMDW-10
References
[Inmon 2002] W.H. Inmon - Building The Data Warehouse. Third
Edition, Wiley & Sons, 2002
[Kimball, Ross, 2002] Ralph Kimball, Margy Ross - The Data
Warehouse Toolkit, Second Edition, Wiley & Sons, 2002
[CS680, 2004] Introduction to Data Warehouses, Drexel Univ. CS 680
Course notes, 2004 (page
https://www.cs.drexel.edu/~dvista/cs680/2.DW.Overview.ppt visited
2010)
[Wikipedia] Wikipedia, the free encyclopedia, en.wikipedia.org, visited
2009.
[SQLServer 2005] Dan Gallagher, Tim D. Nelson, and Steve Proctor,
Data mart, nov. 2005, Site:
http://searchsqlserver.techtarget.com/definition/data-mart, visited June
20, 2012
[Inmon, 98] W.H. Inmon - The Operational Data Store, July 1, 1998,
web page visited June 20, 2012: http://www.information-
management.com/issues/19980701/469-1.html
[Rainardi, 2008] Vincent Rainardi, Building a Data Warehouse with
Examples in SQL Server, Springer, 2008
Florin Radulescu, Note de curs

69 DMDW-10
Dimensional Modeling

Prof.dr.ing. Florin Radulescu


Universitatea Politehnica din Bucureşti
Road Map

Facts and dimensions


Steps in dimensional modeling
Modeling example
Summary

Florin Radulescu, Note de curs

2 DMDW-11
Facts and dimensions

In the previous lesson we saw that:


A large enterprise data warehouse will
consist of 20 or more very similar-looking
data marts, with similar dimensional
models.
Each data mart may contain several fact
tables, each with 5 to 15 dimension tables.
Many of these dimension tables will be
shared between several fact tables.
Florin Radulescu, Note de curs

3 DMDW-11
Facts

A fact table represents a business process


and contains the values for the main
measurements describing that process.
For a sale business process, this table
contains for example quantity sold, total
price, total cost, total gross profit.
The attributes of a fact table may be
additive, semi-additive or non-additive
Florin Radulescu, Note de curs

4 DMDW-11
Additive measures

Additive measures can be aggregated


across all dimensions.
All examples in the previous slide are
additive.
For example, SUM(total price) is
meaningful on all dimensions: time,
location, store, customer, etc.

Florin Radulescu, Note de curs

5 DMDW-11
Semi-Additive
 Semi-Additive measures can be aggregated
across some dimensions but not all.
 Typical examples are periodic measurements: the account balance for a bank account or the inventory level for a retail chain.
 In the first case an average may be computed to obtain the average daily balance, but the sum of daily balances is not meaningful.
 In the second case, the inventory level is additive on product and warehouse but not across time: the sum of yesterday's and today's inventory levels for a given product is not a meaningful value.
Florin Radulescu, Note de curs

6 DMDW-11
Non-additive measures
 Non-additive measures cannot be aggregated
across all/any dimension.
 A classical example is the unit price.
 Considering a retail company, the sum of unit
prices along any dimension (product, customer,
location, etc.) is not meaningful.
 For that reason, if these values can be computed based on additive measures, the non-additive measures are not stored in the fact tables.
 For our example, the unit price can always be computed by dividing the total line value by the quantity sold.
Florin Radulescu, Note de curs

7 DMDW-11
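The difference between the three kinds of measures can be illustrated with a few lines of Python (the account facts below are made up):

import statistics

# Daily facts for one bank account: (date, deposits_amount, end_of_day_balance).
facts = [("2012-06-01", 200.0, 1200.0),
         ("2012-06-02",  50.0, 1250.0),
         ("2012-06-03",   0.0, 1250.0)]

deposits = [f[1] for f in facts]
balances = [f[2] for f in facts]

# Additive measure: summing deposits across the time dimension is meaningful.
print("total deposits:", sum(deposits))                    # 250.0

# Semi-additive measure: summing balances over time is NOT meaningful;
# an average (or the last value) is used instead.
print("average daily balance:", statistics.mean(balances))

# A non-additive measure such as unit price is not stored at all:
# it is recomputed from additive measures (e.g. amount / quantity).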
Grain

The level of detail of a record in a fact table is called the “grain” of the table.
Besides business process measurements,
the fact table contains also foreign keys for
all the dimension tables and possibly
some pseudo-foreign keys for some
degenerate dimensions.
More details on this topic in the next
paragraphs of this lesson.
Florin Radulescu, Note de curs

8 DMDW-11
Dimensions
In [CS680, 2004], dimension tables are
characterized as follows:
 Represent the who, what, where, when and how of
a measurement/artifact
 Represent real-world entities not business
processes
 Give the context of a measurement (subject)
Example: in a retail company DW, the Sales fact
table can be linked with dimensions like Location
(Where), Time (When), Product (What), Customer
(Who), Sales Channel (How).
Florin Radulescu, Note de curs

9 DMDW-11
Dimensions
The Dimension Attributes are the columns of
the dimension table. [Wikipedia] lists some
features for these attributes:
Verbose - labels consisting of full words,
Descriptive,
Complete - no missing values,
Discretely valued - only one value per row in
dimensional table,
Quality assured - no misspelling, no
impossible values.
Florin Radulescu, Note de curs

10 DMDW-11
Star scheme

Florin Radulescu, Note de curs

11 DMDW-11
Advantages
 Each fact table is surrounded by several linked
dimension tables, as in Figure 1.
 Because of its appearance, such a construction is
called a ‘star scheme’.
 A star scheme has several advantages:
Is easy to understand. Graphic representations have
almost always this advantage
Provide better performance: data is de-normalized in fact and dimension tables, so obtaining a query result needs only the joins between the fact table and the involved dimensions
Is extensible. Attributes and dimensions may be
added easily
Florin Radulescu, Note de curs

12 DMDW-11
SQL Query
SELECT P.Name, SUM(S.Sales), . . .
FROM Sales S, Date D, Product P, Location L, Promotion R
WHERE S.Date_Id = D.Date_Id                 -- join conditions
  AND S.Product_Id = P.Product_Id
  AND S.Location_Id = L.Location_Id
  AND S.Promotion_Id = R.Promotion_Id
                                            -- additional conditions
  AND D.Month='JUN' AND D.Year='2012' AND L.Country_Name='ROU'
GROUP BY P.Product_Id, P.Name               -- P.Name is needed because
                                            -- it is in the SELECT clause
Florin Radulescu, Note de curs

13 DMDW-11
Snow-flake schemes

Sometimes the dimension tables are


normalized: each dimension is stored as a
set of tables.
In this case, the scheme is called ‘Snow-
flake scheme’

Florin Radulescu, Note de curs

14 DMDW-11
Example

Florin Radulescu, Note de curs

15 DMDW-11
Road Map

Facts and dimensions


Steps in dimensional modeling
Modeling example
Summary

Florin Radulescu, Note de curs

16 DMDW-11
The four step approach

There are four steps in dimensional


modeling design.
These steps must be performed in a particular order, and every revision of a step triggers a review of all subsequent steps.

Florin Radulescu, Note de curs

17 DMDW-11
The four step approach

These four steps are:


1. Select the business processes
2. Declare the grain
3. Choose the dimensions
4. Identify the facts

Florin Radulescu, Note de curs

18 DMDW-11
Select the business processes

 An organization has several departments


and carries out several business
processes.
Selecting the business process does not
refer to one structural department of the
organization but to a process carried out
by one or more departments together.

Florin Radulescu, Note de curs

19 DMDW-11
Select the business processes

In a general company for example some


main business processes are:
Supply chain management
Orders,
Shipments,
Invoicing,
Stocking and inventory
General ledger
Florin Radulescu, Note de curs

20 DMDW-11
No duplicate data
 This approach also ensures that the warehouse contains no duplicate data.
 If a departmental approach to structuring the data warehouse were used, the same data might be needed by several departments and would have to be stored redundantly in the DW.
 For example, inventory data are used for supply chain
management but also for production management in a car
factory.
 A data warehouse organized based on departmental structure
will duplicate inventory data but organizing it on business
processes will avoid redundancy and both departments –
supply management and production – will use the same data.

Florin Radulescu, Note de curs

21 DMDW-11
No duplicate data

No duplicate data means also that:


Consistency is better preserved; redundancy is known for inducing consistency problems (such as the update anomaly).
Data is published once, and this single published view is used for decision support in all departments and activities.

Florin Radulescu, Note de curs

22 DMDW-11
Step 2: Declare the grain
 Each line in a fact table is a grain in our data
warehouse. In step 2 of the dimensional design
process the level of detail for these lines / grains must
be defined.
 Thinking of a retail company with registered users (like Metro or Selgros), for the POS sales business process, a grain may be:
1. An individual line item on a customer’s retail sales
ticket or invoice, as measured by a scanner device
(in that case the same item may be on several lines
in the same ticket/invoice because the quantity was
greater than one and each product was scanned
individually).
Florin Radulescu, Note de curs

23 DMDW-11
Step 2: Declare the grain

Florin Radulescu, Note de curs

24 DMDW-11
Step 2: Declare the grain

2. The same
significance as
above but lines
containing the
same part
number are
summarized in a
single line.
Florin Radulescu, Note de curs

25 DMDW-11
Step 2: Declare the grain
3. A daily reunion of the sales tickets of a customer
containing items and prices.

Florin Radulescu, Note de curs

26 DMDW-11
Step 2: Declare the grain

4. A sales ticket / invoice for a customer. The


same customer may have several sales tickets
each day.

Florin Radulescu, Note de curs

27 DMDW-11
Step 2: Declare the grain

5. A daily summary on the sales tickets of a


customer, containing only total sales amount.

Florin Radulescu, Note de curs

28 DMDW-11
Step 2: Declare the grain

6. A weekly summary on the sales tickets of a


customer, containing only total sales amount.


Florin Radulescu, Note de curs

29 DMDW-11
Discussion

In the first three cases, data on what products each customer bought is preserved in the DW, allowing many more queries and many more ‘slicing and dicing’ actions: reports, analyses, etc.
 In the last three cases this detail is lost and only a part of the above actions can be performed.
Florin Radulescu, Note de curs

30 DMDW-11
Discussion
A key idea in choosing the granularity level is
emphasized in [Kimball, Ross, 2002]:
 “Preferably you should develop dimensional models
for the most atomic information captured by a
business process. Atomic data is the most detailed
information collected; such data cannot be subdivided
further.”
and
 “A data warehouse almost always demands data
expressed at the lowest possible grain of each
dimension not because queries want to see individual
low-level rows, but because queries need to cut
through the details in very precise ways.”
Florin Radulescu, Note de curs

31 DMDW-11
Atomic data features
Some features of atomic data listed in
Kimball & Ross book are:
Is highly dimensional,
Being highly dimensional, data may be drilled in
more ways,
Dimensional approach is favored by atomic data,
each extra dimension being easily added to the
star schemes,
Provides maximum analytic flexibility,
Detailed data allow more ad hoc queries,
Florin Radulescu, Note de curs

32 DMDW-11
Atomic data features

Features – cont.:
A low-level grain does not prohibit also adding summary, high-level grain data in the DW for speeding up frequent queries and reports.
Note that declaring the grain is a critical step.
If the granularity choice later proves to be wrong, the process must go back to step 2 to re-declare the grain correctly, and after that steps 3 and 4 must be run again.
Florin Radulescu, Note de curs

33 DMDW-11
Step 3: Choose the dimensions

Knowing the grain, dimensions can be determined easily: each dimension is a possible description of a fact table line.
Examples of common dimensions used in a sales data warehouse: Product, Customer, Date, Ticket number, Status, Store, Salesperson, Promotion.
Florin Radulescu, Note de curs

34 DMDW-11
Step 3: Choose the dimensions

Dimensions can be found by asking ourselves how we can describe a single line in the fact table.
In the above example, a line in the fact
table represents a single line on a sales
ticket.
This line is about selling a product to a
customer at a given date and in a given
store possibly under a promotion.
Florin Radulescu, Note de curs

35 DMDW-11
Step 3: Choose the dimensions

The line is on a ticket that has a number and a status (for example, paid by credit card) and was made by a particular salesperson.
The number of attributes in a dimension table is not necessarily small.
This lesson presents an example showing that each dimension may have tens of attributes.
Florin Radulescu, Note de curs

36 DMDW-11
Step 4: Identify the facts
 Every line in a fact table must contain some attribute values.
 These attributes represent the measures assigned to the business process and must be determined at this step.
 In the case of a star scheme containing data on POS
retail sales in a store chain, possible attributes of the
fact table are:
 Quantity sold – additive value
 Total line value amount – additive value
 Line cost amount – additive value
 Line profit amount – additive value
 Unit price – not an additive value
Florin Radulescu, Note de curs

37 DMDW-11
Discussion

The product sold, store, date, time, customer,


promotion, sales ticket and other data are also
identified by the linked records in the
corresponding dimension tables.
The fact table stores only the information specific to one association of instances, one from each dimension.

Florin Radulescu, Note de curs

38 DMDW-11
Discussion
 Additive measures are preferred. So unit price, which is not an additive value, will be removed, because it can always be computed by dividing the total line value amount by the quantity sold.
 Redundant data can be stored in a fact table if
they are additive or semi-additive.
 For example, Line profit amount may be computed
by subtracting the cost amount from the value
amount.
 The presence of these redundant values is
allowed for speeding up processing.
Florin Radulescu, Note de curs

39 DMDW-11
Road Map

Facts and dimensions


Steps in dimensional modeling
Modeling example
Summary

Florin Radulescu, Note de curs

40 DMDW-11
Modeling example
 A retail sales modeling example is presented in
[Kimball, Ross, 2002] for a store chain.
 Each store has several departments and sells several tens of thousands of items (called stock keeping units – SKUs).
 Each SKU has either a universal product code
imprinted by the manufacturers or a local code for
bulk goods (for example agricultural products -
vegetables and fruits, meat, bakery, etc.).
 A package variation of a product is another SKU and consequently has a different code.
Florin Radulescu, Note de curs

41 DMDW-11
Modeling example

Some products are sold under some


promotions. There are four types of
promotions in our example:
Temporary price reductions,
Ads in newspapers and newspaper inserts,
Displays in the store (end-aisle displays
included),
Coupons.

Florin Radulescu, Note de curs

42 DMDW-11
Step 1

Step 1: Select the business process


The most important business process in a
retail company is customer purchases as
captured by the POS system, so the
business process modeled is “POS retail
sales”.

Florin Radulescu, Note de curs

43 DMDW-11
Step 2

Step 2: Declare the grain


As seen earlier, in dimensional modeling it is preferable to store atomic information: data collected at the POS location, not summarized or aggregated data based on POS transactions.
So the grain in this modeling example will
be an individual line on the sales ticket
generated by the POS.
Florin Radulescu, Note de curs

44 DMDW-11
Step 3

Step 3: Choose the dimensions


As presented earlier, a line on a sales
ticket is about selling a product to a
customer at a given date and in a given
store possibly under a promotion.
The line is on a ticket having a number
and is made by a particular salesperson.

Florin Radulescu, Note de curs

45 DMDW-11
Step 3 – star scheme

[Star scheme diagram]
Fact table POS_Sales: Product_Key (FK), Date_Key (FK), Store_Key (FK), SP_Key (FK), Promotion_Key (FK), Ticket_number (FK), fact table attributes.
Dimension tables: Product (Product_Key PK, product attributes), Date (Date_Key PK, date attributes), Store (Store_Key PK, store attributes), Salesperson (SP_Key PK, SP attributes), Promotion (Promotion_Key PK, promotion attributes).

Florin Radulescu, Note de curs

46 DMDW-11
Step 3 - details

From this definition of the grain, the


dimensions are:
Product
Date
Store
Promotion
Salesperson
Sales ticket
Florin Radulescu, Note de curs

47 DMDW-11
Degenerate dimensions

The sales ticket is a so-called “degenerate


dimension”.
Such degenerate dimensions come from
operational control numbers: ticket
number, order number, invoice number,
and so on.
These dimensions are empty - without
other attributes.
Florin Radulescu, Note de curs

48 DMDW-11
Degenerate dimensions

No associated dimension table is present in the


star scheme, but they are necessary in some
queries, for example for finding products sold
together in the same sales basket.
For these dimensions only the pseudo-foreign
key associated with the dimension is present as
attribute in the fact table.
Every dimension has a surrogate primary key
and this key is also contained as foreign key in
the fact table.
Florin Radulescu, Note de curs

49 DMDW-11
Step 4: Identify the facts

The candidate fact attributes:


Quantity_sold – additive value
Line_amount – additive value
Cost_amount – additive value
Profit amount – additive value
Unit_price – not an additive value

Florin Radulescu, Note de curs

50 DMDW-11
Star scheme again

[Star scheme diagram with fact attributes]
Fact table POS_Sales: Product_Key (FK), Date_Key (FK), Store_Key (FK), SP_Key (FK), Promotion_Key (FK), Ticket_number (FK), Quantity_sold, Line_amount, Cost_amount, Profit_amount.
Dimension tables: Product (Product_Key PK, product attributes), Date (Date_Key PK, date attributes), Store (Store_Key PK, store attributes), Salesperson (SP_Key PK, SP attributes), Promotion (Promotion_Key PK, promotion attributes).

Florin Radulescu, Note de curs

51 DMDW-11
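To make the scheme tangible, here is a minimal sketch that creates it with SQLite from Python and runs one star-join query. Only a few attributes per dimension are kept, the Salesperson and Promotion dimensions are omitted for brevity, and the sample rows are invented; the Date dimension is named Date_Dim only to avoid any clash with SQL keywords.

import sqlite3

ddl = """
CREATE TABLE Product   (Product_Key INTEGER PRIMARY KEY, Product_Description TEXT, Brand TEXT);
CREATE TABLE Date_Dim  (Date_Key INTEGER PRIMARY KEY, Full_Date TEXT, Month TEXT, Year INTEGER);
CREATE TABLE Store     (Store_Key INTEGER PRIMARY KEY, Store_Name TEXT, Country TEXT);
CREATE TABLE POS_Sales (
    Product_Key   INTEGER REFERENCES Product(Product_Key),
    Date_Key      INTEGER REFERENCES Date_Dim(Date_Key),
    Store_Key     INTEGER REFERENCES Store(Store_Key),
    Ticket_Number INTEGER,            -- degenerate dimension: no dimension table
    Quantity_Sold INTEGER,
    Line_Amount   REAL,
    Cost_Amount   REAL,
    Profit_Amount REAL
);
"""

con = sqlite3.connect(":memory:")
con.executescript(ddl)
con.executemany("INSERT INTO Product VALUES (?,?,?)",
                [(1, "Baked beans 400g", "BrandX"), (2, "Mineral water 2l", "BrandY")])
con.executemany("INSERT INTO Date_Dim VALUES (?,?,?,?)",
                [(20120601, "2012-06-01", "JUN", 2012)])
con.executemany("INSERT INTO Store VALUES (?,?,?)",
                [(10, "Downtown store", "ROU")])
con.executemany("INSERT INTO POS_Sales VALUES (?,?,?,?,?,?,?,?)",
                [(1, 20120601, 10, 555, 2, 7.0, 5.0, 2.0),
                 (2, 20120601, 10, 555, 1, 3.0, 2.0, 1.0)])

# A typical star join: total sales per product, June 2012, Romanian stores.
for row in con.execute("""
        SELECT P.Product_Description, SUM(S.Line_Amount)
        FROM POS_Sales S
        JOIN Product  P ON S.Product_Key = P.Product_Key
        JOIN Date_Dim D ON S.Date_Key    = D.Date_Key
        JOIN Store    T ON S.Store_Key   = T.Store_Key
        WHERE D.Month = 'JUN' AND D.Year = 2012 AND T.Country = 'ROU'
        GROUP BY P.Product_Key, P.Product_Description"""):
    print(row)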
Discussion

Because Unit_price is not additive, this attribute will not be stored in the fact table, but it can always be computed by dividing Line_amount by Quantity_sold.
The same applies to percentages and ratios (almost all are non-additive values): in these cases the numerator and the denominator must be stored in the fact table.
Florin Radulescu, Note de curs

52 DMDW-11
Date attributes example

Date
Date Key (PK) Calendar Quarter
Date Calendar Year-Quarter
Full Date Description Calendar Half Year
Day of Week Calendar Year
Day Number in Epoch Fiscal Week
Week Number in Epoch Fiscal Week Number in Year
Month Number in Epoch Fiscal Month
Day Number in Calendar Month Fiscal Month Number in Year
Day Number in Calendar Year Fiscal Year-Month
Day Number in Fiscal Month Fiscal Quarter
Day Number in Fiscal Year Fiscal Year-Quarter
Last Day in Week Indicator Fiscal Half Year
Last Day in Month Indicator Fiscal Year
Calendar Week Ending Date Holiday Indicator
Calendar Week Number in Year Weekday Indicator
Calendar Month Name Selling Season
Calendar Month Number in Year Major Event
Calendar Year-Month (YYYY-MM) SQL Date Stamp

Florin Radulescu, Note de curs

53 DMDW-11
Product attributes example

Product
Product Key (PK) Product Description
SKU Number (Natural Key) Brand Description
Category Description Department Description
Package Type Description Package Size
Fat Content Diet Type
Weight Weight Units of Measure
Storage Type Shelf Life Type
Shelf Width Shelf Height
Shelf Depth

Florin Radulescu, Note de curs

54 DMDW-11
Store attributes example

Store
Store Name Store Region
Store Number (Natural Key) Floor Plan Type
Store Street Address Photo Processing Type
Store City Financial Service Type
Store County Selling Square Footage
Store State Total Square Footage
Store Zip Code First Open Date
Store Manager Last Remodel Date
Store District

Florin Radulescu, Note de curs

55 DMDW-11
Promotion attributes example

Promotion
Promotion Key (PK) Coupon Type
Promotion Name Ad Media Name
Price Reduction Type Display Provider
Promotion Media Type Promotion Cost
Ad Type Promotion Begin Date
Display Type Promotion End Date

Florin Radulescu, Note de curs

56 DMDW-11
Summary
This course presented the dimensional model
of data warehouses:
Definitions for facts and dimensions, definitions
for star scheme and snow-flake scheme.
The four steps in dimensional modeling: identify
the business process, declare the grain, choose
dimensions and identify the facts
A modeling example for a sales chain with
illustration of attributes in fact and dimension
tables
Next week: Data warehouse case study
Florin Radulescu, Note de curs

57 DMDW-11
References
[CS680, 2004] Introduction to Data Warehouses, Drexel Univ. CS
680 Course notes, 2004 (page
https://www.cs.drexel.edu/~dvista/cs680/2.DW.Overview.ppt
visited 2010)
[Kimball, Ross, 2002] Ralph Kimball, Margy Ross - The Data
Warehouse Toolkit, Second Edition, Wiley & Sons, 2002
[Wikipedia] Wikipedia, the free encyclopedia, en.wikipedia.org

Florin Radulescu, Note de curs

58 DMDW-11
Dimensional Modeling – part 2
Case Studies
Prof.dr.ing. Florin Radulescu
Universitatea Politehnica din Bucureşti
Road Map

❑Types of Dimensional Models


❑Surrogate keys
❑Conformed Dimensions
❑Summary

Florin Radulescu, Note de curs

2 DMDW-12
Types of Dimensional Models
❑Five distinct types of Dimensional Models are discussed in [3]. In the next slides a Dimensional Model is either a star scheme or a data mart (several interconnected star schemes):
1. Accumulating Snapshot Tables
2. Aggregate Tables
3. Fact Tables
4. Factless Fact Tables
5. Snapshot Tables
Florin Radulescu, Note de curs

3 DMDW-12
Types of Dimensional Models

1. Accumulating Snapshot Tables


2. Aggregate Tables
3. Fact Tables
4. Factless Fact Tables
5. Snapshot Tables

Florin Radulescu, Note de curs

4 DMDW-12
Accumulating Snapshot Tables

❑“A row in an accumulating snapshot fact


table summarizes the measurement events
occurring at predictable steps between the
beginning and the end of a process.
❑Pipeline or workflow processes, such as
order fulfillment or claim processing, that
have a defined start point, standard
intermediate steps, and defined end point can
be modeled with this type of fact table.
Florin Radulescu, Note de curs

5 DMDW-12
Accumulating Snapshot Tables

❑There is a date foreign key in the fact table


for each critical milestone in the process.
❑An individual row in an accumulating
snapshot fact table, corresponding for
instance to a line on an order, is initially
inserted when the order line is created.
❑As pipeline progress occurs, the
accumulating fact table row is revisited
and updated.
Florin Radulescu, Note de curs

6 DMDW-12
Accumulating Snapshot Tables
❑ This consistent updating of accumulating snapshot
fact rows is unique among fact tables.
❑ In addition to the date foreign keys associated with
each critical process step, accumulating snapshot
fact tables contain foreign keys for other
dimensions and optionally contain degenerate
dimensions.
❑ They often include numeric lag measurements
consistent with the grain, along with milestone
completion counters.” (source: [2])

Florin Radulescu, Note de curs

7 DMDW-12
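A minimal sketch of such a table, for an order fulfillment pipeline. All table and column names are hypothetical illustrations of the pattern quoted above (one date key per milestone, lag measures, and rows that are revisited and updated):

-- Hedged sketch of an accumulating snapshot fact table (hypothetical names).
CREATE TABLE order_fulfillment_fact (
  order_number          VARCHAR2(20) NOT NULL,  -- degenerate dimension
  product_key           INTEGER NOT NULL,
  customer_key          INTEGER NOT NULL,
  order_date_key        INTEGER NOT NULL,       -- one date FK per milestone
  shipment_date_key     INTEGER,                -- NULL until the step occurs
  delivery_date_key     INTEGER,
  order_to_ship_lag     NUMBER,                 -- lag measurement (days)
  ship_to_delivery_lag  NUMBER,
  quantity_ordered      NUMBER,
  quantity_shipped      NUMBER
);

-- As the pipeline progresses, the existing row is revisited and updated:
UPDATE order_fulfillment_fact
SET    shipment_date_key = :ship_date_key,
       order_to_ship_lag = :days_from_order_to_ship,
       quantity_shipped  = :qty_shipped
WHERE  order_number = :order_number;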
Accumulating Snapshot Tables

❑Modeling the predefined steps may be
done using a “Stage or Status Dimension”,
as described in [3]. This way we track
progress not only in time but also in steps.
❑The example on the next slide presents a
student registration star schema where
progress is modeled by the
Ref_Registration_Stage dimension.
Florin Radulescu, Note de curs

8 DMDW-12
[Figure: student registration star schema with the Ref_Registration_Stage dimension (source: [3])]

Florin Radulescu, Note de curs

9 DMDW-12
Accumulating Snapshot Tables

❑In addition to the Dimension that models


the stage, there are 10 other Dimensions
whose foreign keys are present in the Fact
table and an additional intersection table
that models the relationships between the
various pairs of students
(Students_Relationships).

Florin Radulescu, Note de curs

10 DMDW-12
Types of Dimensional Models

1. Accumulating Snapshot Tables


2. Aggregate Tables
3. Fact Tables
4. Factless Fact Tables
5. Snapshot Tables

Florin Radulescu, Note de curs

11 DMDW-12
Aggregate Tables

❑“Aggregate fact tables are simple numeric


rollups of atomic fact table data built solely
to accelerate query performance.
❑These aggregate fact tables should be
available to the Business Intelligence (BI)
layer at the same time as the atomic fact
tables so that BI tools smoothly choose
the appropriate aggregate level at query
time.
Florin Radulescu, Note de curs

12 DMDW-12
Aggregate Tables

❑This process, known as aggregate


navigation, must be open so that every
report writer, query tool, and BI application
harvests the same performance benefits.
❑A properly designed set of aggregates
should behave like database indexes,
which accelerate query performance but
are not encountered directly by the BI
applications or business users.
Florin Radulescu, Note de curs

13 DMDW-12
Aggregate Tables

❑Aggregate fact tables contain foreign keys to


shrunken conformed dimensions, as well as
aggregated facts created by summing
measures from more atomic fact tables.
❑Finally, aggregate OLAP cubes with
summarized measures are frequently built in
the same way as relational aggregates, but
the OLAP cubes are meant to be accessed
directly by the business users.” (source: [2])
Florin Radulescu, Note de curs

14 DMDW-12
Aggregate Tables

❑In [4] there is an example of three such


Aggregate Fact Tables which summarize
sales by days, weeks and months:
❑POS_DAY
❑POS_WEEK
❑POS_MONTH
❑The facts are: SALES_UNIT,
SALES_RETAIL and GROSS_PROFIT
Florin Radulescu, Note de curs

15 DMDW-12
Aggregate Tables

❑Also, these three Aggregate Fact Tables


need three Dimensions:
❑PERIOD
❑LOCATION
❑PRODUCT
❑In the example the Aggregate Fact Tables
use composite primary keys rather than the
recommended surrogate keys.
Florin Radulescu, Note de curs

16 DMDW-12
Florin Radulescu, Note de curs

17 DMDW-12
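A minimal sketch of one of these aggregate tables (POS_WEEK): the key and fact columns are inferred from the load script shown a few slides further on; the data types and constraint names are assumptions:

-- Hedged sketch of the POS_WEEK aggregate fact table (assumed types/names).
CREATE TABLE pos_week (
  period_id    INTEGER NOT NULL,   -- FK to PERIOD (week grain)
  location_id  INTEGER NOT NULL,   -- FK to LOCATION
  product_id   INTEGER NOT NULL,   -- FK to PRODUCT
  sales_unit   NUMBER,
  sales_retail NUMBER,
  gross_profit NUMBER,
  CONSTRAINT pos_week_pk PRIMARY KEY (period_id, location_id, product_id)
);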
Aggregate Tables

❑The example also shows three scenarios for
maintaining the content of these tables.
❑One of these scenarios is presented on the
next slide.
❑The author says that when using simple
aggregate tables in an Oracle data warehouse,
the best and most obvious choice is a
simple, parallel, direct mode load insert (see
the next slide, source: [4]).

Florin Radulescu, Note de curs

18 DMDW-12
Aggregate Tables – Load example
alter session enable parallel dml;

insert /*+ parallel(aggr,10) append */
into pos_week aggr
select /*+ parallel(fact,10) full(fact) */
       $WEEK_ID, location_id, product_id,
       sum(nvl(sales_unit,0)),
       sum(nvl(sales_retail,0)),
       sum(nvl(gross_profit,0))
from pos_day fact
where period_id between $BEG_ID and $END_ID
group by $WEEK_ID, location_id, product_id;

commit;
Florin Radulescu, Note de curs

19 DMDW-12
Types of Dimensional Models

1. Accumulating Snapshot Tables


2. Aggregate Tables
3. Fact Tables
4. Factless Fact Tables
5. Snapshot Tables

Florin Radulescu, Note de curs

20 DMDW-12
Fact Tables

❑Fact tables have been presented in
previous courses.
❑Here is the definition from the Kimball
Group website ([2]):
❑“Facts are the measurements that result
from a business process event and are
almost always numeric.

Florin Radulescu, Note de curs

21 DMDW-12
Fact Tables

❑A single fact table row has a one-to-one


relationship to a measurement event as
described by the fact table’s grain.
❑Thus a fact table corresponds to a
physical observable event, and not to the
demands of a particular report.

Florin Radulescu, Note de curs

22 DMDW-12
Fact Tables

❑Within a fact table, only facts consistent


with the declared grain are allowed.
❑For example, in a retail sales transaction,
the quantity of a product sold and its
extended price are good facts, whereas
the store manager’s salary is disallowed.”

Florin Radulescu, Note de curs

23 DMDW-12
Types of Dimensional Models

1. Accumulating Snapshot Tables


2. Aggregate Tables
3. Fact Tables
4. Factless Fact Tables
5. Snapshot Tables

Florin Radulescu, Note de curs

24 DMDW-12
Factless Fact Tables
Factless Fact tables are defined in [2] as
follows:
❑“Although most measurement events capture
numerical results, it is possible that the event
merely records a set of dimensional entities
coming together at a moment in time.
❑For example, an event of a student attending
a class on a given day may not have a
recorded numeric fact, but a fact row with
foreign keys for calendar day, student,
teacher, location, and class is well-defined.
Florin Radulescu, Note de curs

25 DMDW-12
Factless Fact Tables

❑Likewise, customer communications are


events, but there may be no associated
metrics.
❑Factless fact tables can also be used to
analyze what didn’t happen.

Florin Radulescu, Note de curs

26 DMDW-12
Factless Fact Tables

❑These queries always have two parts: a


factless coverage table that contains all
the possibilities of events that might
happen and an activity table that contains
the events that did happen.
❑When the activity is subtracted from the
coverage, the result is the set of events
that did not happen.”
Florin Radulescu, Note de curs

27 DMDW-12
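A minimal sketch of the pattern quoted above, using the student attendance example; all table and column names are hypothetical:

-- Hedged sketch: a factless fact table has only dimension foreign keys.
CREATE TABLE class_attendance_fact (
  date_key     INTEGER NOT NULL,
  student_key  INTEGER NOT NULL,
  class_key    INTEGER NOT NULL,
  teacher_key  INTEGER NOT NULL,
  location_key INTEGER NOT NULL   -- no measures: the row itself records the event
);

-- "What didn't happen": coverage (registrations) minus activity (attendance)
-- gives the students who were registered but did not attend on a given day.
SELECT r.student_key, r.class_key
FROM   class_registration_fact r        -- coverage table (hypothetical)
MINUS
SELECT a.student_key, a.class_key
FROM   class_attendance_fact a          -- activity table
WHERE  a.date_key = :day_key;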
Factless Fact Tables

The same thing is defined in [3] as follows:


❑A Factless Fact is one that has no data
associated with it. In other words, it has
Dimensions but no Facts.
❑A common example is an Event, where
the occurrence of the Event is itself a Fact.

Florin Radulescu, Note de curs

28 DMDW-12
Factless Fact Tables

❑An example is given there with a star
schema containing a variation of the
student registration schema presented
earlier.
❑There are no facts in the
Facts_of_Student_Registrations table,
only aggregated values (averages, counts,
totals, etc.).
Florin Radulescu, Note de curs

29 DMDW-12
[Figure: factless fact table example – the Facts_of_Student_Registrations star schema (source: [3])]

Florin Radulescu, Note de curs

30 DMDW-12
Types of Dimensional Models

1. Accumulating Snapshot Tables


2. Aggregate Tables
3. Fact Tables
4. Factless Fact Tables
5. Snapshot Tables

Florin Radulescu, Note de curs

31 DMDW-12
Snapshot Fact Tables

❑A Snapshot Fact Table (or Periodic


Snapshot Fact Table) is defined in [2] as:
❑“A row in a periodic snapshot fact
table summarizes many measurement
events occurring over a standard period,
such as a day, a week, or a month.
❑The grain is the period, not the individual
transaction.
Florin Radulescu, Note de curs

32 DMDW-12
Snapshot Fact Tables

❑Periodic snapshot fact tables often contain


many facts because any measurement
event consistent with the fact table grain is
permissible.
❑These fact tables are uniformly dense in
their foreign keys because even if no
activity takes place during the period, a
row is typically inserted in the fact table
containing a zero or null for each fact.”
Florin Radulescu, Note de curs

33 DMDW-12
Snapshot Fact Tables

❑So Snapshot Fact Tables contain historic
data at periodic intervals, such as day,
week or month.
❑In [3] there is an example of a monthly
snapshot for customers and car parts.

Florin Radulescu, Note de curs

34 DMDW-12
[Figure: monthly snapshot star schema for customers and car parts (source: [3])]

Florin Radulescu, Note de curs

35 DMDW-12
Snapshot Fact Tables

❑In Orders_Monthly_Snapshot we have a
composite primary key:
❑Month_number, marked PF (Primary /
Foreign) because it is part of the primary key
and also a foreign key (it is the primary key of
Ref_Monthly_Calendar), and Fact_id, which
differentiates facts within a month.

Florin Radulescu, Note de curs

36 DMDW-12
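A minimal sketch of this key structure; only Month_number, Fact_id and Ref_Monthly_Calendar come from the example, the measure and remaining columns are assumptions:

-- Hedged sketch of Orders_Monthly_Snapshot with its composite primary key.
CREATE TABLE orders_monthly_snapshot (
  month_number INTEGER NOT NULL,  -- PF: part of the PK, FK to Ref_Monthly_Calendar
  fact_id      INTEGER NOT NULL,  -- differentiates facts within a month
  customer_id  INTEGER,           -- assumed dimension FK
  order_count  NUMBER,            -- assumed periodic measures
  order_value  NUMBER,
  CONSTRAINT orders_monthly_pk PRIMARY KEY (month_number, fact_id),
  CONSTRAINT orders_monthly_cal_fk FOREIGN KEY (month_number)
    REFERENCES ref_monthly_calendar (month_number)
);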
Snapshot Fact Tables

❑Also, Parts_for_Cars represents a many-to-
many relationship, modeled as an
intersection entity.
❑This dimension table gives us the
composition of a car as a set of parts.

Florin Radulescu, Note de curs

37 DMDW-12
Road Map

❑Types of Dimensional Models


❑Surrogate keys
❑Conformed Dimensions
❑Summary

Florin Radulescu, Note de curs

38 DMDW-12
Surrogate keys

❑According to the Webster’s Unabridged


Dictionary, a surrogate is an “artificial or
synthetic product that is used as a
substitute for a natural product.”
❑In [2] the use of surrogate keys in
dimension tables is explained (see more
about surrogate keys in DW at
https://www.kimballgroup.com/1998/05/surrogate-keys/)
Florin Radulescu, Note de curs

39 DMDW-12
Surrogate keys

❑“A dimension table is designed with one


column serving as a unique primary key.
❑This primary key cannot be the operational
system’s natural key because there will be
multiple dimension rows for that natural
key when changes are tracked over time.

Florin Radulescu, Note de curs

40 DMDW-12
Surrogate keys
❑In addition, natural keys for a dimension may
be created by more than one source system,
and these natural keys may be incompatible
or poorly administered.
❑The DW/BI system needs to claim control of
the primary keys of all dimensions; rather
than using explicit natural keys or natural
keys with appended dates, you should create
anonymous integer primary keys for every
dimension.
Florin Radulescu, Note de curs

41 DMDW-12
Surrogate keys

❑These dimension surrogate keys are


simple integers, assigned in sequence,
starting with the value 1, every time a new
key is needed.
❑The date dimension is exempt from the
surrogate key rule; this highly predictable
and stable dimension can use a more
meaningful primary key.”
Florin Radulescu, Note de curs

42 DMDW-12
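A minimal sketch of such an anonymous integer key, assuming an Oracle 12c+ identity column (a sequence would do the same job on older versions); the table and attribute names are assumptions:

-- Hedged sketch: the surrogate key is a meaningless integer assigned in
-- sequence; the natural key (SKU) is kept only as an ordinary attribute.
CREATE TABLE product_dim (
  product_key         INTEGER GENERATED ALWAYS AS IDENTITY (START WITH 1)
                      PRIMARY KEY,
  sku_number          VARCHAR2(20),        -- natural key from the source system
  product_description VARCHAR2(100),
  brand_description   VARCHAR2(50)
);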
Surrogate keys for fact tables

❑In the case of fact tables, surrogate keys
are not needed, except in some cases
where their use can be beneficial.
❑In [5] three such cases are listed, but
using surrogate keys in other situations is
not forbidden; these keys can be helpful in
the ETL process.

Florin Radulescu, Note de curs

43 DMDW-12
Surrogate keys for fact tables

❑Case 1: Sometimes multiple identical rows
(identical even in the values of the natural
key) are allowed to exist in the same fact
table.
❑In these situations a surrogate key is
needed for that fact table (to avoid a PK
constraint violation).

Florin Radulescu, Note de curs

44 DMDW-12
Surrogate keys for fact tables

❑Case 2: When the ETL process needs to
update rows in the fact table. One way to do
that is to insert the updated rows as new
ones and then to delete the old rows (having
the same natural keys as the new ones).
❑In that case, to avoid deleting both the new
and the old rows, different surrogate keys
must be used for the new and the old
versions of a row.

Florin Radulescu, Note de curs

45 DMDW-12
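A minimal sketch of this insert-then-delete pattern; the fact table surrogate key is called fact_sk here, and all table and column names (sales_fact, staging_corrections, sales_fact_seq) are assumptions:

-- Hedged sketch of Case 2: new versions get fresh surrogate keys, then only
-- the old versions (identified by their original surrogate keys) are deleted.
INSERT INTO sales_fact (fact_sk, date_key, product_key, store_key,
                        quantity_sold, line_amount)
SELECT sales_fact_seq.NEXTVAL, date_key, product_key, store_key,
       quantity_sold, corrected_amount
FROM   staging_corrections;

DELETE FROM sales_fact
WHERE  fact_sk IN (SELECT old_fact_sk FROM staging_corrections);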
Surrogate keys for fact tables

❑Case 3: “A similar ETL requirement is to
determine exactly where a load job was
suspended, either to resume loading or
back out the job entirely. A sequentially
assigned surrogate key makes this task
straightforward.”

Florin Radulescu, Note de curs

46 DMDW-12
Surrogate keys summary

❑Surrogate keys:
✓Are mandatory for dimension tables
✓May be used in fact tables
✓There are cases when using surrogate keys
for fact tables is mandatory due to the
characteristics of the data processing

Florin Radulescu, Note de curs

47 DMDW-12
Road Map

❑Types of Dimensional Models


❑Surrogate keys
❑Conformed Dimensions
❑Summary

Florin Radulescu, Note de curs

48 DMDW-12
Conformed Dimensions

❑Speaking about dimension tables, Ralph


Kimball used the term “Conformed
dimensions”. His definition is the following
([2]):
❑“Dimension tables conform when
attributes in separate dimension tables
have the same column names and domain
contents.
Florin Radulescu, Note de curs

49 DMDW-12
Conformed Dimensions

❑Information from separate fact tables can


be combined in a single report by using
conformed dimension attributes that are
associated with each fact table.
❑When a conformed attribute is used as the
row header (that is, the grouping column in
the SQL query), the results from the
separate fact tables can be aligned on the
same rows in a drill-across report.
Florin Radulescu, Note de curs

50 DMDW-12
Conformed Dimensions

❑This is the essence of integration in an


enterprise DW/ BI system.
❑Conformed dimensions, defined once in
collaboration with the business’s data
governance representatives, are reused
across fact tables; they deliver both
analytic consistency and reduced future
development costs because the wheel is
not repeatedly re-created.”
Florin Radulescu, Note de curs

51 DMDW-12
Conformed Dimensions
❑Simply put, if multiple star schemas use the
same dimension table, there will be a single
table in the data warehouse for that
dimension, shared by all the star schemas
that need it.
❑For this reason, the grain must be the same
in all cases.
Florin Radulescu, Note de curs

52 DMDW-12
Conformed Dimensions

❑Conformed Dimensions are therefore very


important and are frequently Reference
Data (such as Calendars) or Master Data
(such as Products).
❑Rule no. 9 defined by Ralph Kimball (see
the 10 rules at [6]) is:
“Create conformed dimensions to integrate
data across the enterprise”
Florin Radulescu, Note de curs

53 DMDW-12
Conformed Dimensions

❑An example of using conformed
dimensions is on the next slide (source:
[3]).
❑The example shows the schema of a
simple data mart containing three star
schemas that share some dimensions.

Florin Radulescu, Note de curs

54 DMDW-12
[Figure: data mart with three star schemas – Ticket_Sales, Restaurant_Orders and Judo_Competition – sharing conformed dimensions (source: [3])]

Florin Radulescu, Note de curs

55 DMDW-12
Conformed Dimensions

❑In the previous figure there are three


dimensions shared between two or three
tables:
▪ Customers, used by Ticket_Sales and
Restaurant_Orders
▪ Ref_Sports, used by Ticket_Sales and
Judo_Competition
▪ Ref_Calendar, used by all three star schemas:
Ticket_Sales, Restaurant_Orders and
Judo_Competition.
Florin Radulescu, Note de curs

56 DMDW-12
Conformed Dimensions

❑As you can see, foreign keys do not always
have the same names as the primary keys
of the related dimensions.
❑For example, in Ticket_Sales,
Event_Start_Time refers to the PK
Day_Date_and_Time from Ref_Calendar.

Florin Radulescu, Note de curs

57 DMDW-12
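A minimal sketch of a drill-across query over this data mart: each fact table is summarized separately by a conformed Ref_Calendar attribute and the results are aligned on the same rows. Ticket_Sales, Restaurant_Orders, Ref_Calendar, Event_Start_Time and Day_Date_and_Time come from the example; the measure columns (ticket_amount, order_amount), the Order_Time foreign key and the Calendar_Month attribute are assumptions:

-- Hedged sketch of a drill-across report on the conformed Ref_Calendar dimension.
SELECT COALESCE(t.calendar_month, r.calendar_month) AS calendar_month,
       t.ticket_revenue,
       r.restaurant_revenue
FROM  (SELECT c.calendar_month, SUM(s.ticket_amount) AS ticket_revenue
       FROM   ticket_sales s
       JOIN   ref_calendar c ON s.event_start_time = c.day_date_and_time
       GROUP  BY c.calendar_month) t
FULL OUTER JOIN
      (SELECT c.calendar_month, SUM(o.order_amount) AS restaurant_revenue
       FROM   restaurant_orders o
       JOIN   ref_calendar c ON o.order_time = c.day_date_and_time
       GROUP  BY c.calendar_month) r
ON     t.calendar_month = r.calendar_month;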
Summary

❑This course presented the second part of
dimensional modeling, together with some
case studies.
❑This is the last course in the series.
❑Next week: exam preparation and project
delivery.

Florin Radulescu, Note de curs

58 DMDW-12
References
1. Kimball group website: https://www.kimballgroup.com/
2. Kimball Group Dimensional Modeling Techniques: https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/
3. Barry Williams - Dimensional Modelling by Example, http://www.databaseanswers.org/downloads/Dimensional_Modelling_by_Example.pdf, visited April 2020
4. http://etutorials.org/SQL/oracle+dba+guide+to+data+warehousing+and+star+schemas/Chapter+7.+Implementing+Aggregates/Aggregation+by+Itself/
5. https://www.kimballgroup.com/2006/07/design-tip-81-fact-table-surrogate-key/
6. https://www.kimballgroup.com/2009/05/the-10-essential-rules-of-dimensional-modeling/
Florin Radulescu, Note de curs

59 DMDW-12
