Lec Slides Combined Mid Quiz With Old Quizzes
Introduction
Slides based on
Data Mining: Concepts and Techniques, 3 rd ed.
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
1
Course Outline
•ILO1: recall the fundamental concepts involved in the process of discovering useful,
possibly unexpected, patterns in large data sets
•ILO2: select appropriate techniques for data mining, considering the given problem
•ILO4: evaluate and implement an appropriate solution for a given data mining problem from
a wide array of application domains, and communicate the results.
2
Outline Syllabus
4
Lecture Panel
5
Introduction
◼ Why Data Mining?
◼ What Is Data Mining?
◼ Summary
6
Why Data Mining?
8
Evolution of Database Technology
◼ 1960s:
◼ Data collection, database creation, IMS and network DBMS
◼ 1970s:
◼ Relational data model, relational DBMS implementation
◼ 1980s:
◼ RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
◼ Application-oriented DBMS (spatial, scientific, engineering, etc.)
◼ 1990s:
◼ Data mining, data warehousing, multimedia databases, and Web
databases
◼ 2000s
◼ Stream data management and mining
◼ Data mining and its applications
◼ Web technology (XML, data integration) and global information systems
9
Chapter 1. Introduction
◼ Why Data Mining?
◼ What Is Data Mining?
◼ Summary
10
What Is Data Mining?
11
Knowledge Discovery (KDD) Process
◼ This is a view from typical database systems and data warehousing communities
◼ Data mining plays an essential role in the knowledge discovery process
(Figure: Databases → Data Cleaning → Data Integration → Task-relevant Data → Data Mining → Pattern Evaluation)
12
Example: A Web Mining Framework
13
Data Mining in Business Intelligence
(Figure: layers of increasing potential to support business decisions, from Data Exploration (statistical summary, querying, and reporting), through data presentation and exploration, up to Decision Making by the End User.)
15
KDD Process: A Typical View from ML and
Statistics
16
Example: Medical Data Mining
17
Chapter 1. Introduction
◼ Why Data Mining?
◼ What Is Data Mining?
◼ Summary
18
Multi-Dimensional View of Data Mining
◼ Data to be mined
◼ Database data (extended-relational, object-oriented, heterogeneous, legacy), data streams, etc.
◼ Techniques utilized
◼ Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, etc.
◼ Summary
20
Data Mining: On What Kinds of Data?
21
Chapter 1. Introduction
◼ Why Data Mining?
◼ Summary
22
Data Mining Function: (1) Generalization
23
Data Mining Function: (2) Association and
Correlation Analysis
◼ Frequent patterns (or frequent itemsets)
◼ What items are frequently purchased together in your
Walmart?
◼ Association, correlation vs. causality
◼ A typical association rule
◼ Diaper → Beer [0.5%, 75%] (support, confidence)
◼ Are strongly associated items also strongly correlated?
◼ How to mine such patterns and rules efficiently in large
datasets?
◼ How to use such patterns for classification, clustering,
and other applications?
24
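A small worked example may help make support and confidence concrete. The toy transactions below are hypothetical (not the slide's Walmart data), and the helper functions are only a minimal sketch of how the two measures for a rule such as Diaper → Beer are computed.

```python
# Hypothetical toy transactions illustrating support and confidence
# for a rule Diaper -> Beer (the 0.5%/75% figures on the slide come
# from a much larger, unspecified dataset).
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """confidence(lhs -> rhs) = support(lhs U rhs) / support(lhs)."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"diaper", "beer"}, transactions))       # 0.5
print(confidence({"diaper"}, {"beer"}, transactions))   # ~0.667
```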
Data Mining Function: (3) Classification
25
Data Mining Function: (4) Cluster Analysis
26
Data Mining Function: (5) Outlier Analysis
◼ Outlier analysis
◼ Outlier: A data object that does not
comply with the general behavior of the
data
◼ Noise or exception? ― One person’s
garbage could be another person’s
treasure
◼ Methods: by product of clustering or
regression analysis, …
◼ Useful in fraud detection, rare events
analysis
27
Time and Ordering: Sequential Pattern,
Trend and Evolution Analysis
◼ Sequence, trend and evolution analysis
◼ Trend, time-series, and deviation analysis: e.g.,
memory cards
◼ Periodicity analysis
◼ Similarity-based analysis
◼ Graph mining
◼ Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures
◼ Information network analysis: networks of friends, family, classmates, …
◼ Links carry a lot of semantic information: Link mining
◼ Web mining
◼ Web is a big information network: from PageRank to Google
29
Evaluation of Knowledge
◼ Are all mined knowledge interesting?
◼ One can mine tremendous amount of “patterns” and knowledge
◼ Some may fit only certain dimension space (time, location, …)
◼ Some may not be representative, may be transient, …
◼ Evaluation of mined knowledge → directly mine only
interesting knowledge?
◼ Descriptive vs. predictive
◼ Coverage
◼ Typicality vs. novelty
◼ Accuracy
◼ Timeliness
◼ …
30
Chapter 1. Introduction
◼ Why Data Mining?
◼ What Is Data Mining?
◼ Summary
31
Data Mining: Confluence of Multiple Disciplines
32
Why Confluence of Multiple Disciplines?
◼ Summary
34
Applications of Data Mining
◼ Web page analysis: from web page classification, clustering to PageRank &
HITS algorithms
◼ Collaborative analysis & recommender systems
◼ Basket data analysis to targeted marketing
◼ Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
◼ Data mining and software engineering (e.g., IEEE Computer, Aug. 2009
issue)
◼ From major dedicated data mining systems/tools (e.g., SAS, MS SQL-
Server Analysis Manager, Oracle Data Mining Tools) to invisible data
mining
35
Chapter 1. Introduction
◼ Why Data Mining?
◼ Summary
36
Major Issues in Data Mining (1)
◼ Mining Methodology
◼ Mining various and new kinds of knowledge
◼ Mining knowledge in multi-dimensional space
◼ Data mining: An interdisciplinary effort
◼ Boosting the power of discovery in a networked environment
◼ Handling noise, uncertainty, and incompleteness of data
◼ Pattern evaluation and pattern- or constraint-guided mining
◼ User Interaction
◼ Interactive mining
◼ Incorporation of background knowledge
◼ Presentation and visualization of data mining results
37
Major Issues in Data Mining (2)
38
Chapter 1. Introduction
◼ Why Data Mining?
◼ What Is Data Mining?
◼ Summary
39
A Brief History of Data Mining Society
41
Where to Find References? DBLP, CiteSeer, Google
◼ Summary
43
Summary
44
Recommended Reference Books
◼ S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan
Kaufmann, 2002
◼ R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000
◼ T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
◼ U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and
Data Mining. AAAI/MIT Press, 1996
◼ U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge
Discovery, Morgan Kaufmann, 2001
◼ J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed., 2011
◼ D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
◼ T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference,
and Prediction, 2nd ed., Springer-Verlag, 2009
◼ B. Liu, Web Data Mining, Springer 2006.
◼ T. M. Mitchell, Machine Learning, McGraw Hill, 1997
◼ G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
◼ P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
◼ S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
◼ I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, 2nd ed. 2005
45
Data Mining
— Association Rule Mining (ARM) [Ch 5] —
◼ Basic concepts
◼ Efficient and scalable frequent itemset mining
methods
◼ Mining various kinds of association rules
◼ From association mining to correlation
analysis
◼ Constraint-based association mining
◼ Summary
◼ Downward closure (Apriori) property: any subset of a frequent itemset must also be
frequent; if {beer, diaper, nuts} is frequent, so is {beer, diaper},
◼ i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
◼ Challenges
◼ Multiple scans of transaction database
◼ Huge number of candidates
◼ Tedious workload of support counting for candidates
◼ Improving Apriori: general ideas
◼ Reduce passes of transaction database scans
◼ Shrink number of candidates
◼ Facilitate support counting of candidates
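To make the level-wise candidate-generate-and-test idea above concrete, here is a minimal sketch of Apriori in Python. It is not the book's pseudocode verbatim; it assumes an absolute minimum support count and shows the join step, the subset-based prune step, and one database scan per level.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Minimal level-wise Apriori sketch: candidates of size k are built by
    joining frequent (k-1)-itemsets, pruned if any (k-1)-subset is infrequent,
    and counted with one scan of the transaction database per level."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) >= min_sup}
    all_freq, k = set(freq), 2
    while freq:
        # join step: candidate k-itemsets
        cands = {a | b for a in freq for b in freq if len(a | b) == k}
        # prune step: every (k-1)-subset must already be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # support counting: one scan of the database for this level
        freq = {c for c in cands
                if sum(c <= t for t in transactions) >= min_sup}
        all_freq |= freq
        k += 1
    return all_freq

db = [{"beer", "diaper", "nuts"}, {"beer", "diaper"}, {"diaper", "milk"}, {"beer", "chips"}]
print(apriori(db, min_sup=2))   # e.g. {beer}, {diaper}, {beer, diaper}, ...
```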
Support = 60 %
Conf. = 80 %
F-list = f, c, a, b, m, p
Step 3
Create the root of the tree, labelled with “null”
null
Step 4
Scan each transaction in the DB → generate the FP-tree
The items in each transaction are processed in f-list order
TID | Items bought            | Frequent items (F-list order)
100 | f, a, c, d, g, i, m, p  | f, c, a, m, p
200 | a, b, c, f, l, m, o     | f, c, a, b, m
300 | b, f, h, j, o, w        | f, b
400 | b, c, k, s, p           | c, b, p
500 | a, f, c, e, l, p, m, n  | f, c, a, m, p
(Figure: the FP-tree is grown under the "null" root one transaction at a time; shared prefixes reuse existing nodes and increment their counts, e.g. the f node grows from f:1 to f:4, with a separate c:1 branch created for TID 400.)
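The following is a minimal sketch (not the original FP-growth implementation) of Steps 3-4 above: create a root labelled "null", then insert each transaction's frequent items in F-list order, incrementing counts along shared prefix paths.

```python
class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fp_tree(transactions, f_list):
    """Sketch of FP-tree construction: each transaction's frequent items,
    ordered by the F-list, are inserted along a shared-prefix path from the
    'null' root, bumping the count of every node on the path."""
    root = FPNode("null")
    order = {item: rank for rank, item in enumerate(f_list)}
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            child = node.children.setdefault(item, FPNode(item, parent=node))
            child.count += 1
            node = child
    return root

# The table above: e.g. TID 100 contributes the path f -> c -> a -> m -> p
f_list = ["f", "c", "a", "b", "m", "p"]
db = [["f","a","c","d","g","i","m","p"],
      ["a","b","c","f","l","m","o"],
      ["b","f","h","j","o","w"],
      ["b","c","k","s","p"],
      ["a","f","c","e","l","p","m","n"]]
tree = build_fp_tree(db, f_list)
print(tree.children["f"].count)   # 4, matching the f:4 node in the figure
```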
◼ Completeness
◼ Preserve complete information for frequent pattern
mining
◼ Never break a long pattern of any transaction
◼ Compactness
◼ Reduce irrelevant info—infrequent items are gone
(Figure: the FP-tree rooted at {} together with its header table.)
Header Table (item : frequency): f : 4, c : 4, a : 3, b : 3, m : 3, p : 3
Conditional pattern bases (item : cond. pattern base):
c : f:3
a : fc:3
b : fca:1, f:1, c:1
m : fca:2, fcab:1
p : fcam:2, cb:1
(Figure: run time (sec.) plotted against support threshold (%), showing the scalability of frequent-itemset mining as the threshold decreases.)
◼ Divide-and-conquer:
◼ decompose both the mining task and DB according to
the frequent patterns obtained so far
◼ leads to focused search of smaller databases
◼ Other factors
◼ no candidate generation, no candidate test
◼ compressed database: FP-tree structure
◼ no repeated scan of entire database
◼ basic ops—counting local freq items and building sub
FP-tree, no pattern search and matching
Horizontal (transaction) format vs. vertical (bit-vector) format:
Tid | Items      | A B C D E
10  | A, C, D    | 1 0 1 1 0
20  | B, C, E    | 0 1 1 0 1
30  | A, B, C, E | 1 1 1 0 1
40  | B, E       | 0 1 0 0 1
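In the vertical representation, support counting reduces to intersecting tid-sets, the idea used by ECLAT-style vertical mining. This is a minimal sketch over the table above, not a full mining algorithm.

```python
# Vertical (tid-set) view of the table above; the support of an itemset is
# the size of the intersection of its items' tid-sets.
tidsets = {
    "A": {10, 30},
    "B": {20, 30, 40},
    "C": {10, 20, 30},
    "D": {10},
    "E": {20, 30, 40},
}

def vertical_support(itemset):
    """Return (support count, tid-set) for an itemset via tid-set intersection."""
    tids = set.intersection(*(tidsets[i] for i in itemset))
    return len(tids), tids

print(vertical_support({"B", "E"}))   # (3, {20, 30, 40})
print(vertical_support({"A", "C"}))   # (2, {10, 30})
```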
1
Lecture outline
1. Introduction to TS
2. Basic visualization techniques
3. Trend analysis and seasonality detection
4. Cyclic pattern analysis
5. Anomaly detection in TS
6. Stationarity in TS
7. Traditional TS forecasting methods
8. Advanced TS forecasting methods
2
Part 1.
Introduction to time series analysis and
mining
3
Time series everywhere!
4
What is a time series?
● A time series is an ordered
sequence of values of a variable
at equally spaced time intervals
● Typically, the mean and the
standard deviation can vary in a
time series
● Understanding time series data
helps predict future events based
on historical patterns
(Figure: example series illustrating trend and volatility.)
5
TS analysis, forecasting & mining
● Time series analysis
○ Techniques for identifying and analysing temporal dependencies in sequential data
● Time series mining
○ What has happened?
● Time series forecasting
○ What’s going to happen next?
6
Applications
● Finance
○ Sales forecasting
○ Inventory analysis
○ Stock market analysis
○ Price estimation
● Weather
○ Temperature
○ Climate change
○ Seasonal shifts
○ Rain, wind, snow, ...
● Industry
○ Usage
○ Anomaly
○ Maintenance
● Healthcare
○ Patient occupancy in hospitals
○ Disease outbreaks
○ Patient monitoring
● Insurance
○ Sales forecasting
○ Insurance benefits awarded
7
Difference between a time series and a random
variable with known mean and standard deviation
We might be able to build a distribution that explains all the data points in the time series. But
does that mean the distribution can be sampled to produce the time series?
8
Dependence on the past
● Memory of a TS
○ A few points having a similar trend
○ But there might be no overall trend
● Actual values tend to depend on the
value at the previous time point
○ Autocorrelation
9
Time Series Components
10
Time series components
Level: The average value in the series.
Reference
11
Trend
● Refers to the increasing or decreasing
value in the series.
● Can be linear or nonlinear.
● Methods to detect:
○ Moving Averages
○ Polynomial Fitting
12
Seasonal component
Periodic ups and downs. Period can be a year, several years, months, weeks or days.
13
How to measure/detect seasonality
1. Autocorrelation plots
2. FFT (Fast Fourier Transform)
3. Seasonal decomposition
14
Autocorrelation Plots
● Autocorrelation, often denoted as R(k), measures the correlation between
a time series and its lagged version.
● A significant spike at a particular lag indicates seasonality at that lag.
If there's seasonality present at a particular lag k, the autocorrelation plot will show a significant spike at
that lag. For example, if there's a yearly seasonality in monthly data, we would expect a significant
autocorrelation at lag 12. 15
Decomposing TS into components
Often this is done to help improve understanding of the time series, but it can
also be used to improve forecast accuracy.
16
Decomposing TS into components
17
Original
Seasonal
component
Trend-cycle
component
Residual
component
18
Seasonal and Trend Decomposition using LOESS (STL)
Reference 19
Steps in STL
1. Trend extraction
a. Applies LOESS smoother to time series.
b. Captures the central path or growth trajectory.
c. Not influenced by seasonality.
2. Detrending & Seasonal Extraction
a. Remove trend from original series.
b. Apply LOESS to detrended series to capture seasonality.
c. Seasonal pattern can be of any type: monthly, quarterly, etc.
3. Extract Residual Component
a. Residuals = Original Series - Trend - Seasonality.
b. Captures randomness and potential anomalies.
c. Helpful for model diagnostics.
20
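The statsmodels STL class follows the three steps above (LOESS trend, seasonal extraction, residual). The sketch below uses an assumed synthetic monthly series; only the class name STL and its period argument come from the library, the rest is illustrative.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Assumed example series: monthly data with a linear trend, yearly seasonality, noise
idx = pd.date_range("2015-01", periods=120, freq="MS")
t = np.arange(120)
y = pd.Series(0.05 * t + np.sin(2 * np.pi * t / 12)
              + np.random.default_rng(1).normal(scale=0.2, size=120), index=idx)

res = STL(y, period=12).fit()            # LOESS-based trend and seasonal extraction
trend, seasonal, resid = res.trend, res.seasonal, res.resid
seasonally_adjusted = y - seasonal       # remove the seasonal component
res.plot()                               # optional four-panel decomposition plot
```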
Seasonally adjusted data
21
TS decomposition using statsmodel package
Naive, or classical, decomposition method. Requires you to know if the model
is additive or multiplicative
24
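A minimal sketch of the classical decomposition mentioned above, again on an assumed synthetic series; note that the model type (additive vs. multiplicative) has to be declared up front.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Assumed example series: monthly data with trend and yearly seasonality
idx = pd.date_range("2015-01", periods=120, freq="MS")
t = np.arange(120)
y = pd.Series(5 + 0.05 * t + np.sin(2 * np.pi * t / 12), index=idx)

# Classical decomposition: you must choose additive or multiplicative yourself
result = seasonal_decompose(y, model="additive", period=12)
print(result.seasonal.head(12))   # the repeating seasonal pattern
result.plot()
```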
Detecting cycles
1. Spectral analysis
a. Transform time series into frequency domain using Fourier Transform.
b. Identify dominant frequencies (apart from seasonal frequencies).
c. The peaks in spectral density indicate cyclical components.
25
Detecting cycles
2. Wavelet transform
26
TS and regression
● Focused on identifying underlying trends and patterns
● Mathematically modeling / describing those patterns
● Predict/forecast future values
27
TS and regression
● Regression can model the trend over
time
○ Linear regression for mean demand. Function
of economic growth?
○ Within year cycles - seasonal variability?
○ Y = b1x1 + b2x2 + b3x3 + …
● TS allows you to model the process
without knowing the underlying causes
28
29
Signal and Noise
30
Signal and noise
31
Signal and noise: some definitions
● Statistical moments
○ Mean and standard deviation
● Stationary vs non-stationary
○ Trends in mean and/or standard deviation
○ Stationary - doesn’t depend on the time of observation
● Seasonality
○ Periodic patterns
● Autocorrelation
○ Degree to which the time series values in period t are related to time series values in
period t-1, t-2, …
32
Stationarity of a Time Series
33
Stationary TS
TS whose properties do not depend on the time of observation
No seasonality
No trend
34
Stationarity check
1. Visual inspection
2. Seasonal-Trend decomposition
3. Summary statistics at random partitions
4. Statistical tests
35
Stationarity check: Visual inspection
Mean
Variance
Seasonality
36
Stationarity check: Seasonal-Trend decomposition
38
Stationarity check: statistical tests
Augmented Dickey-Fuller test
TS is non-stationary!
39
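A minimal sketch of the Augmented Dickey-Fuller test with statsmodels, using two assumed synthetic series (a random walk and white noise) to show how the p-value is read: the ADF null hypothesis is that the series has a unit root, i.e. is non-stationary.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(2)
random_walk = np.cumsum(rng.normal(size=500))   # non-stationary by construction
white_noise = rng.normal(size=500)              # stationary by construction

for name, series in [("random walk", random_walk), ("white noise", white_noise)]:
    stat, pvalue, *_ = adfuller(series)
    # Null hypothesis: unit root (non-stationary); reject when p-value is small
    verdict = "non-stationary" if pvalue > 0.05 else "stationary"
    print(f"{name}: ADF stat = {stat:.2f}, p-value = {pvalue:.3f} -> {verdict}")
```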
Adjustments
1. Fixing non-constant variance
2. Trend removal
3. Fixing non-constant variance + trend removal
40
Fixing non-constant variance
41
Fixing non-constant variance
43
Fixing non-constant variance + trend removal
44
Part 2.
Time series forecasting
45
46
Simple forecasting methods
1. Average method
2. Naive method
3. Seasonal naive method
4. Drift method
47
Time series forecasting
Estimating future values of a time series
1. Exponential smoothing
2. Autoregressive methods
a. AR
b. MA
c. ARMA
d. ARIMA
e. SARIMA
48
Exponential smoothing
Simple Exponential Smoothing (SES) (Brown, 1956) is suitable for forecasting
data with no clear trend or seasonal pattern.
49
Exponential smoothing
● Double Exponential Smoothing (Holt, 1957)
○ Smoothing for level + trend
● Triple Exponential Smoothing (Holt-Winter, 1960)
○ Smoothing for trend + seasonality + level
50
Exponential smoothing - Python
51
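The Python slides above were shown as screenshots; the following is a hedged sketch of what such code could look like with statsmodels, on an assumed synthetic monthly series, covering simple (level only) and triple (level + trend + seasonality) exponential smoothing.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, ExponentialSmoothing

# Assumed example series: monthly data with trend and yearly seasonality
idx = pd.date_range("2015-01", periods=120, freq="MS")
t = np.arange(120)
y = pd.Series(20 + 0.1 * t + 3 * np.sin(2 * np.pi * t / 12), index=idx)

ses = SimpleExpSmoothing(y).fit()                        # SES: level only
holt_winters = ExponentialSmoothing(                     # triple smoothing:
    y, trend="add", seasonal="add", seasonal_periods=12  # level + trend + seasonality
).fit()

print(ses.forecast(6))            # flat forecast: no trend or seasonal terms
print(holt_winters.forecast(6))   # follows the trend and seasonal pattern
```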
Exponential smoothing - Python
52
Autoregressive model (AR)
In an autoregression model, we forecast the variable of interest using a linear
combination of past values of the variable. The term autoregression indicates
that it is a regression of the variable against itself.
53
Autoregressive model
54
Autoregressive model - Python
55
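A minimal AR sketch with statsmodels, fitting an autoregression to an assumed synthetic AR(2) process; the regression of the series on its own lagged values is exactly the idea described above.

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

# Assumed AR(2) process: y_t = 0.6*y_{t-1} - 0.2*y_{t-2} + noise
rng = np.random.default_rng(3)
y = np.zeros(300)
for t in range(2, 300):
    y[t] = 0.6 * y[t - 1] - 0.2 * y[t - 2] + rng.normal(scale=0.5)

model = AutoReg(y, lags=2).fit()      # regress the series on its past values
print(model.params)                   # intercept plus the two AR coefficients
print(model.predict(start=len(y), end=len(y) + 4))   # 5-step-ahead forecast
```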
Moving average model (MA)
Rather than using past values of the forecast variable in a regression, a moving
average model uses past forecast errors in a regression-like model.
56
Moving average model
57
Moving average model
● Parameter estimation:
○ Maximum likelihood estimation
○ Non-linear least squares estimation
58
Moving average model - python
59
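A hedged sketch of fitting an MA(q) model: in statsmodels an MA(1) can be expressed as ARIMA with order (0, 0, 1), and the parameters are estimated by maximum likelihood, matching the estimation methods listed above. The data are an assumed synthetic MA(1) process.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Assumed MA(1) process: y_t = eps_t + 0.7 * eps_{t-1}
rng = np.random.default_rng(4)
eps = rng.normal(size=301)
y = eps[1:] + 0.7 * eps[:-1]

# MA(q) as ARIMA(0, 0, q); coefficients are estimated by maximum likelihood
ma1 = ARIMA(y, order=(0, 0, 1)).fit()
print(ma1.params)              # const, ma.L1 (close to 0.7), sigma2
print(ma1.forecast(steps=3))   # forecasts revert to the mean after q steps
```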
Autoregressive moving average model (ARMA)
● Combines AR and MA models
○ AR accounts for autocorrelations in the TS (deterministic)
○ MA accounts for smoothing periodic fluctuations (stochastic)
61
Seasonal ARIMA (SARIMA)
SARIMA = ARMA + trend removal + seasonal adjustment
62
SARIMA - Python
63
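A minimal SARIMA sketch using statsmodels SARIMAX on an assumed synthetic monthly series: the non-seasonal order (p, d, q) handles differencing away the trend, while the seasonal order (P, D, Q, s) handles the yearly adjustment, mirroring the "ARMA + trend removal + seasonal adjustment" summary above.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Assumed example series: monthly data with trend and yearly seasonality
idx = pd.date_range("2010-01", periods=144, freq="MS")
t = np.arange(144)
y = pd.Series(50 + 0.3 * t + 10 * np.sin(2 * np.pi * t / 12)
              + np.random.default_rng(5).normal(scale=2, size=144), index=idx)

# d=1 differences away the trend, D=1 (with s=12) removes the yearly seasonality
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit(disp=False)
print(model.summary().tables[1])   # estimated AR, MA and seasonal coefficients
print(model.forecast(steps=12))    # one-year-ahead forecast
```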
Comparison of model assumptions
64
Model selection (hyperparameter tuning)
● Visualization
○ MA - Autocorrelation function (ACF) plot
○ AR - Partial autocorrelation function (PACF) plot
● Grid search
○ Test for different model orders exhaustively
○ A model that is more accurate and less complex is preferred
■ More accurate - fits the data
■ Less complex - smaller number of parameters
65
Autocorrelation plot (ACF)
66
Partial autocorrelation (PACF) plot
67
References
● Forecasting: Principles & Practice (2nd edition)
68
Time Series Mining &
Forecasting - 2
1
Time Series Part 2 - Lecture Outline
2
Correlation Correlation is a statistical measure that
expresses the extent to which two
variables are linearly related (meaning
they change together at a constant
rate)
3
Autocorrelation
4
Autocorrelation plot (ACF)
5
Partial autocorrelation (PACF) plot
6
7
Simple forecasting methods
1. Average method
2. Naive method
3. Seasonal naive method
4. Drift method
8
Time series forecasting
Estimating future values of a time series
1. Exponential smoothing
2. Autoregressive methods
a. AR
b. MA
c. ARMA
d. ARIMA
e. SARIMA
9
Exponential smoothing
Simple Exponential Smoothing (SES) (Brown, 1956) is suitable for forecasting
data with no clear trend or seasonal pattern.
10
Exponential smoothing
● Double Exponential Smoothing (Holt, 1957)
○ Smoothing for level + trend
● Triple Exponential Smoothing (Holt-Winter, 1960)
○ Smoothing for trend + seasonality + level
11
Exponential smoothing - Python
12
Exponential smoothing - Python
13
Autoregressive model (AR)
In an autoregression model, we forecast the variable of interest using a linear
combination of past values of the variable. The term autoregression indicates
that it is a regression of the variable against itself.
14
Autoregressive model
15
Autoregressive model - Python
16
Moving average model (MA)
Rather than using past values of the forecast variable in a regression, a moving
average model uses past forecast errors in a regression-like model.
17
Moving average model
18
Moving average model
● Parameter estimation:
○ Maximum likelihood estimation
○ Non-linear least squares estimation
19
Moving average model - python
20
Autoregressive moving average model (ARMA)
● Combines AR and MA models
○ AR accounts for autocorrelations in the TS (deterministic)
○ MA accounts for smoothing fluctuations (stochastic)
22
Seasonal ARIMA (SARIMA)
SARIMA = ARMA + trend removal + seasonal adjustment
23
24
SARIMA - Python
25
Comparison of model assumptions
26
Evaluation
● Residual plots:
○ ACF, PACF residual plots, histogram or density plot of residuals
27
Information Criteria
● AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion)
are used to compare and select models with different parameters.
● They take into account both the likelihood of the model and the number of
parameters used.
● The model with the lowest AIC or BIC is usually preferred.
● These criteria are often used to choose the order of ARIMA models (i.e., the
values of p, d, and q).
28
Model selection (hyperparameter tuning)
● Visualization
○ MA - Autocorrelation function (ACF) plot
○ AR - Partial autocorrelation function (PACF) plot
● Grid search
○ Test for different model orders exhaustively
○ A model that is more accurate and less complex is preferred
■ More accurate - fits the data
■ Less complex - smaller number of parameters
Based on the error metrics, residual analysis, and overfitting checks, select the model that
seems most likely to generalize well to new data. This is typically the model with low error on
the validation/test set, whose residuals are closest to white noise, and has lower AIC or BIC
values.
29
Time Series Mining
30
Time series mining tasks
1. Indexing (query by content)
2. Clustering
3. Classification
4. Prediction
5. Summarization
6. Anomaly detection
7. Segmentation
31
Indexing (query by content)
● Whole matching: a query time series is matched against a database of
individual time series to identify the ones similar to the query
● Subsequence Matching: a short query subsequence time series is
matched against longer time series by sliding it along the longer
sequence, looking for the best matching location.
● Brute force technique (linear or sequential matching)
○ Very slow
● Somewhat more efficient method:
○ Store a compressed version of the TS and use it for an initial comparison (lower bound)
using linear scan
● Even more efficient method:
○ Use an index structure that clusters similar sequences together
32
Indexing
Two types
1. Vector-based:
a. Original sequences are compressed using
dimensionality reduction
b. Resulting sequences grouped in the new
dimensions
c. Hierarchical (e.g. R-tree) or non-hierarchical
d. Performance deteriorates when dim > 5
2. Metric-based:
a. Superior in performance
b. Works for higher dimensionalities (dim < 30)
c. Require only distances between objects
d. Clusters objects using the distance between objects
33
Summarization
● Text/graphical descriptions (summaries)
● Anomaly detection and motif discovery are special cases of summarization
● Some of popular approaches for visualizing massive time series datasets
include TimeSearcher, Calendar-Based Visualization, Spiral and VizTree
● Read section 3.5 of this reference
34
Time series compression
TS compression helps reduce storage costs and allows efficient indexing
● Delta encoding
● Delta of delta encoding
● Simple 8b
● Run Length Encoding (RLE)
Reference
35
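Two of the compression ideas listed above are simple enough to sketch directly; the functions below are illustrative implementations of delta encoding and run-length encoding, not the Simple-8b or delta-of-delta codecs used by production TSDBs.

```python
def delta_encode(values):
    """Store the first value plus successive differences; deltas of a slowly
    changing series are small and compress well."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

def run_length_encode(values):
    """Collapse runs of identical values into (value, run length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

ts = [100, 101, 101, 101, 103, 104, 104]
print(delta_encode(ts))        # [100, 1, 0, 0, 2, 1, 0]
print(run_length_encode(ts))   # [(100, 1), (101, 3), (103, 1), (104, 2)]
assert delta_decode(delta_encode(ts)) == ts
```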
Similarity measures
Dissimilarity measures
Distance measures
36
Time series similarity measures
Similarity is the inverse of distance
37
Euclidean distance and Lp norms
For two sequences of length n,
consider each of them as a point
in n dimensional Euclidean space
39
Dynamic Time Warping (DTW)
● Useful when the two sequences have a similar shape but don’t line up
perfectly (i.e. out of phase) in the x axis
○ Euclidean distance doesn’t capture the similarity well
● DTW allows acceleration/deceleration of the signal (warping) along the
time (x) axis
● Example application: speech processing
40
Dynamic Time Warping
Basic idea
● Consider X = x1, x2, …, xn , and Y = y1, y2, …, yn
● We are allowed to extend each sequence by repeating
elements
● Euclidean distance now calculated between the extended
sequences X’ and Y’
● Matrix M, where mij = d(xi, yj)
Reference 41
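The matrix M of local distances is filled by dynamic programming; the sketch below is a plain quadratic-time DTW (no warping window or lower bounds), with two assumed out-of-phase sine waves to show why warping helps where Euclidean distance does not.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic-programming DTW: D[i, j] is the cost of the best warping
    path aligning x[:i] with y[:j]; each cell adds the local distance
    d(x_i, y_j) to the cheapest of the three allowed predecessor cells."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # repeat an element of y
                                 D[i, j - 1],      # repeat an element of x
                                 D[i - 1, j - 1])  # match the two elements
    return D[n, m]

# Two out-of-phase versions of the same shape: Euclidean distance stays large,
# DTW distance is small because warping re-aligns the peaks.
a = np.sin(np.linspace(0, 2 * np.pi, 50))
b = np.sin(np.linspace(0, 2 * np.pi, 50) + 0.5)
print(np.linalg.norm(a - b), dtw_distance(a, b))
```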
Motifs, anomalies, discords, matrix profiles
42
43
44
Time series representations
45
References and additional reading
● Book chapter on Time Series Mining
● Prophet by Meta
46
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 10 —
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
2
What is Cluster Analysis?
Cluster: A collection of data objects
similar (or related) to one another within the same group
3
Clustering for Data Understanding and
Applications
Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
Information retrieval: document clustering
Land use: Identification of areas of similar land use in an earth
observation database
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
City-planning: Identifying groups of houses according to their house
type, value, and geographical location
Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults
Climate: understanding earth climate, find patterns of atmospheric
and ocean
Economic Science: market resarch
4
Clustering as a Preprocessing Tool (Utility)
Summarization:
Preprocessing for regression, PCA, classification, and
association analysis
Compression:
Image processing: vector quantization
Finding K-nearest Neighbors
Localizing search to one or a small number of clusters
Outlier detection
Outliers are often viewed as those “far away” from any
cluster
5
Quality: What Is Good Clustering?
6
Measure the Quality of Clustering
Dissimilarity/Similarity metric
Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical,
ordinal ratio, and vector variables
Weights should be associated with different variables
based on applications and data semantics
Quality of clustering:
There is usually a separate “quality” function that
measures the “goodness” of a cluster.
It is hard to define “similar enough” or “good enough”
The answer is typically highly subjective
7
Considerations for Cluster Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs. non-
exclusive (e.g., one document may belong to more than one
class)
Similarity measure
Distance-based (e.g., Euclidian, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
Clustering space
Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
8
Requirements and Challenges
Scalability
Clustering all the data instead of only on samples
Ability to deal with different types of attributes: numerical, binary, categorical, ordinal, linked, and mixtures of these
Constraint-based clustering
User may give inputs on constraints
Use domain knowledge to determine input parameters
Interpretability and usability
Others
Discovery of clusters with arbitrary shape
High dimensionality
9
Major Clustering Approaches (I)
Partitioning approach:
Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects)
Density-based approach:
Based on connectivity and density functions
Grid-based approach:
based on a multiple-level granularity structure
10
Major Clustering Approaches (II)
Model-based:
A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model
Frequent pattern-based:
Based on the analysis of frequent patterns
User-guided or constraint-based:
Clustering by considering user-specified or application-specific
constraints
Typical methods: COD (obstacles), constrained clustering
Link-based clustering:
Objects are often linked together in various ways
11
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
12
Partitioning Algorithms: Basic Concept
E = Σ_{i=1}^{k} Σ_{p ∈ Ci} (p − c_i)²
Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented
by the center of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
13
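A minimal sketch of Lloyd's k-means (assignment step plus centroid update), which monotonically decreases the within-cluster sum of squares E defined above; the two-blob data and k = 2 are assumptions for illustration.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: alternate between assigning each point to its
    nearest centre and recomputing each centre as the mean of its cluster,
    which decreases E = sum_i sum_{p in Ci} ||p - c_i||^2 until convergence."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: nearest centre by squared Euclidean distance
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # update step: each centre becomes its cluster mean
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (50, 2)),
               np.random.default_rng(2).normal(5, 0.5, (50, 2))])
labels, centers = k_means(X, k=2)
print(centers)   # roughly one centre near (0, 0) and one near (5, 5)
```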
The K-Means Clustering Method
14
An Example of K-Means Clustering
K=2
Dissimilarity calculations
17
What Is the Problem of the K-Means Method?
(Figure: two example scatter plots on 0-10 axes illustrating a weakness of the K-means method.)
18
PAM: A Typical K-Medoids Algorithm
Total cost = 20
(Figure: PAM illustration on a 0-10 grid: arbitrarily choose k objects as the initial medoids; assign each remaining object to its nearest medoid; then, in a loop, compute the total cost of swapping a medoid with a randomly selected non-medoid object O_random, perform the swap if it improves quality, and repeat until no change.)
19
The K-Medoid Clustering Method
PAM works effectively for small data sets, but does not scale
well for large data sets (due to the computational complexity)
20
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
21
Hierarchical Clustering
Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input, but
needs a termination condition
Step 0 Step 1 Step 2 Step 3 Step 4
agglomerative
(AGNES)
a ab
b abcde
c
cde
d
de
e
divisive
Step 4 Step 3 Step 2 Step 1 Step 0 (DIANA)
22
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical packages, e.g., Splus
Use the single-link method and the dissimilarity matrix
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
(Figure: three scatter plots on 0-10 axes showing AGNES progressively merging points into larger clusters.)
23
Dendrogram: Shows How Clusters are Merged
24
DIANA (Divisive Analysis)
(Figure: three scatter plots on 0-10 axes showing DIANA progressively splitting one cluster into smaller ones, the inverse of AGNES.)
25
Distance between Clusters
Mean (diameter-style) distance between the elements of two clusters:
Dm = sqrt( Σ_{i=1}^{N} Σ_{j=1}^{N} (t_ip − t_jq)² / ( N(N − 1) ) )
27
Extensions to Hierarchical Clustering
Major weakness of agglomerative clustering methods
29
Clustering Feature Vector in BIRCH
CF = (N, LS, SS) = (5, (16,30), (54,190))
N: number of data points
LS: linear sum of the N points, Σ_{i=1}^{N} X_i
SS: square sum of the N points, Σ_{i=1}^{N} X_i²
(Figure: the five points (3,4), (2,6), (4,5), (4,7), (3,8) plotted on a 0-10 grid.)
30
CF-Tree in BIRCH
Clustering feature:
Summary of the statistics for a given subcluster: the 0-th, 1st, and 2nd moments of the subcluster
A CF-tree is a height-balanced tree that stores clustering features in its nodes
31
The CF Tree Structure
Root
Non-leaf node
CF1 CF2 CF3 CF5
child1 child2 child3 child5
32
The Birch Algorithm
Cluster diameter: D = sqrt( (1 / (n(n − 1))) Σ_{i≠j} (x_i − x_j)² )
For each new point: find the closest leaf entry and update its CF; if the entry's diameter exceeds the threshold, split the leaf and possibly its parents
Algorithm is O(n)
Concerns
Sensitive to insertion order of data points
Since we fix the size of leaf nodes, clusters may not be so natural
Clusters tend to be spherical given the radius and diameter measures
33
CHAMELEON: Hierarchical Clustering Using
Dynamic Modeling (1999)
CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
Measures the similarity based on a dynamic model
Two clusters are merged only if the interconnectivity
and closeness (proximity) between two clusters are
high relative to the internal interconnectivity of the
clusters and closeness of items within the clusters
Graph-based, and a two-phase algorithm
1. Use a graph-partitioning algorithm: cluster objects into
a large number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm:
find the genuine clusters by repeatedly combining
these sub-clusters
34
Overall Framework of CHAMELEON
Construct (K-NN)
Sparse Graph Partition the Graph
Data Set
K-NN Graph
p and q are connected if q is among the top k closest neighbors of p
Merge partitions
Relative interconnectivity:
connectivity of c1 and c2
over internal connectivity
Final Clusters
Relative closeness:
closeness of c1 and c2 over
internal closeness
35
CHAMELEON (Clustering Complex Objects)
36
Probabilistic Hierarchical Clustering
Algorithmic hierarchical clustering
Nontrivial to choose a good distance measure
Hard to handle missing attribute values
Optimization goal not clear: heuristic, local search
Probabilistic hierarchical clustering
Use probabilistic models to measure distances between clusters
Generative model: Regard the set of data objects to be clustered as
a sample of the underlying data generation mechanism to be
analyzed
Easy to understand, same efficiency as algorithmic agglomerative
clustering method, can handle partially observed data
In practice, assume the generative models adopt common distributions
functions, e.g., Gaussian distribution or Bernoulli distribution, governed
by parameters
37
Generative Model
Given a set of 1-D points X = {x1, …, xn} for clustering
analysis & assuming they are generated by a
Gaussian distribution:
38
A Probabilistic Hierarchical Clustering Algorithm
For a set of objects partitioned into m clusters C1, . . . ,Cm, the quality
can be measured by,
40
Density-Based Clustering Methods
Handle noise
One scan
41
Density-Based Clustering: Basic Concepts
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-
neighbourhood of that point
NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if
p belongs to NEps(q)
core point condition: |NEps(q)| ≥ MinPts
(Figure: p directly density-reachable from q; MinPts = 5, Eps = 1 cm)
42
Density-Reachable and Density-Connected
Density-reachable:
A point p is density-reachable from p
a point q w.r.t. Eps, MinPts if there p1
is a chain of points p1, …, pn, p1 = q
q, pn = p such that pi+1 is directly
density-reachable from pi
Density-connected
A point p is density-connected to a p q
point q w.r.t. Eps, MinPts if there is
a point o such that both, p and q o
are density-reachable from o w.r.t.
Eps and MinPts
43
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases
with noise
Outlier
Border
Eps = 1cm
Core MinPts = 5
44
DBSCAN: The Algorithm
Arbitrarily select a point p
Retrieve all points density-reachable from p w.r.t. Eps and MinPts
If p is a core point, a cluster is formed
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
Continue the process until all of the points have been processed
45
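A minimal usage sketch with scikit-learn's DBSCAN on an assumed two-moons dataset (the eps and min_samples values are illustrative, not tuned): the density-based notion of cluster recovers the arbitrary-shaped clusters and labels low-density points as noise (-1).

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a shape centroid-based methods struggle with,
# but a density-based method separates; sparse points are labelled -1 (noise).
X, _ = make_moons(n_samples=300, noise=0.06, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)   # Eps and MinPts
print(sorted(set(labels)))                # e.g. [-1, 0, 1]: noise plus two clusters
print(np.sum(labels == -1), "noise points")
```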
DBSCAN: Sensitive to Parameters
46
OPTICS: A Cluster-Ordering Method (1999)
techniques
47
OPTICS: Some Extension from DBSCAN
Index-based: k = number of dimensions, N = 20, p = 75%, M = N(1 − p) = 5
Complexity: O(N log N)
Core distance of an object: the minimum eps such that the object is a core object
Reachability distance of p from o: max(core-distance(o), d(o, p))
Example (MinPts = 5, ε = 3 cm): r(p1, o) = 2.8 cm, r(p2, o) = 4 cm
48
(Figure: reachability-distance (with undefined values at the top and thresholds ε, ε′) plotted against the cluster order of the objects.)
49
Density-Based Clustering: OPTICS & Its Applications
50
DENCLUE: Using Statistical Density Functions
Influence of y on x: f_Gaussian(x, y) = e^( −d(x, y)² / (2σ²) )
Overall density function: f_Gaussian^D(x) = Σ_{i=1}^{N} e^( −d(x, x_i)² / (2σ²) )
Gradient of x in the direction of x_i: ∇f_Gaussian^D(x, x_i) = Σ_{i=1}^{N} (x_i − x) · e^( −d(x, x_i)² / (2σ²) )
Major features
Solid mathematical foundation
Good for data sets with large amounts of noise
Allows a compact mathematical description of arbitrarily shaped
clusters in high-dimensional data sets
Significantly faster than existing algorithms (e.g., DBSCAN)
But needs a large number of parameters
51
Denclue: Technical Essence
Uses grid cells but only keeps information about grid cells that do
actually contain data points and manages these cells in a tree-based
access structure
Influence function: describes the impact of a data point within its
neighborhood
Overall density of the data space can be calculated as the sum of the
influence function of all data points
Clusters can be determined mathematically by identifying density
attractors
Density attractors are local maximal of the overall density function
Center defined clusters: assign to each density attractor the points
density attracted to it
Arbitrary shaped cluster: merge density attractors that are connected
through paths of high density (> threshold)
52
Density Attractor
53
Center-Defined and Arbitrary
54
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
55
Grid-Based Clustering Method
56
STING: A Statistical Information Grid Approach
(i-1)st layer
i-th layer
57
The STING Clustering Method
Each cell at a high level is partitioned into a number of
smaller cells in the next lower level
Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
Parameters of higher level cells can be easily calculated
from parameters of lower level cell
count, mean, s, min, max
59
CLIQUE (Clustering In QUEst)
Partition the data space and find the number of points that
lie inside each cell of the partition.
Identify the subspaces that contain clusters using the
Apriori principle
Identify clusters
Determine dense units in all subspaces of interests
Determine connected dense units in all subspaces of
interests.
Generate minimal description for the clusters
Determine maximal regions that cover a cluster of
connected dense units for each cluster
Determination of minimal cover for each cluster
61
(Figure: CLIQUE example: Salary (×10,000) vs. age and Vacation (weeks) vs. age grids over age 20-60, with density threshold τ = 3, showing the dense units found in each 2-D subspace.)
62
Strength and Weakness of CLIQUE
Strength
automatically finds subspaces of the highest dimensionality that contain high-density clusters
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
64
Assessing Clustering Tendency
Assess if non-random structure exists in the data by measuring the
probability that the data is generated by a uniform data distribution
Test spatial randomness by a statistical test: the Hopkins Statistic
Given a dataset D regarded as a sample of a random variable o, determine how far o is from being uniformly distributed in the data space
Elbow method
Use the turning point in the curve of the sum of within-cluster variance with respect to the number of clusters
E.g., For each point in the test set, find the closest centroid, and
use the sum of squared distance between all points in the test set
and the closest centroids to measure how well the model fits the
test set
For any k > 0, repeat it m times, compare the overall quality measure
w.r.t. different k’s, and find # of clusters that fits the data the best
66
Measuring Clustering Quality
67
Measuring Clustering Quality: Extrinsic Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
69
Summary
Cluster analysis groups objects based on their similarity and has
wide applications
Measure of similarity can be computed for various types of data
Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
Birch and Chameleon are interesting hierarchical clustering
algorithms, and there are also probabilistic hierarchical clustering
algorithms
DBSCAN, OPTICS, and DENCLU are interesting density-based
algorithms
STING and CLIQUE are grid-based methods, where CLIQUE is also
a subspace clustering algorithm
Quality of clustering results can be evaluated in various ways
70
CS512-Spring 2011: An Introduction
Coverage
Cluster Analysis: Chapter 11
Outlier Detection: Chapter 12
Mining Sequence Data: BK2: Chapter 8
Mining Graphs Data: BK2: Chapter 9
Social and Information Network Analysis
BK2: Chapter 9
Partial coverage: Mark Newman: “Networks: An Introduction”, Oxford U., 2010
Scattered coverage: Easley and Kleinberg, “Networks, Crowds, and Markets:
Reasoning About a Highly Connected World”, Cambridge U., 2010
Recent research papers
Mining Data Streams: BK2: Chapter 8
Requirements
One research project
One class presentation (15 minutes)
Two homeworks (no programming assignment)
Two midterm exams (no final exam)
71
References (1)
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace
clustering of high dimensional data for data mining applications. SIGMOD'98
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points
to identify the clustering structure, SIGMOD’99.
Beil F., Ester M., Xu X.: "Frequent Term-Based Text Clustering", KDD'02
M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based
Local Outliers. SIGMOD 2000.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for
discovering clusters in large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial
databases: Focusing techniques for efficient class identification. SSD'95.
D. Fisher. Knowledge acquisition via incremental conceptual clustering.
Machine Learning, 2:139-172, 1987.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. VLDB’98.
V. Ganti, J. Gehrke, R. Ramakrishan. CACTUS Clustering Categorical Data
Using Summaries. KDD'99.
72
References (2)
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. In Proc. VLDB’98.
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for
large databases. SIGMOD'98.
S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for
categorical attributes. In ICDE'99, pp. 512-521, Sydney, Australia, March
1999.
A. Hinneburg and D. A. Keim: An Efficient Approach to Clustering in Large
Multimedia Databases with Noise. KDD’98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall,
1988.
G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical
Clustering Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 68-75,
1999.
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to
Cluster Analysis. John Wiley & Sons, 1990.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large
datasets. VLDB’98.
73
References (3)
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to
Clustering. John Wiley and Sons, 1988.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional Data: A
Review, SIGKDD Explorations, 6(1), June 2004
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large
data sets. Proc. 1996 Int. Conf. on Pattern Recognition
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution
clustering approach for very large spatial databases. VLDB’98.
A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based Clustering
in Large Databases, ICDT'01.
A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles,
ICDE'01
H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data
sets, SIGMOD’02
W. Wang, J. Yang, and R. Muntz, STING: A Statistical Information Grid Approach to Spatial
Data Mining, VLDB’97
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : An efficient data clustering method
for very large databases. SIGMOD'96
X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient Clustering via Heterogeneous
Semantic Links”, VLDB'06
74
Slides unused in class
75
A Typical K-Medoids Algorithm (PAM)
Total cost = 20
(Same PAM illustration as shown earlier: arbitrarily choose k objects as initial medoids, assign each remaining object to its nearest medoid, then repeatedly swap a medoid with a non-medoid object O_random whenever the total cost improves, until no change.)
76
PAM (Partitioning Around Medoids) (1987)
78
What Is the Problem with PAM?
79
CLARA (Clustering Large Applications)
(1990)
81
ROCK: Clustering Categorical Data
Major ideas
Use links to measure similarity/proximity
Not distance-based
Experiments
Congressional voting, mushroom data
82
Similarity Measure in ROCK
Traditional measures for categorical data may not work well, e.g.,
Jaccard coefficient
Example: Two groups (clusters) of transactions
C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e},
{a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
Jaccard co-efficient may lead to wrong clustering result
C1: 0.2 ({a, b, c}, {b, d, e}} to 0.5 ({a, b, c}, {a, b, d})
C1 & C2: could be as high as 0.5 ({a, b, c}, {a, b, f})
Jaccard coefficient-based similarity function: Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
Ex. Let T1 = {a, b, c}, T2 = {c, d, e}:
Sim(T1, T2) = |{c}| / |{a, b, c, d, e}| = 1/5 = 0.2
83
Link Measure in ROCK
Clusters
C1:<a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e},
{b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
C2: <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
Neighbors
Two transactions are neighbors if sim(T1,T2) > threshold
Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}
T1 connected to: {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {b,c,d}, {b,c,e},
{a,b,f}, {a,b,g}
T2 connected to: {a,c,d}, {a,c,e}, {a,d,e}, {b,c,e}, {b,d,e}, {b,c,d}
Link Similarity
Link similarity between two transactions is the # of common neighbors
link(T1, T2) = 4, since they have 4 common neighbors
{a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
link(T1, T3) = 3, since they have 3 common neighbors
{a, b, d}, {a, b, e}, {a, b, g}
84
Aggregation-Based Similarity Computation
(Figure: node 4 and node 5 in ST2 with s(n4, n5) = 0.2; node 4 links to nodes 10, 11, 12 with similarities 0.9, 1.0, 0.8, and node 5 links to nodes 13, 14 with similarities 0.9, 1.0; a and b are the corresponding nodes in ST1.)
For each node nk ∈ {n10, n11, n12} and nl ∈ {n13, n14}, their path-based similarity is simp(nk, nl) = s(nk, n4) · s(n4, n5) · s(n5, nl).
sim(na, nb) = ( Σ_{k=10}^{12} s(nk, n4) / 3 ) · s(n4, n5) · ( Σ_{l=13}^{14} s(n5, nl) / 2 ) = 0.9 × 0.2 × 0.95 = 0.171
86
Computing Similarity with Aggregation
Store, for each pair, the average similarity and total weight, e.g., a: (0.9, 3) and b: (0.95, 2), with s(n4, n5) = 0.2.
(Figure: Mary linked to conference nodes aaai04 and aaai05 under aaai.)
SimRank (Jeh & Widom, KDD 2002): two objects are similar if they are linked with the same or similar objects.
Issue: expensive to compute. For a dataset of N objects and M links, it takes O(N²) space and O(M²) time to compute all similarities.
89
Observation 1: Hierarchical Structures
(Figure: hierarchical structures, e.g., All → {grocery, electronics, apparel}, electronics → {TV, DVD, camera}; similarly, Articles organized over Words.)
90
Observation 2: Distribution of Similarity
(Figure: histogram of the distribution of pairwise similarity values among DBLP authors, with similarity values ranging from 0 to about 0.24.)
(Figure: product hierarchy: Consumer electronics → Digital Cameras → {Canon A40 digital camera, Sony V3 digital camera}; Consumer electronics → TVs; Apparels.)
92
Similarity Defined by SimTree
Similarity between two
sibling nodes n1 and n2
(Figure: SimTree with sibling nodes n1, n2, n3 (s(n1, n2) = 0.2), their children n4, n5, n6 (s(n4, n5) = 0.3), and leaves n7, n8, n9; parent-child edge weights 0.8, 0.9, 0.9 and 0.9, 0.8, 1.0; the adjustment ratio is attached to node n7.)
Path-based node similarity
simp(n7,n8) = s(n7, n4) x s(n4, n5) x s(n5, n8)
Similarity between two nodes is the average similarity
between objects linked with them in other SimTrees
Average similarity between x and all other nodes
Adjustment ratio for x =
Average similarity between x’s parent and all other nodes
93
LinkClus: Efficient Clustering via
Heterogeneous Semantic Links
Method
Initialize a SimTree for objects of each type
similar to
For details: X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient
Clustering via Heterogeneous Semantic Links”, VLDB'06
94
Initialization of SimTrees
Initializing a SimTree
Repeatedly find groups of tightly related nodes, which are merged into higher-level nodes
95
Finding Tight Groups by Freq. Pattern Mining
Finding tight groups is reduced to frequent pattern mining: the tightness of a group of nodes is the support of a frequent pattern.
Transactions: 1 {n1}, 2 {n1, n2}, 3 {n2}, 4 {n1, n2}, 5 {n1, n2}, 6 {n2, n3, n4}, 7 {n4}, 8 {n3, n4}, 9 {n3, n4}
Resulting groups: g1 = {n1, n2}, g2 = {n3, n4}
Procedure of initializing a tree
Start from leaf nodes (level-0)
(Figure: a three-level SimTree built bottom-up: leaves n7, n8, n9 under n4, n5, n6, which are in turn under n1, n2, n3, with example similarities 0.9 and 0.8.)
Complexity: updating similarities takes O(M(log N)²) time and O(M + N) space; adjusting tree structures takes O(N) time and O(N) space.
98
Experiment: Email Dataset
F. Nielsen. Email dataset: www.imm.dtu.dk/~rem/data/Email-1431.zip
370 emails on conferences, 272 on jobs, and 789 spam emails
Accuracy: measured against manually labeled data, as the % of pairs of objects in the same cluster that share a common label

Approach   | Accuracy | Time (s)
LinkClus   | 0.8026   | 1579.6
SimRank    | 0.7965   | 39160
ReCom      | 0.5711   | 74.6
F-SimRank  | 0.3688   | 479.7
CLARANS    | 0.4768   | 8.55
Approaches compared:
SimRank (Jeh & Widom, KDD 2002): Computing pair-wise similarities
SimRank with FingerPrints (F-SimRank): Fogaras & Rácz, WWW 2005
pre-computes a large sample of random paths from each object and uses
samples of two objects to estimate SimRank similarity
ReCom (Wang et al. SIGIR 2003)
Iteratively clustering objects using cluster labels of linked objects
99
WaveCluster: Clustering by Wavelet Analysis (1998)
100
The WaveCluster Algorithm
How to apply wavelet transform to find clusters
Summarizes the data by imposing a multidimensional grid
101
Quantization
& Transformation
Quantize data into m-D grid structure,
then wavelet transform
a) scale 1: high resolution
b) scale 2: medium resolution
c) scale 3: low resolution
102
Question 15
Consider two postings lists X and Y with lengths x and y. If the largest Doc IDs in X and Y
are n1 and n2 respectively, what is the worst-case time complexity of the merge?
Select one:
a. O(n1 + n2) operations
b. O(y · n2) operations
c. O(x · n1) operations
d. O(x + y) operations
e. None of the above
Answer: 11
c. Phrase Search
c. Unstructured data
Answer: 0.02
Documents not actually containing Brutus: 5, 100
Answer: 0.833
skies 271658
tangerine 46653
trees 316812
Select one:
a. (marmalade OR skies) AND (kaleidoscope OR eyes) AND (tangerine OR trees)
The correct answer is: (kaleidoscope OR eyes) AND (tangerine OR trees) AND (marmalade OR skies)
Documents not actually containing Brutus: 5, 100
Answer: 0.909
Measure the relevance of each document to a user query - Ranking
Group a set of documents based on their content - Clustering
Given a set of topics, decide the most relevant topic a document belongs to - Classification
The correct answer is: Measure the relevance of each document to a user query - Ranking;
Group a set of documents based on their content - Clustering; Given a set of topics, decide the
most relevant topic a document belongs to - Classification
The correct answers are: to improve the efficiency of the search engine at query time, to use
in ranked retrieval systems.
The correct answers are: Find titles according to their relevance to a given query, Find
diagrams related to the given query.
The correct answers are: Collect the documents to be indexed, Tokenize the text, Perform
linguistic preprocessing of tokens, Index the documents that each term occurs in.
The correct answer is: Search and retrieve patient records - Institutional IR system; Search a
user profile in Facebook - Web/Search and Social Media information retrieval; Search in an
email client program - Personal IR system
The correct answer is: The new trend is to keep stop words in postings lists
c. Checking available crawled data from other robots, so you may not need to
implement your own robot
d. Keeping your crawler's raw data and sharing the results publicly.
The correct answer is: Implementing a fast-crawling robot to make sure the process finishes
within a reasonable amount of time.
b. Since some of the languages don't use spaces between tokens, a unique set of tokens
can't always be guaranteed.
The correct answers are: In Unicode, there can be scenarios where more than one
Unicode code point is used to represent one character; Using Unicode could expose
systems to security vulnerabilities; Since some of the languages don't use spaces between
tokens, a unique set of tokens can't always be guaranteed.
c. Flexible faceting, advanced, and adaptable search behavior that can be customized
based on your needs. Faceting is the arrangement of search results based on real-time
indexing of document fields
d. All of these
c. Case Folding
d. True casing
c. start
d. rows
e. All of these
c. False Negative
d. True Negative
e. False Positive
The correct answers are: Size of the index table, False Positive
The correct answer is: A simple and efficient way to do stemming: stripping off affixes.
d. Postings list
Answer: 21
c. Positional index
d. Inverted Index
Select one:
a. best => 0.2, eleven => 0, football => 0.50, need => 0, players => 0.36
b. best => 0.2, eleven => 0.36, football => 0.50, need => 0.36, players => 0.36
c. best => 0, eleven => 0.46, football => 0.60, need => 0.46, players => 0.46
Select one:
a. 0.78
b. 0.50
c. 0.48
d. 0.58
Select one:
a. best => 0.52, eleven => 0, football => 0, need => 0.2, players => 0.62
b. best => 0.52, eleven => 0, football => 0.58, need => 0, players => 0.54
d. best => 0.62, eleven => 0, football => 0.48, need => 0, players => 0.62
Question 6
The tf-idf weight is a metric derived by taking the log of N divided by the document frequency,
where N is the total number of documents in a collection.
Select one:
True
False
Select one:
a. PaP and WH
b. SaS and WH
c. Cosine similarity
d. Sine similarity
Question 2
Calculate the Jaccard coefficient between the following query and document.
Query: idea of march
Document: caesar died in march
Select one:
a. 0.154
b. 0.143
c. 0.167
d. 0.2
Answer OA
Question 9
What is the recall of the system for the given query?
Provide the answer to one decimal place.
Answer: 0.5
Question 10
Which of the following will help to improve the recall of this query?
Select one:
a. Stemming ✓
b. Spell Correction
5. Management Systems
Question 5
In a Boolean retrieval system, stemming never lowers precision. True or False?
Which of the following orderings orders the data structures by their average lookup time in
increasing order?
Select one:
1. a, b, c
2. c, b, a
3. c, a, b ✓
In the vector space model, the dimensionality of the query vector is:
Select one:
a. the number of documents in the collection
b. the number of terms in the index ✓
Term index without stemming - 3rd highest recall ✗
Biword index without stemming - 2nd highest recall ✗
c. Periodic re-indexing
d. None
d. Phrase queries
d. NNX
c. Phonetic
d. Synonym
Question 4
What are the filters that can be found in Apache Solr?
Select one or more:
a. Edge N-Gram Filter ✓
d. Synonym Filter
Question 2
A data structure that maps terms back to the parts of a document in which they occur is called
a/an:
Select one:
a. Postings list
b. Incidence Matrix
c. Dictionary
d. Inverted Index ✓