Lec Slides Combined Mid Quiz With Old Quizzes

The document outlines the CS3620 Data Mining course, detailing its objectives, syllabus, evaluation methods, and the importance of data mining in various fields. It covers fundamental concepts, techniques, and applications of data mining, including pattern mining, clustering, outlier analysis, and the evolution of database technology. The course aims to equip students with the skills to analyze and implement data mining solutions across diverse domains.

CS3620: Data Mining

Introduction

Dr. Amal Shehan Perera

Slides based on
Data Mining: Concepts and Techniques, 3 rd ed.
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.

1
Course Outline

Module Code: CS3620          Module Title: Data Mining
Credits: 3.0                 GPA/NGPA: GPA
Hours/Week: Lectures 2, Lab/Assignments 1
Pre-requisites: Introduction to Data Science

•ILO1: recall the fundamental concepts involved in the process of discovering useful,
possibly unexpected, patterns in large data sets

•ILO2: select appropriate techniques for data mining, considering the given problem

•ILO3: analyse the computational challenges in implementing modern data mining systems

•ILO4: evaluate and implement an appropriate solution for a given data mining problem from
a wide array of application domains, and communicate the results
2
Outline Syllabus

T1. Introduction to Data Mining and Applications [ILO1]


T2. Pattern Mining [ILO 1,2,3]
T3. Time Series Mining and Forecasting [ILO 1,2,3]
T4. Clustering and Cluster Evaluation [ILO 1,2,3]
T5. Outlier Analysis [ILO 1,2,3]
T6. Introduction to NoSQL Databases [ILO 1,3]
T7. Data Warehouses and Data Lakes [ILO 3,4]
T8. Datacubes and Online Analytical Processing [ILO 3,4]
T9. Introduction to Information Retrieval [ILO 1,4]
T10. Data Mining and Society [ILO 1,4]
3
Final Evaluation

Category                    | % of Final Grade | Computation      | Description
Weekly Pop Quizzes          | 10               | Best 5 out of 10 | 5-10 min online quiz based on the previous week's content.
Mini Projects / Assignments | 25               | Best 4 out of 5  | Project or assignment based on work covered in the previous 2-3 weeks; about 6 h of student work, to be completed in 2 to 3 weeks.
Mid Sem Quiz                | 15               |                  |
Final Exam                  | 50               |                  |

4
Lecture Panel

1. Dr. Shehan Perera


2. Dr. Thanuja Ambegoda
3. Dr. Sapumal Ahangama
4. Dr. Budhika Karunaratne

5
Introduction
◼ Why Data Mining?
◼ What Is Data Mining?

◼ A Multi-Dimensional View of Data Mining

◼ What Kind of Data Can Be Mined?

◼ What Kinds of Patterns Can Be Mined?

◼ What Technologies Are Used?


◼ What Kind of Applications Are Targeted?

◼ Major Issues in Data Mining


◼ A Brief History of Data Mining and Data Mining Society

◼ Summary
6
Why Data Mining?

◼ The Explosive Growth of Data: from terabytes to petabytes


◼ Data collection and data availability
◼ Automated data collection tools, database systems, Web,
computerized society
◼ Major sources of abundant data
◼ Business: Web, e-commerce, transactions, stocks, …
◼ Science: Remote sensing, bioinformatics, scientific
simulation, …
◼ Society and everyone: news, digital cameras, YouTube
◼ We are drowning in data, but starving for knowledge!
◼ “Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets
7
Evolution of Sciences
◼ Before 1600, empirical science
◼ 1600-1950s, theoretical science
◼ Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
◼ 1950s-1990s, computational science
◼ Over the last 50 years, most disciplines have grown a third, computational branch
(e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
◼ Computational Science traditionally meant simulation. It grew out of our inability to
find closed-form solutions for complex mathematical models.
◼ 1990-now, data science
◼ The flood of data from new scientific instruments and simulations
◼ The ability to economically store and manage petabytes of data online
◼ The Internet and computing Grid that makes all these archives universally accessible
◼ Scientific info. management, acquisition, organization, query, and visualization tasks
scale almost linearly with data volumes. Data mining is a major new challenge!
◼ Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science,
Comm. ACM, 45(11): 50-54, Nov. 2002

8
Evolution of Database Technology
◼ 1960s:
◼ Data collection, database creation, IMS and network DBMS
◼ 1970s:
◼ Relational data model, relational DBMS implementation
◼ 1980s:
◼ RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
◼ Application-oriented DBMS (spatial, scientific, engineering, etc.)
◼ 1990s:
◼ Data mining, data warehousing, multimedia databases, and Web
databases
◼ 2000s
◼ Stream data management and mining
◼ Data mining and its applications
◼ Web technology (XML, data integration) and global information systems
9
Chapter 1. Introduction
◼ Why Data Mining?
◼ What Is Data Mining?

◼ A Multi-Dimensional View of Data Mining

◼ What Kind of Data Can Be Mined?

◼ What Kinds of Patterns Can Be Mined?

◼ What Technologies Are Used?


◼ What Kind of Applications Are Targeted?

◼ Major Issues in Data Mining


◼ A Brief History of Data Mining and Data Mining Society

◼ Summary
10
What Is Data Mining?

◼ Data mining (knowledge discovery from data)


◼ Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amount of data
◼ Data mining: a misnomer?
◼ Alternative names
◼ Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
◼ Watch out: Is everything “data mining”?
◼ Simple search and query processing
◼ (Deductive) expert systems

11
Knowledge Discovery (KDD) Process
◼ This is a view from typical database systems and data
warehousing communities
◼ Data mining plays an essential role in the knowledge
discovery process

KDD process (bottom to top):
Databases → Data Cleaning → Data Integration → Data Warehouse
→ Selection (task-relevant data) → Data Mining → Pattern Evaluation
12
Example: A Web Mining Framework

◼ Web mining usually involves


◼ Data cleaning
◼ Data integration from multiple sources
◼ Warehousing the data
◼ Data cube construction
◼ Data selection for data mining
◼ Data mining
◼ Presentation of the mining results
◼ Patterns and knowledge to be used or stored into
knowledge-base

13
Data Mining in Business Intelligence

Increasing potential to support business decisions (top of the pyramid down):

Decision Making (End User)
Data Presentation, Visualization Techniques (Business Analyst)
Data Mining, Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses (DBA)
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
14
Example: Mining vs. Data Exploration

◼ Business intelligence view


◼ Warehouse, data cube, reporting but not much mining

◼ Business objects vs. data mining tools

◼ Supply chain example: tools

◼ Data presentation

◼ Exploration

15
KDD Process: A Typical View from ML and
Statistics

Input Data → Data Pre-Processing → Data Mining → Post-Processing

Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization

◼ This is a view from typical machine learning and statistics communities

16
Example: Medical Data Mining

◼ Health care & medical data mining often adopt the view
taken in statistics and machine learning
◼ Preprocessing of the data (including feature
extraction and dimension reduction)
◼ Classification or/and clustering processes
◼ Post-processing for presentation

17
Chapter 1. Introduction
◼ Why Data Mining?
◼ What Is Data Mining?

◼ A Multi-Dimensional View of Data Mining

◼ What Kind of Data Can Be Mined?

◼ What Kinds of Patterns Can Be Mined?

◼ What Technologies Are Used?


◼ What Kind of Applications Are Targeted?

◼ Major Issues in Data Mining


◼ A Brief History of Data Mining and Data Mining Society

◼ Summary
18
Multi-Dimensional View of Data Mining
◼ Data to be mined
◼ Database data (extended-relational, object-oriented, heterogeneous,

legacy), data warehouse, transactional data, stream, spatiotemporal,


time-series, sequence, text and web, multi-media, graphs & social
and information networks
◼ Knowledge to be mined (or: Data mining functions)
◼ Characterization, discrimination, association, classification,

clustering, trend/deviation, outlier analysis, etc.


◼ Descriptive vs. predictive data mining

◼ Multiple/integrated functions and mining at multiple levels

◼ Techniques utilized
◼ Data-intensive, data warehouse (OLAP), machine learning, statistics,

pattern recognition, visualization, high-performance, etc.


◼ Applications adapted
◼ Retail, telecommunication, banking, fraud analysis, bio-data mining,

stock market analysis, text mining, Web mining, etc.


19
Chapter 1. Introduction
◼ Why Data Mining?

◼ What Is Data Mining?

◼ A Multi-Dimensional View of Data Mining

◼ What Kind of Data Can Be Mined?

◼ What Kinds of Patterns Can Be Mined?


◼ What Technologies Are Used?
◼ What Kind of Applications Are Targeted?

◼ Major Issues in Data Mining

◼ A Brief History of Data Mining and Data Mining Society

◼ Summary
20
Data Mining: On What Kinds of Data?

◼ Database-oriented data sets and applications


◼ Relational database, data warehouse, transactional database
◼ Advanced data sets and advanced applications
◼ Data streams and sensor data
◼ Time-series data, temporal data, sequence data (incl. bio-sequences)
◼ Structure data, graphs, social networks and multi-linked data
◼ Object-relational databases
◼ Heterogeneous databases and legacy databases
◼ Spatial data and spatiotemporal data
◼ Multimedia database
◼ Text databases
◼ The World-Wide Web

21
Chapter 1. Introduction
◼ Why Data Mining?

◼ What Is Data Mining?

◼ A Multi-Dimensional View of Data Mining

◼ What Kind of Data Can Be Mined?

◼ What Kinds of Patterns Can Be Mined?


◼ What Technologies Are Used?
◼ What Kind of Applications Are Targeted?

◼ Major Issues in Data Mining

◼ A Brief History of Data Mining and Data Mining Society

◼ Summary
22
Data Mining Function: (1) Generalization

◼ Information integration and data warehouse construction


◼ Data cleaning, transformation, integration, and
multidimensional data model
◼ Data cube technology
◼ Scalable methods for computing (i.e., materializing)
multidimensional aggregates
◼ OLAP (online analytical processing)
◼ Multidimensional concept description: Characterization
and discrimination
◼ Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet region

23
Data Mining Function: (2) Association and
Correlation Analysis
◼ Frequent patterns (or frequent itemsets)
◼ What items are frequently purchased together in your
Walmart?
◼ Association, correlation vs. causality
◼ A typical association rule
◼ Diaper → Beer [0.5%, 75%] (support, confidence)
◼ Are strongly associated items also strongly correlated?
◼ How to mine such patterns and rules efficiently in large
datasets?
◼ How to use such patterns for classification, clustering,
and other applications?
24
Data Mining Function: (3) Classification

◼ Classification and label prediction

◼ Construct models (functions) based on some training examples
◼ Describe and distinguish classes or concepts for future prediction
◼ E.g., classify countries based on (climate), or classify cars
based on (gas mileage)
◼ Predict some unknown class labels
◼ Typical methods
◼ Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
based classification, logistic regression, …
◼ Typical applications:
◼ Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …

25
Data Mining Function: (4) Cluster Analysis

◼ Unsupervised learning (i.e., Class label is unknown)

◼ Group data to form new categories (i.e., clusters), e.g.,


cluster houses to find distribution patterns

◼ Principle: Maximizing intra-class similarity & minimizing


interclass similarity

◼ Many methods and applications

26
Data Mining Function: (5) Outlier Analysis

◼ Outlier analysis
◼ Outlier: A data object that does not
comply with the general behavior of the
data
◼ Noise or exception? ― One person’s
garbage could be another person’s
treasure
◼ Methods: by product of clustering or
regression analysis, …
◼ Useful in fraud detection, rare events
analysis

27
Time and Ordering: Sequential Pattern,
Trend and Evolution Analysis
◼ Sequence, trend and evolution analysis
◼ Trend, time-series, and deviation analysis: e.g.,

regression and value prediction


◼ Sequential pattern mining

◼ e.g., first buy digital camera, then buy large SD

memory cards
◼ Periodicity analysis

◼ Motifs and biological sequence analysis

◼ Approximate and consecutive motifs

◼ Similarity-based analysis

◼ Mining data streams


◼ Ordered, time-varying, potentially infinite, data streams
28
Structure and Network Analysis

◼ Graph mining
◼ Finding frequent subgraphs (e.g., chemical compounds), trees

(XML), substructures (web fragments)


◼ Information network analysis
◼ Social networks: actors (objects, nodes) and relationships (edges)

◼ e.g., author networks in CS, terrorist networks

◼ Multiple heterogeneous networks

◼ A person could be in multiple information networks: friends,

family, classmates, …
◼ Links carry a lot of semantic information: Link mining

◼ Web mining
◼ Web is a big information network: from PageRank to Google

◼ Analysis of Web information networks

◼ Web community discovery, opinion mining, usage mining, …

29
Evaluation of Knowledge
◼ Are all mined knowledge interesting?
◼ One can mine tremendous amount of “patterns” and knowledge
◼ Some may fit only certain dimension space (time, location, …)
◼ Some may not be representative, may be transient, …
◼ Evaluation of mined knowledge → directly mine only
interesting knowledge?
◼ Descriptive vs. predictive
◼ Coverage
◼ Typicality vs. novelty
◼ Accuracy
◼ Timeliness
◼ …
30
Chapter 1. Introduction
◼ Why Data Mining?
◼ What Is Data Mining?

◼ A Multi-Dimensional View of Data Mining

◼ What Kind of Data Can Be Mined?

◼ What Kinds of Patterns Can Be Mined?

◼ What Technologies Are Used?


◼ What Kind of Applications Are Targeted?

◼ Major Issues in Data Mining


◼ A Brief History of Data Mining and Data Mining Society

◼ Summary
31
Data Mining: Confluence of Multiple Disciplines

Machine Pattern Statistics


Learning Recognition

Applications Data Mining Visualization

Algorithm Database High-Performance


Technology Computing

32
Why Confluence of Multiple Disciplines?

◼ Tremendous amount of data


◼ Algorithms must be highly scalable to handle data sets as large as
terabytes
◼ High-dimensionality of data
◼ Micro-array may have tens of thousands of dimensions
◼ High complexity of data
◼ Data streams and sensor data
◼ Time-series data, temporal data, sequence data
◼ Structure data, graphs, social networks and multi-linked data
◼ Heterogeneous databases and legacy databases
◼ Spatial, spatiotemporal, multimedia, text and Web data
◼ Software programs, scientific simulations
◼ New and sophisticated applications
33
Chapter 1. Introduction
◼ Why Data Mining?

◼ What Is Data Mining?

◼ A Multi-Dimensional View of Data Mining

◼ What Kind of Data Can Be Mined?

◼ What Kinds of Patterns Can Be Mined?


◼ What Technologies Are Used?
◼ What Kind of Applications Are Targeted?

◼ Major Issues in Data Mining

◼ A Brief History of Data Mining and Data Mining Society

◼ Summary
34
Applications of Data Mining
◼ Web page analysis: from web page classification, clustering to PageRank &
HITS algorithms
◼ Collaborative analysis & recommender systems
◼ Basket data analysis to targeted marketing
◼ Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
◼ Data mining and software engineering (e.g., IEEE Computer, Aug. 2009
issue)
◼ From major dedicated data mining systems/tools (e.g., SAS, MS SQL-
Server Analysis Manager, Oracle Data Mining Tools) to invisible data
mining

35
Chapter 1. Introduction
◼ Why Data Mining?

◼ What Is Data Mining?

◼ A Multi-Dimensional View of Data Mining

◼ What Kind of Data Can Be Mined?

◼ What Kinds of Patterns Can Be Mined?


◼ What Technologies Are Used?
◼ What Kind of Applications Are Targeted?

◼ Major Issues in Data Mining

◼ A Brief History of Data Mining and Data Mining Society

◼ Summary
36
Major Issues in Data Mining (1)

◼ Mining Methodology
◼ Mining various and new kinds of knowledge
◼ Mining knowledge in multi-dimensional space
◼ Data mining: An interdisciplinary effort
◼ Boosting the power of discovery in a networked environment
◼ Handling noise, uncertainty, and incompleteness of data
◼ Pattern evaluation and pattern- or constraint-guided mining
◼ User Interaction
◼ Interactive mining
◼ Incorporation of background knowledge
◼ Presentation and visualization of data mining results

37
Major Issues in Data Mining (2)

◼ Efficiency and Scalability


◼ Efficiency and scalability of data mining algorithms
◼ Parallel, distributed, stream, and incremental mining methods
◼ Diversity of data types
◼ Handling complex types of data
◼ Mining dynamic, networked, and global data repositories
◼ Data mining and society
◼ Social impacts of data mining
◼ Privacy-preserving data mining
◼ Invisible data mining

38
Chapter 1. Introduction
◼ Why Data Mining?
◼ What Is Data Mining?

◼ A Multi-Dimensional View of Data Mining

◼ What Kind of Data Can Be Mined?

◼ What Kinds of Patterns Can Be Mined?

◼ What Technologies Are Used?

◼ What Kind of Applications Are Targeted?

◼ Major Issues in Data Mining

◼ A Brief History of Data Mining and Data Mining Society

◼ Summary
39
A Brief History of Data Mining Society

◼ 1989 IJCAI Workshop on Knowledge Discovery in Databases


◼ Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley,
1991)
◼ 1991-1994 Workshops on Knowledge Discovery in Databases
◼ Advances in Knowledge Discovery and Data Mining (U. Fayyad, G.
Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
◼ 1995-1998 International Conferences on Knowledge Discovery in Databases
and Data Mining (KDD’95-98)
◼ Journal of Data Mining and Knowledge Discovery (1997)
◼ ACM SIGKDD conferences since 1998 and SIGKDD Explorations
◼ More conferences on data mining
◼ PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM
(2001), etc.
◼ ACM Transactions on KDD starting in 2007
40
Conferences and Journals on Data Mining

◼ KDD Conferences ◼ Other related conferences


◼ ACM SIGKDD Int. Conf. on ◼ DB conferences: ACM SIGMOD,
Knowledge Discovery in
VLDB, ICDE, EDBT, ICDT, …
Databases and Data Mining (KDD)
◼ Web and IR conferences: WWW,
◼ SIAM Data Mining Conf. (SDM)
SIGIR, WSDM
◼ (IEEE) Int. Conf. on Data Mining
(ICDM) ◼ ML conferences: ICML, NIPS
◼ European Conf. on Machine ◼ PR conferences: CVPR,
Learning and Principles and ◼ Journals
practices of Knowledge Discovery
◼ Data Mining and Knowledge
and Data Mining (ECML-PKDD)
Discovery (DAMI or DMKD)
◼ Pacific-Asia Conf. on Knowledge
Discovery and Data Mining ◼ IEEE Trans. On Knowledge and
(PAKDD) Data Eng. (TKDE)
◼ Int. Conf. on Web Search and ◼ KDD Explorations
Data Mining (WSDM) ◼ ACM Trans. on KDD

41
Where to Find References? DBLP, CiteSeer, Google

◼ Data mining and KDD (SIGKDD: CDROM)


◼ Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
◼ Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
◼ Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)
◼ Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
◼ Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
◼ AI & Machine Learning
◼ Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
◼ Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems,
IEEE-PAMI, etc.
◼ Web and IR
◼ Conferences: SIGIR, WWW, CIKM, etc.
◼ Journals: WWW: Internet and Web Information Systems,
◼ Statistics
◼ Conferences: Joint Stat. Meeting, etc.
◼ Journals: Annals of statistics, etc.
◼ Visualization
◼ Conference proceedings: CHI, ACM-SIGGraph, etc.
◼ Journals: IEEE Trans. visualization and computer graphics, etc.
42
Chapter 1. Introduction
◼ Why Data Mining?
◼ What Is Data Mining?

◼ A Multi-Dimensional View of Data Mining

◼ What Kind of Data Can Be Mined?

◼ What Kinds of Patterns Can Be Mined?

◼ What Technologies Are Used?


◼ What Kind of Applications Are Targeted?

◼ Major Issues in Data Mining


◼ A Brief History of Data Mining and Data Mining Society

◼ Summary
43
Summary

◼ Data mining: Discovering interesting patterns and knowledge from


massive amount of data
◼ A natural evolution of database technology, in great demand, with
wide applications
◼ A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
◼ Mining can be performed in a variety of data
◼ Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
◼ Data mining technologies and applications
◼ Major issues in data mining

44
Recommended Reference Books
◼ S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertex and Semi-Structured Data. Morgan
Kaufmann, 2002
◼ R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000
◼ T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
◼ U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and
Data Mining. AAAI/MIT Press, 1996
◼ U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge
Discovery, Morgan Kaufmann, 2001
◼ J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed., 2011
◼ D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
◼ T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference,
and Prediction, 2nd ed., Springer-Verlag, 2009
◼ B. Liu, Web Data Mining, Springer 2006.
◼ T. M. Mitchell, Machine Learning, McGraw Hill, 1997
◼ G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
◼ P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
◼ S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
◼ I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, 2nd ed. 2005

45
Data Mining
— Rule Mining (ARM) [Ch 5] —

from text book PPTs by Prof. Jiawei Han


http://www-faculty.cs.uiuc.edu/~hanj/bk2/slidesindex.html

August 2, 2023 Data Mining: Concepts and Techniques 1


Session Outcomes

◼ describe fundamental concepts involved in the


process of ARM in large data sets
◼ explain the various stages involved in ARM
◼ explain the computational challenges and
appropriate solutions available in
implementing ARM
◼ evaluate, choose and implement appropriate ARM
solution for a given application domains

August 2, 2023 Data Mining: Concepts and Techniques 5


Mining Frequent Patterns, Association and
Correlations

◼ Basic concepts
◼ Efficient and scalable frequent itemset mining
methods
◼ Mining various kinds of association rules
◼ From association mining to correlation
analysis
◼ Constraint-based association mining
◼ Summary

August 2, 2023 Data Mining: Concepts and Techniques 6


What Is Frequent Pattern Analysis?
◼ Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
◼ First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
◼ Motivation: Finding inherent regularities in data
◼ What products were often purchased together?— Beer and diapers?!
◼ What are the subsequent purchases after buying a PC?
◼ What kinds of DNA are sensitive to this new drug?
◼ Can we automatically classify web documents?
◼ Applications
◼ Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
August 2, 2023 Data Mining: Concepts and Techniques 7
Why Is Freq. Pattern Mining Important?

◼ Discloses an intrinsic and important property of data sets


◼ Forms the foundation for many essential data mining tasks
◼ Association, correlation, and causality analysis
◼ Sequential, structural (e.g., sub-graph) patterns
◼ Pattern analysis in spatiotemporal, multimedia, time-
series, and stream data
◼ Classification: associative classification
◼ Cluster analysis: frequent pattern-based clustering

August 2, 2023 Data Mining: Concepts and Techniques 8


Basic Concepts: Frequent Patterns and
Association Rules
◼ Itemset X = {x1, …, xk}
◼ Find all the rules X → Y with minimum support and confidence
◼ support, s: probability that a transaction contains X ∪ Y
◼ confidence, c: conditional probability that a transaction having X
also contains Y, i.e., c = Sup(X ∪ Y) / Sup(X)

Transaction-id | Items bought
10             | A, B, D
20             | A, C, D
30             | A, D, E
40             | B, E, F
50             | B, C, D, E, F

(Figure: Venn diagram of customers who buy beer, buy diaper, or buy both)

Let supmin = 50%, confmin = 50%
Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A → D (60%, 100%)
D → A (60%, 75%)
August 2, 2023 Data Mining: Concepts and Techniques 9
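The 60%/100%/75% figures above can be checked with a few lines of Python. This is a minimal illustrative sketch, not from the slides; the helper names are made up.

# Minimal sketch: support and confidence for A -> D and D -> A
# over the five transactions shown above.
transactions = [
    {"A", "B", "D"},            # 10
    {"A", "C", "D"},            # 20
    {"A", "D", "E"},            # 30
    {"B", "E", "F"},            # 40
    {"B", "C", "D", "E", "F"},  # 50
]

def support(itemset):
    # fraction of transactions containing every item in `itemset`
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # conditional probability: support(lhs ∪ rhs) / support(lhs)
    return support(lhs | rhs) / support(lhs)

print(support({"A", "D"}))        # 0.6  -> 60% support for A -> D
print(confidence({"A"}, {"D"}))   # 1.0  -> 100% confidence for A -> D
print(confidence({"D"}, {"A"}))   # 0.75 -> 75% confidence for D -> A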
Scalable Methods for Mining Frequent Patterns

◼ The downward closure property of frequent patterns


◼ Any subset of a frequent itemset must be frequent

◼ If {beer, diaper, nuts} is frequent, so is {beer,

diaper}
◼ i.e., every transaction having {beer, diaper, nuts} also

contains {beer, diaper}

August 2, 2023 Data Mining: Concepts and Techniques 10


Scalable Methods for Mining Frequent Patterns

◼ Scalable mining methods: Three major approaches

1. Apriori (Agrawal & Srikant@VLDB’94)


2. Freq. pattern growth (FPgrowth—Han, Pei & Yin
@SIGMOD’00)
3. Vertical data format approach (Charm—Zaki & Hsiao
@SDM’02)

August 2, 2023 Data Mining: Concepts and Techniques 11


Apriori: A Candidate Generation-and-Test Approach

◼ Apriori pruning principle: If there is any itemset which is


infrequent, its superset should not be generated/tested!
(Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
◼ Method:
◼ Initially, scan DB once to get frequent 1-itemset
◼ Generate length (k+1) candidate itemsets from length k
frequent itemsets
◼ Test the candidates against DB
◼ Terminate when no frequent or candidate set can be
generated

August 2, 2023 Data Mining: Concepts and Techniques 12


The Apriori Algorithm—An Example
Supmin = 2

Database TDB:
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
         → L1: {A}:2, {B}:3, {C}:3, {E}:3   ({D} dropped, sup < 2)

2nd scan → C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
         → L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

3rd scan → C3: {B,C,E}
         → L3: {B,C,E}:2
August 2, 2023 Data Mining: Concepts and Techniques 13


Important Details of Apriori
◼ How to generate candidates?
◼ Step 1: self-joining Lk
◼ Step 2: pruning
◼ How to count supports of candidates?
◼ Example of Candidate-generation
◼ L3={abc, abd, acd, ace, bcd}
◼ Self-joining: L3*L3
◼ abcd from abc and abd
◼ acde from acd and ace
◼ Pruning:
◼ acde is removed because ade is not in L3
◼ C4={abcd}

August 2, 2023 Data Mining: Concepts and Techniques 14
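A minimal sketch of the self-join and prune steps described above, assuming frequent itemsets are kept as sorted tuples; the function name apriori_gen is used only for illustration.

# Minimal sketch: Apriori candidate generation (self-join L_k, then prune
# any candidate having an infrequent k-subset).
from itertools import combinations

def apriori_gen(Lk):
    k = len(next(iter(Lk)))
    candidates = set()
    for a in Lk:
        for b in Lk:
            # self-join: first k-1 items equal, last item of a < last item of b
            if a[:k-1] == b[:k-1] and a[k-1] < b[k-1]:
                candidates.add(a + (b[k-1],))
    # prune: every k-subset of a (k+1)-candidate must itself be frequent
    return {c for c in candidates
            if all(s in Lk for s in combinations(c, k))}

L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"),
      ("a","c","e"), ("b","c","d")}
print(apriori_gen(L3))   # {('a','b','c','d')}; acde is pruned because ade is not in L3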


Challenges of Frequent Pattern Mining

◼ Challenges
◼ Multiple scans of transaction database
◼ Huge number of candidates
◼ Tedious workload of support counting for candidates
◼ Improving Apriori: general ideas
◼ Reduce passes of transaction database scans
◼ Shrink number of candidates
◼ Facilitate support counting of candidates

August 2, 2023 Data Mining: Concepts and Techniques 15


Partition: Scan Database Only Twice

◼ Any itemset that is potentially frequent in DB must be


frequent in at least one of the partitions of DB
◼ Scan 1: partition database and find local frequent
patterns
◼ Scan 2: consolidate global frequent patterns
◼ A. Savasere, E. Omiecinski, and S. Navathe. An efficient
algorithm for mining association in large databases. In
VLDB’95
August 2, 2023 Data Mining: Concepts and Techniques 16
Sampling for Frequent Patterns

◼ Select a sample of original database, mine frequent


patterns within sample using Apriori
◼ Scan database once to verify frequent itemsets found in
sample, only borders of closure of frequent patterns are
checked
◼ Example: check abcd instead of ab, ac, …, etc.
◼ Scan database again to find missed frequent patterns
◼ H. Toivonen. Sampling large databases for association
rules. In VLDB’96
August 2, 2023 Data Mining: Concepts and Techniques 17
Find All Freq. Itemsets using Apriori

Support = 60 %
Conf. = 80 %

August 2, 2023 Data Mining: Concepts and Techniques


ARM1 18
Mining Frequent Patterns Without
Candidate Generation

◼ Grow long patterns from short ones using local


frequent items

◼ “abc” is a frequent pattern

◼ Get all transactions having “abc”: DB|abc

◼ “d” is a local frequent item in DB|abc → abcd is


a frequent pattern

August 2, 2023 Data Mining: Concepts and Techniques 19


Construct FP-tree from a Transaction Database

TID | Items bought             | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
300 | {b, f, h, j, o, w}       | {f, b}
400 | {b, c, k, s, p}          | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

min_support = 3

1. Scan DB once, find frequent 1-itemsets (single item patterns)
2. Sort frequent items in frequency descending order to get the f-list:
   F-list = f-c-a-b-m-p
3. Scan DB again and construct the FP-tree

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3
(each header entry links to all nodes of that item in the tree)

Resulting FP-tree (node:count):
{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1
August 2, 2023 Data Mining: Concepts and Techniques 20
FP-tree from a Transaction Database

Transaction ID | Items bought
100            | f, a, c, d, g, i, m, p
200            | a, b, c, f, l, m, o
300            | b, f, h, j, o, w
400            | b, c, k, s, p
500            | a, f, c, e, l, p, m, n

Step 1: Find frequent 1-itemsets (DB scan 1)
Item frequencies: f:4, a:3, c:4, m:3, p:3, b:3 (all other items are infrequent and dropped)

Step 2: Sort the frequent items in descending order of frequency to get the ordered f-list
F-list = f, c, a, b, m, p

Step 3: Create the root of the tree, labelled with “null”

Step 4: Scan each transaction in the DB (DB scan 2) and insert its frequent
items into the tree in F-list order; transactions sharing a prefix share the
same nodes, and the count of every node on the path is incremented.

TID | Items bought             | Frequent items in F-list order
100 | f, a, c, d, g, i, m, p   | f, c, a, m, p
200 | a, b, c, f, l, m, o      | f, c, a, b, m
300 | b, f, h, j, o, w         | f, b
400 | b, c, k, s, p            | c, b, p
500 | a, f, c, e, l, p, m, n   | f, c, a, m, p

(The original slides animate this insertion item by item for each of the five
transactions, updating the header-table counts at every step; the final tree
is the one shown on the previous slide, with header table
f:4, c:4, a:3, b:3, m:3, p:3.)
Benefits of the FP-tree Structure

◼ Completeness
◼ Preserve complete information for frequent pattern

mining
◼ Never break a long pattern of any transaction

◼ Compactness
◼ Reduce irrelevant info—infrequent items are gone

◼ Items in frequency descending order: the more

frequently occurring, the more likely to be shared


◼ Never be larger than the original database

◼ Compression ratio could be over 100

August 2, 2023 Data Mining: Concepts and Techniques 48
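A minimal sketch of the two-scan FP-tree construction walked through above; the class and variable names are illustrative, not from the slides, and ties between equally frequent items may be ordered differently than in the slides' F-list.

# Minimal sketch: build an FP-tree with two database scans.
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fptree(transactions, min_support):
    # Scan 1: count item frequencies and keep only frequent items
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_support}
    # F-list: frequent items in descending frequency (tie order may differ from the slides)
    flist = sorted(freq, key=lambda i: -freq[i])
    rank = {item: r for r, item in enumerate(flist)}

    root = FPNode(None, None)
    header = defaultdict(list)            # item -> list of its nodes (node links)
    # Scan 2: insert each transaction's frequent items in F-list order
    for t in transactions:
        items = sorted([i for i in t if i in rank], key=lambda i: rank[i])
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header, flist

db = [list("facdgimp"), list("abcflmo"), list("bfhjow"),
      list("bcksp"), list("afcelpmn")]
root, header, flist = build_fptree(db, min_support=3)
print(flist)                               # ['f', 'c', 'a', 'm', 'p', 'b'] here (slides use f, c, a, b, m, p)
print(sum(n.count for n in header["f"]))   # 4: total occurrences of f in the tree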


Find Patterns Having P From P-conditional Database

◼ Starting at the frequent item header table in the FP-tree


◼ Traverse the FP-tree by following the link of each frequent item p
◼ Accumulate all of transformed prefix paths of item p to form p’s
conditional pattern base

(FP-tree and header table as constructed above)

Conditional pattern bases:
item | conditional pattern base
c    | f:3
a    | fc:3
b    | fca:1, f:1, c:1
m    | fca:2, fcab:1
p    | fcam:2, cb:1


August 2, 2023 Data Mining: Concepts and Techniques 49
FP-Growth vs. Apriori: Scalability With the Support
Threshold

(Chart: run time in seconds vs. support threshold (%), data set T25I20D10K,
comparing D1 FP-growth runtime with D1 Apriori runtime. Apriori's runtime
grows sharply as the support threshold decreases, while FP-growth stays low.)

August 2, 2023 Data Mining: Concepts and Techniques 50


Why Is FP-Growth the Winner?

◼ Divide-and-conquer:
◼ decompose both the mining task and DB according to
the frequent patterns obtained so far
◼ leads to focused search of smaller databases
◼ Other factors
◼ no candidate generation, no candidate test
◼ compressed database: FP-tree structure
◼ no repeated scan of entire database
◼ basic ops—counting local freq items and building sub
FP-tree, no pattern search and matching

August 2, 2023 Data Mining: Concepts and Techniques 51


Mining by Exploring Vertical Data Format
◼ Vertical format: t(AB) = {T11, T25, …}
◼ tid-list: list of trans.-ids containing an itemset
Horizontal format:
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

Vertical format (tid-lists):
Itemset | Transaction-id list
A       | {10, 30}
B       | {20, 30, 40}
C       | {10, 20, 30}
D       | {10}
E       | {20, 30, 40}

(A tid-list can equivalently be stored as a bit vector, e.g. A = 1 0 1 0 over tids 10-40.)

◼ Deriving closed patterns based on vertical intersections


◼ t(X) = t(Y): X and Y always happen together
◼ t(X) ⊆ t(Y): a transaction having X always has Y

August 2, 2023 Data Mining: Concepts and Techniques 52


Mining by Exploring Vertical Data Format
◼ Using diffset to accelerate mining
◼ Only keep track of differences of tids
◼ t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
◼ Diffset (XY, X) = {T2}
◼ Eclat/MaxEclat (Zaki et al. @KDD’97), VIPER(P. Shenoy et
al.@SIGMOD’00), CHARM (Zaki & Hsiao@SDM’02)

Tid Items A B C D E
10 A, C, D 1 0 1 1 0
20 B, C, E 0 1 1 0 1
30 A, B, C, E 1 1 1 0 1
40 B, E 0 1 0 0 1

August 2, 2023 Data Mining: Concepts and Techniques 53
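A minimal sketch of the tid-list representation shown above: the support of an itemset is the size of the intersection of its members' tid-lists, and a diffset records only the tids lost when an item is added.

# Minimal sketch: tid-list intersection and diffsets for the table above.
tidlists = {
    "A": {10, 30}, "B": {20, 30, 40}, "C": {10, 20, 30},
    "D": {10},     "E": {20, 30, 40},
}

def t(itemset):
    # tid-list of an itemset = intersection of the tid-lists of its items
    return set.intersection(*(tidlists[i] for i in itemset))

print(t("BE"))              # {20, 30, 40} -> support 3
print(t("BCE"))             # {20, 30}     -> support 2
# diffset(XY, X) = t(X) - t(XY): adding C to BE loses only tid 40
print(t("BE") - t("BCE"))   # {40}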


Visualization of Association Rules: Plane Graph

August 2, 2023 Data Mining: Concepts and Techniques 54


Visualization of Association Rules: Rule Graph

August 2, 2023 Data Mining: Concepts and Techniques 55


Visualization of Association Rules
(SGI/MineSet 3.0)

August 2, 2023 Data Mining: Concepts and Techniques 56


Mining Various Kinds of Association Rules

1. Mining multilevel association

2. Miming multidimensional association

3. Mining quantitative association

4. Mining interesting correlation patterns

August 2, 2023 Data Mining: Concepts and Techniques


ARM2 57
Mining Multiple-Level Association Rules

◼ Items often form hierarchies


◼ Flexible support settings
◼ Items at the lower level are expected to have lower
support
◼ Exploration of shared multi-level mining (Agrawal &
Srikant@VLB’95, Han & Fu@VLDB’95)

Item hierarchy: Milk [support = 10%]; its children 2% Milk [support = 6%]
and Skim Milk [support = 4%]

Uniform support: min_sup = 5% at Level 1 and min_sup = 5% at Level 2
Reduced support: min_sup = 5% at Level 1 and min_sup = 3% at Level 2

August 2, 2023 Data Mining: Concepts and Techniques 58


Multi-level Association: Redundancy Filtering

◼ Some rules may be redundant due to “ancestor”


relationships between items.
◼ Example
◼ milk ⇒ wheat bread [support = 8%, confidence = 70%]

◼ 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]

◼ We say the first rule is an ancestor of the second rule.
◼ A rule is redundant if its support is close to the “expected”
value, based on the rule’s ancestor. For instance, if 2% milk accounts for
about a quarter of all milk sales, the expected support of the second rule is
roughly 8% × 1/4 = 2%; since this matches its actual support (and the
confidence is similar), the second rule adds no new information.

August 2, 2023 Data Mining: Concepts and Techniques 59


Mining Multi-Dimensional Association
◼ Single-dimensional rules:
buys(X, “milk”) ⇒ buys(X, “bread”)
◼ Multi-dimensional rules: ≥ 2 dimensions or predicates
◼ Inter-dimension assoc. rules (no repeated predicates)
age(X, ”19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
◼ Hybrid-dimension assoc. rules (repeated predicates)
age(X, ”19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
◼ Categorical Attributes: finite number of possible values, no
ordering among values
◼ Quantitative Attributes: numeric, implicit ordering among
values—discretization, clustering, and gradient approaches

August 2, 2023 Data Mining: Concepts and Techniques 60


Mining Quantitative Associations

◼ Techniques can be categorized by how numerical


attributes, such as age or salary are treated
1. Static discretization based on predefined concept
hierarchies
2. Dynamic discretization based on data distribution
(quantitative rules, e.g., Agrawal & Srikant@SIGMOD96)
3. Clustering: Distance-based association (e.g., Yang &
Miller@SIGMOD97)
◼ one dimensional clustering then association

August 2, 2023 Data Mining: Concepts and Techniques 61


Quantitative Association Rules
◼ Proposed by Lent, Swami and Widom ICDE’97
◼ Numeric attributes are dynamically discretized
◼ Such that the confidence or compactness of the rules
mined is maximized
◼ 2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat
◼ Cluster adjacent association rules to form general rules using a 2-D grid
(e.g., a 2-D grid over age and income of customers who buy HDTVs)
◼ Example
age(X, ”34-35”) ∧ income(X, ”30-50K”)
⇒ buys(X, ”high resolution TV”)

August 2, 2023 Data Mining: Concepts and Techniques 62


Interestingness Measure: Correlations (Lift)
◼ play basketball ⇒ eat cereal [40%, 66.7%] is misleading
◼ The overall % of students eating cereal is 75% > 66.7%.
◼ play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate,
although with lower support and confidence
◼ Measure of dependent/correlated events: lift

lift(A, B) = P(A ∪ B) / ( P(A) · P(B) )
(the probability that a transaction contains both A and B, divided by the
product of their individual probabilities)

             Basketball | Not basketball | Sum (row)
Cereal       2000       | 1750           | 3750
Not cereal   1000       | 250            | 1250
Sum (col.)   3000       | 2000           | 5000

lift(B, C)  = (2000/5000) / ( (3000/5000) · (3750/5000) ) = 0.89
lift(B, ¬C) = (1000/5000) / ( (3000/5000) · (1250/5000) ) = 1.33

August 2, 2023 Data Mining: Concepts and Techniques 63
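The two lift values above can be reproduced directly from the contingency table; a minimal sketch:

# Minimal sketch: lift computed from the 2x2 contingency table above.
N = 5000

def lift(p_ab, p_a, p_b):
    return p_ab / (p_a * p_b)

p_basketball = 3000 / N
p_cereal     = 3750 / N
p_not_cereal = 1250 / N

print(round(lift(2000 / N, p_basketball, p_cereal), 2))      # 0.89 -> negatively correlated
print(round(lift(1000 / N, p_basketball, p_not_cereal), 2))  # 1.33 -> positively correlated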


Interestingness Measures

August 2, 2023 Data Mining: Concepts and Techniques 64


Constraint-based (Query-Directed) Mining

◼ Finding all the patterns in a database autonomously? —


unrealistic!
◼ The patterns could be too many but not focused!
◼ Data mining should be an interactive process
◼ User directs what to be mined using a data mining
query language (or a graphical user interface)
◼ Constraint-based mining
◼ User flexibility: provides constraints on what to be
mined
◼ System optimization: explores such constraints for
efficient mining—constraint-based mining
August 2, 2023 Data Mining: Concepts and Techniques 65
Constraints in Data Mining

◼ Knowledge type constraint:


◼ classification, association, etc.

◼ Data constraint — using SQL-like queries


◼ find product pairs sold together in stores in Chicago in
Dec.’02
◼ Dimension/level constraint
◼ in relevance to region, price, brand, customer category

◼ Rule (or pattern) constraint


◼ small sales (price < $10) triggers big sales (sum >
$200)
◼ Interestingness constraint
◼ strong rules: min_support ≥ 3%, min_confidence ≥ 60%
August 2, 2023 Data Mining: Concepts and Techniques 66
The Apriori Algorithm — Example
Database D:
TID | Items
100 | 1, 3, 4
200 | 2, 3, 5
300 | 1, 2, 3, 5
400 | 2, 5

Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
       → L1: {1}:2, {2}:3, {3}:3, {5}:3

Scan D → C2: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
       → L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

Scan D → C3: {2 3 5}
       → L3: {2 3 5}:2
August 2, 2023 Data Mining: Concepts and Techniques 67
Naïve Algorithm: Apriori + Constraint
The same Apriori run as on the previous slide (same database D and the same
C1/L1, C2/L2, C3/L3), with a constraint attached:

Constraint: Sum{S.price} < 5

In the naïve approach, Apriori first mines all frequent itemsets and the
constraint is only checked afterwards.
August 2, 2023 Data Mining: Concepts and Techniques 68
Frequent-Pattern Mining: Summary

◼ Frequent pattern mining—an important task in data mining


◼ Scalable frequent pattern mining methods
◼ Apriori (Candidate generation & test)
◼ Projection-based (FPgrowth, ...)
◼ Vertical format approach (VIPER, ...)
▪ Mining a variety of rules and interesting patterns
▪ Constraint-based mining

August 2, 2023 Data Mining: Concepts and Techniques


ARM3 69
Time Series Mining &
Forecasting

Dr. Thanuja Ambegoda

1
Lecture outline
1. Introduction to TS
2. Basic visualization techniques
3. Trend analysis and seasonality detection
4. Cyclic pattern analysis
5. Anomaly detection in TS
6. Stationarity in TS
7. Traditional TS forecasting methods
8. Advanced TS forecasting methods

2
Part 1.
Introduction to time series analysis and
mining

3
Time series everywhere!

4
What is a time series?
● A time series is an ordered
sequence of values of a variable
at equally spaced time intervals
● Typically, the mean and the
standard deviation can vary in a
time series
● Understanding time series data
helps predict future events based
on historical patterns

(Figures: example series illustrating trend and volatility)

5
TS analysis, forecasting & mining
● Time series analysis
○ Techniques for identifying and analysing temporal dependencies in sequential data
● Time series mining
○ What has happened?
● Time series forecasting
○ What’s going to happen next?

6
Applications
● Finance: sales forecasting, inventory analysis, stock market analysis, price estimation
● Weather: temperature, climate change, seasonal shifts, rain, wind, snow, ...
● Industry: usage, anomaly, maintenance
● Healthcare: patient occupancy in hospitals, disease outbreaks, patient monitoring
● Insurance: sales forecasting, insurance benefits awarded
7
Difference between a time series and a random
variable with known mean and standard deviation

We might be able to build a distribution that explains all the data points in the time series. But
does that mean the distribution can be sampled to produce the time series?
8
Dependence on the past
● Memory of a TS
○ A few points having a similar trend
○ But there might be no overall trend
● Actual values tend to depend on the
value at the previous time point
○ Autocorrelation

9
Time Series Components

10
Time series components
Level: The average value in the series.

Trend: The increasing or decreasing value in the series.

Cycle: Phenomena that happen across seasonal periods, without a fixed


period (eg. economic recession)

Seasonality: The repeating short-term cycle in the series.

Noise: The random variation in the series. (unsystematic)

Reference

11
Trend
● Refers to the increasing or decreasing
value in the series.
● Can be linear or nonlinear.
● Methods to detect:
○ Moving Averages
○ Polynomial Fitting

12
Seasonal component

Periodic ups and downs. Period can be a year, several years, months, weeks or days.
13
How to measure/detect seasonality
1. Autocorrelation plots
2. FFT (Fast Fourier Transform)
3. Seasonal decomposition

14
Autocorrelation Plots
● Autocorrelation, often denoted as R(k), measures the correlation between
a time series and its lagged version.
● A significant spike at a particular lag indicates seasonality at that lag.

If there's seasonality present at a particular lag k, the autocorrelation plot will show a significant spike at
that lag. For example, if there's a yearly seasonality in monthly data, we would expect a significant
autocorrelation at lag 12. 15
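A minimal sketch, using the statsmodels acf function on a synthetic monthly series, of how a lag-12 spike shows up for yearly seasonality:

# Minimal sketch: the autocorrelation at lag 12 is large for a series
# with a 12-month seasonal pattern.
import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(0)
months = np.arange(120)
y = 10 + np.sin(2 * np.pi * months / 12) + rng.normal(0, 0.2, 120)

r = acf(y, nlags=24)
print(round(r[12], 2))   # large and positive: yearly seasonality
print(round(r[6], 2))    # strongly negative: half a cycle out of phase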
Decomposing TS into components
Often this is done to help improve understanding of the time series, but it can
also be used to improve forecast accuracy.

● Additive models (linear)


○ y(t) = Level + Trend + Seasonality + Noise
● Multiplicative models (nonlinear)
○ y(t) = Level * Trend * Seasonality * Noise

Real-world problems are messy and noisy.

● There may be additive and multiplicative components.


● There may be an increasing trend followed by a decreasing trend.
● There may be non-repeating cycles mixed in with the repeating seasonality components.

16
Decomposing TS into components

17
Original

Seasonal
component

Trend-cycle
component

Residual
component

18
Seasonal and Trend Decomposition using LOESS (STL)

● LOESS: Locally Weighted Scatterplot


Smoothing
● Alternatively LOWESS, which means
Locally Weighted Regression
● Non-parametric smoothing technique.
● Fits smooth curves using locally weighted
polynomial regression.
● Each point estimated based on subset of its
neighbors.
● Close neighbors have more influence via
weighted least squares.
● Ensures gradual transitions, yielding a
smooth curve.

Reference 19
Steps in STL
1. Trend extraction
a. Applies LOESS smoother to time series.
b. Captures the central path or growth trajectory.
c. Not influenced by seasonality.
2. Detrending & Seasonal Extraction
a. Remove trend from original series.
b. Apply LOESS to detrended series to capture seasonality.
c. Seasonal pattern can be of any type: monthly, quarterly, etc.
3. Extract Residual Component
a. Residuals = Original Series - Trend - Seasonality.
b. Captures randomness and potential anomalies.
c. Helpful for model diagnostics.

20
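A minimal sketch of the STL procedure described above, using statsmodels' STL class on a synthetic monthly series (the series itself is made up for illustration):

# Minimal sketch: STL decomposition into trend, seasonal and residual components.
import numpy as np, pandas as pd
from statsmodels.tsa.seasonal import STL

idx = pd.date_range("2015-01", periods=96, freq="MS")
y = pd.Series(np.linspace(50, 80, 96)                       # trend
              + 5 * np.sin(2 * np.pi * np.arange(96) / 12)  # yearly seasonality
              + np.random.default_rng(1).normal(0, 1, 96),  # noise
              index=idx)

res = STL(y, period=12).fit()
# res.trend, res.seasonal, res.resid are the three components;
# y - res.seasonal is the seasonally adjusted series
print((y - (res.trend + res.seasonal + res.resid)).abs().max())  # ~0: exact additive decomposition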
Seasonally adjusted data

If the variation due to seasonality is


not of primary interest, the seasonally
adjusted series can be useful. For
example, monthly unemployment
data are usually seasonally adjusted
in order to highlight variation due to
the underlying state of the economy
rather than the seasonal variation.

21
TS decomposition using statsmodel package
Naive, or classical, decomposition method. Requires you to know if the model
is additive or multiplicative

Statsmodels python package 22
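A minimal sketch of the classical decomposition mentioned above, using statsmodels' seasonal_decompose on a synthetic additive series (you must choose 'additive' or 'multiplicative' yourself):

# Minimal sketch: classical decomposition with seasonal_decompose.
import numpy as np, pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2018-01", periods=48, freq="MS")
y = pd.Series(20 + 0.5 * np.arange(48)
              + 3 * np.sin(2 * np.pi * np.arange(48) / 12),
              index=idx)

result = seasonal_decompose(y, model="additive", period=12)
print(result.trend.dropna().head())    # centred moving-average trend (NaN at the ends)
print(result.seasonal[:12].round(2))   # one full seasonal cycle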


23
Cycles
● Oscillations that aren't fixed in a seasonal pattern.
● Typically longer duration than seasonality.
● Can be caused by economic conditions, large-scale events, etc.

Why detect cycles?

● Improve forecasting accuracy.
● Uncover underlying long-term patterns.


● Make informed strategic decisions (e.g., economic, business planning).
● Separate cyclical effects from irregular noise.

24
Detecting cycles
1. Spectral analysis
a. Transform time series into frequency domain using Fourier Transform.
b. Identify dominant frequencies (apart from seasonal frequencies).
c. The peaks in spectral density indicate cyclical components.

25
Detecting cycles
2. Wavelet transform

● Decompose time series into different frequency bands.


● Detect cycles of various lengths.
● Suitable for non-stationary time series

26
TS and regression
● Focused on identifying underlying trends and patterns
● Mathematically modeling / describing those patterns
● Predict/forecast future values

27
TS and regression
● Regression can model the trend over
time
○ Linear regression for mean demand. Function
of economic growth?
○ Within year cycles - seasonal variability?
○ Y = b1x1 + b2x2 + b3x3 + …
● TS allows you to model the process
without knowing the underlying causes

28
29
Signal and Noise

30
Signal and noise

31
Signal and noise: some definitions
● Statistical moments
○ Mean and standard deviation
● Stationary vs non-stationary
○ Trends in mean and/or standard deviation
○ Stationary - doesn’t depend on the time of observation
● Seasonality
○ Periodic patterns
● Autocorrelation
○ Degree to which the time series values in period t are related to time series values in
period t-1, t-2, …

32
Stationarity of a Time Series

33
Stationary TS
TS whose properties do not depend on the time of observation

Constant variance and level

No seasonality

No trend

34
Stationarity check
1. Visual inspection
2. Seasonal-Trend decomposition
3. Summary statistics at random partitions
4. Statistical tests

35
Stationarity check: Visual inspection

Mean
Variance
Seasonality

36
Stationarity check: Seasonal-Trend decomposition

Statsmodels python package 37


Stationarity check: summary statistics
Summary statistics at random partitions

38
Stationarity check: statistical tests
Augmented Dickey-Fuller test

TS is non-stationary!
39
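A minimal sketch of the Augmented Dickey-Fuller test using statsmodels' adfuller on a simulated random walk (non-stationary by construction):

# Minimal sketch: ADF test; a p-value above 0.05 means we cannot reject
# the null hypothesis of non-stationarity (a unit root).
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
random_walk = np.cumsum(rng.normal(size=500))   # non-stationary by construction

stat, pvalue, *_ = adfuller(random_walk)
print(round(pvalue, 3))   # typically large (> 0.05) -> TS is non-stationary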
Adjustments
1. Fixing non-constant variance
2. Trend removal
3. Fixing non-constant variance + trend removal

40
Fixing non-constant variance

41
Fixing non-constant variance

To reverse the boxcox transform:


from scipy.special import boxcox, inv_boxcox 42
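A minimal sketch of the variance-stabilising Box-Cox transform and its inverse. Note that scipy.stats.boxcox (used here) also estimates the lambda parameter, while the scipy.special version imported above expects lambda to be supplied:

# Minimal sketch: Box-Cox transform and its inverse.
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

rng = np.random.default_rng(0)
y = np.exp(0.02 * np.arange(200) + rng.normal(0, 0.1, 200))  # variance grows with level

y_bc, lam = boxcox(y)           # scipy.stats.boxcox also estimates lambda
y_back = inv_boxcox(y_bc, lam)  # reverse the transform
print(np.allclose(y, y_back))   # True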
Trend removal

43
Fixing non-constant variance + trend removal

44
Part 2.
Time series forecasting

45
46
Simple forecasting methods
1. Average method
2. Naive method
3. Seasonal naive method
4. Drift method

Homework: Read about simple forecasting methods

These methods are covered for the sake of completeness of fundamental


theory, and can be used as primary benchmarks.

Refer to the previous lecture for better performing forecasting methods.

47
Time series forecasting
Estimating future values of a time series

1. Exponential smoothing
2. Autoregressive methods
a. AR
b. MA
c. ARMA
d. ARIMA
e. SARIMA

48
Exponential smoothing
Simple Exponential Smoothing (SES) (Brown, 1956) is suitable for forecasting
data with no clear trend or seasonal pattern.

● Prediction is a weighted linear sum of recent past observations


● Weights decline exponentially, making observations much earlier in the
past less relevant

49
Exponential smoothing
● Double Exponential Smoothing (Holt, 1957)
○ Smoothing for level + trend
● Triple Exponential Smoothing (Holt-Winter, 1960)
○ Smoothing for trend + seasonality + level

50
Exponential smoothing - Python
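The slide's code is embedded as an image; as a hedged sketch, a statsmodels version might look like the following (data and smoothing parameters are illustrative):

import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, ExponentialSmoothing

idx = pd.date_range('2018-01-01', periods=60, freq='MS')
y = pd.Series(50 + 0.5*np.arange(60) + 8*np.sin(2*np.pi*np.arange(60)/12)
              + np.random.normal(0, 1, 60), index=idx)

# Simple exponential smoothing (no trend / seasonality), fixed alpha
ses_fit = SimpleExpSmoothing(y).fit(smoothing_level=0.3, optimized=False)

# Holt-Winters (triple) smoothing: level + trend + seasonality
hw_fit = ExponentialSmoothing(y, trend='add', seasonal='add',
                              seasonal_periods=12).fit()

print(ses_fit.forecast(6))
print(hw_fit.forecast(12))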

51
Exponential smoothing - Python

52
Autoregressive model (AR)
In an autoregression model, we forecast the variable of interest using a linear
combination of past values of the variable. The term autoregression indicates
that it is a regression of the variable against itself.

εt is the white noise.

This is an autoregressive model of order p - AR(p)
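The equation on the slide is an image; the standard AR(p) form it refers to is:

$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t$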

53
Autoregressive model

54
Autoregressive model - Python
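A minimal sketch with statsmodels' AutoReg (the slide's own code is an image; the lag order and simulated data are illustrative):

import numpy as np
from statsmodels.tsa.ar_model import AutoReg

# Simulate an AR(2) process: y_t = 0.6*y_{t-1} - 0.3*y_{t-2} + e_t
np.random.seed(0)
e = np.random.normal(size=500)
y = np.zeros(500)
for t in range(2, 500):
    y[t] = 0.6*y[t-1] - 0.3*y[t-2] + e[t]

model = AutoReg(y, lags=2).fit()
print(model.params)                                  # intercept and AR coefficients
print(model.predict(start=len(y), end=len(y) + 9))   # 10-step forecast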

55
Moving average model (MA)
Rather than using past values of the forecast variable in a regression, a moving
average model uses past forecast errors in a regression-like model.

εt is the white noise.

This is a moving average model of order q - MA(q)
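The equation on the slide is an image; the standard MA(q) form it refers to is:

$y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}$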

56
Moving average model

57
Moving average model
● Parameter estimation:
○ Maximum likelihood estimation
○ Non-linear least squares estimation

58
Moving average model - python

59
Autoregressive moving average model (ARMA)
● Combines AR and MA models
○ AR accounts for autocorrelations in the TS (deterministic)
○ MA accounts for smoothing periodic fluctuations (stochastic)

Model fitting: Maximum likelihood estimation (Box-Jenkins method)


60
Autoregressive integrated moving average (ARIMA)
ARIMA = ARMA + trend removal

61
Seasonal ARIMA (SARIMA)
SARIMA = ARMA + trend removal + seasonal adjustment

62
SARIMA - Python
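A hedged sketch using statsmodels' SARIMAX (the slide's code is an image; the (p,d,q)(P,D,Q,s) orders below are placeholders, not recommendations):

import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

idx = pd.date_range('2015-01-01', periods=120, freq='MS')
y = pd.Series(100 + 0.8*np.arange(120) + 15*np.sin(2*np.pi*np.arange(120)/12)
              + np.random.normal(0, 2, 120), index=idx)

# ARIMA(p,d,q) x seasonal (P,D,Q,s): d handles the trend, D and s the seasonality
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
results = model.fit(disp=False)

print(results.aic)
print(results.forecast(steps=12))   # one year ahead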

63
Comparison of model assumptions

64
Model selection (hyperparameter tuning)
● Visualization
○ MA - Autocorrelation function (ACF) plot
○ AR - Partial autocorrelation function (PACF) plot
● Grid search
○ Test for different model orders exhaustively
○ A model that is more accurate and less complex is preferred
■ More accurate - fits the data
■ Less complex - smaller number of parameters

65
Autocorrelation plot (ACF)

66
Partial autocorrelation (PACF) plot
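The ACF and PACF plots on these two slides are images; a small sketch of how they are typically produced (assumes matplotlib and statsmodels; the AR(1) data is illustrative):

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

np.random.seed(1)
e = np.random.normal(size=500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.7*y[t-1] + e[t]          # AR(1) example

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(y, lags=40, ax=axes[0])      # gradual decay; a sharp cut-off would suggest q for MA
plot_pacf(y, lags=40, ax=axes[1])     # sharp cut-off after lag 1 suggests p = 1
plt.show()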

67
References
● Forecasting: Principles & Practice (2nd edition)

68
Time Series Mining &
Forecasting - 2

Dr. Thanuja Ambegoda

1
Time Series Part 2 - Lecture Outline

Time series forecasting (ctd)


1. Autocorrelation revisited
2. Forecasting methods
3. Evaluation of time series forecasting

Time series mining


4. Time series mining tasks
5. Time series compression
6. Time series similarity measures
7. Motifs, matrix profile, discord, anomalies
8. Time series representations

2
Correlation
Correlation is a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate).

3
Autocorrelation

4
Autocorrelation plot (ACF)

5
Partial autocorrelation (PACF) plot

6
7
Simple forecasting methods
1. Average method
2. Naive method
3. Seasonal naive method
4. Drift method

Homework: Read about simple forecasting methods

These methods are covered for the sake of completeness of fundamental theory, and can be used as primary benchmarks.

Refer to the previous lecture for better performing forecasting methods.

8
Time series forecasting
Estimating future values of a time series

1. Exponential smoothing
2. Autoregressive methods
a. AR
b. MA
c. ARMA
d. ARIMA
e. SARIMA

9
Exponential smoothing
Simple Exponential Smoothing (SES) (Brown, 1956) is suitable for forecasting
data with no clear trend or seasonal pattern.

● Prediction is a weighted linear sum of recent past observations


● Weights decline exponentially, making observations much earlier in the
past less relevant

10
Exponential smoothing
● Double Exponential Smoothing (Holt, 1957)
○ Smoothing for level + trend
● Triple Exponential Smoothing (Holt-Winter, 1960)
○ Smoothing for trend + seasonality + level

11
Exponential smoothing - Python

12
Exponential smoothing - Python

13
Autoregressive model (AR)
In an autoregression model, we forecast the variable of interest using a linear
combination of past values of the variable. The term autoregression indicates
that it is a regression of the variable against itself.

εt is the white noise.

This is an autoregressive model of order p - AR(p)

14
Autoregressive model

15
Autoregressive model - Python

16
Moving average model (MA)
Rather than using past values of the forecast variable in a regression, a moving
average model uses past forecast errors in a regression-like model.

εt is the white noise.

This is a moving average model of order q - MA(q)

The MA component models the relationship between an observation and the white noise errors of previous observations.

17
Moving average model

18
Moving average model
● Parameter estimation:
○ Maximum likelihood estimation
○ Non-linear least squares estimation

19
Moving average model - python

20
Autoregressive moving average model (ARMA)
● Combines AR and MA models
○ AR accounts for autocorrelations in the TS (deterministic)
○ MA accounts for smoothing fluctuations (stochastic)

Model fitting: Maximum likelihood estimation (Box-Jenkins method)


21
Autoregressive integrated moving average (ARIMA)
ARIMA = ARMA + trend removal

22
Seasonal ARIMA (SARIMA)
SARIMA = ARMA + trend removal + seasonal adjustment

23
24
SARIMA - Python

25
Comparison of model assumptions

26
Evaluation

● Residual plots:
○ ACF, PACF residual plots, histogram or density plot of residuals

27
Information Criteria
● AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion)
are used to compare and select models with different parameters.
● They take into account both the likelihood of the model and the number of
parameters used.
● The model with the lowest AIC or BIC is usually preferred.
● These criteria are often used to choose the order of ARIMA models (i.e., the
values of p, d, and q).
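A rough sketch of AIC-based order selection for ARIMA (not from the slides; the search ranges and toy data are placeholders):

import itertools
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

np.random.seed(2)
y = np.cumsum(np.random.normal(size=300))       # toy non-stationary series

best = (None, np.inf)
for p, d, q in itertools.product(range(3), range(2), range(3)):
    try:
        fit = ARIMA(y, order=(p, d, q)).fit()
        if fit.aic < best[1]:
            best = ((p, d, q), fit.aic)
    except Exception:
        continue                                 # some orders may fail to converge

print("order with lowest AIC:", best)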

28
Model selection (hyperparameter tuning)
● Visualization
○ MA - Autocorrelation function (ACF) plot
○ AR - Partial autocorrelation function (PACF) plot
● Grid search
○ Test for different model orders exhaustively
○ A model that is more accurate and less complex is preferred
■ More accurate - fits the data
■ Less complex - smaller number of parameters

Based on the error metrics, residual analysis, and overfitting checks, select the model that
seems most likely to generalize well to new data. This is typically the model with low error on
the validation/test set, whose residuals are closest to white noise, and has lower AIC or BIC
values.

29
Time Series Mining

30
Time series mining tasks
1. Indexing (query by content)
2. Clustering
3. Classification
4. Prediction
5. Summarization
6. Anomaly detection
7. Segmentation

31
Indexing (query by content)
● Whole matching: a query time series is matched against a database of
individual time series to identify the ones similar to the query
● Subsequence Matching: a short query subsequence time series is
matched against longer time series by sliding it along the longer
sequence, looking for the best matching location.
● Brute force technique (linear or sequential matching)
○ Very slow
● Somewhat more efficient method:
○ Store a compressed version of the TS and use it for an initial comparison (lower bound)
using linear scan
● Even more efficient method:
○ Use an index structure that clusters similar sequences together

32
Indexing
Two types

1. Vector-based:
a. Original sequences are compressed using
dimensionality reduction
b. Resulting sequences grouped in the new
dimensions
c. Hierarchical (e.g. R-tree) or non-hierarchical
d. Performance deteriorates when dim > 5
2. Metric-based:
a. Superior in performance
b. Works for higher dimensionalities (dim < 30)
c. Require only distances between objects
d. Clusters objects using the distance between objects

33
Summarization
● Text/graphical descriptions (summaries)
● Anomaly detection and motif discovery are special cases of summarization
● Some popular approaches for visualizing massive time series datasets
include TimeSearcher, Calendar-Based Visualization, Spiral and VizTree
● Read section 3.5 of this reference

34
Time series compression
TS compression helps reduce storage costs and allows efficient indexing

● Delta encoding
● Delta of delta encoding
● Simple 8b
● Run Length Encoding (RLE)

Reference
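A toy sketch of two of these schemes in Python (illustrative only, not a production encoder):

import numpy as np

x = np.array([100, 101, 101, 101, 104, 104, 110])

# Delta encoding: store the first value plus successive differences
deltas = np.diff(x, prepend=x[0])        # [0, 1, 0, 0, 3, 0, 6]
restored = x[0] + np.cumsum(deltas)      # recovers x exactly

# Run Length Encoding (RLE): store (value, run length) pairs
def rle(values):
    out, run = [], 1
    for prev, cur in zip(values, values[1:]):
        if cur == prev:
            run += 1
        else:
            out.append((prev, run)); run = 1
    out.append((values[-1], run))
    return out

print(rle(list(x)))   # [(100, 1), (101, 3), (104, 2), (110, 1)]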

35
Similarity measures
Dissimilarity measures
Distance measures

36
Time series similarity measures
Similarity is the inverse of distance

They play an important role in time series mining tasks

1. Euclidean distance and Lp norms


2. Dynamic time warping
3. Longest common subsequence similarity
4. Probabilistic similarity measures
5. General transformations

37
Euclidean distance and Lp norms
For two sequences of length n,
consider each of them as a point
in n dimensional Euclidean space

Dissimilarity between sequences C and Q:

D(C,Q) = Lp(C,Q) = ( Σi=1..n |ci − qi|^p )^(1/p)

p = 2 reduces to Euclidean distance (p = 1 is Manhattan distance)
Widely used
Only works when the two sequences are of the same length
38
Euclidean distance between normalized sequences

To normalize C, replace each ci with ci’ = (ci − μ(C)) / σ(C), where μ(C) and σ(C) are the mean and standard deviation of C.
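A short sketch combining z-normalisation with the Euclidean (L2) distance (illustrative; not from the slides):

import numpy as np

def z_normalize(c):
    return (c - c.mean()) / c.std()

def lp_distance(c, q, p=2):
    # Sequences must be the same length n
    return np.sum(np.abs(c - q) ** p) ** (1.0 / p)

c = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
q = np.array([10.0, 20.0, 30.0, 40.0, 50.0])        # same shape, different scale

print(lp_distance(c, q))                             # large raw distance
print(lp_distance(z_normalize(c), z_normalize(q)))   # ~0 after normalisation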

39
Dynamic Time Warping (DTW)

● Useful when the two sequences have a similar shape but don’t line up
perfectly (i.e. out of phase) in the x axis
○ Euclidean distance doesn’t capture the similarity well
● DTW allows acceleration/deceleration of the signal (warping) along the
time (x) axis
● Example application: speech processing

40
Dynamic Time Warping

Basic idea
● Consider X = x1, x2, …, xn , and Y = y1, y2, …, yn
● We are allowed to extend each sequence by repeating
elements
● Euclidean distance now calculated between the extended
sequences X’ and Y’
● Matrix M, where mij = d(xi, yj)

● Uses bottom-up dynamic programming approach


○ Basic implementation O(mn)
○ With a warping window w, O(nw)
● DTW cost/distance: the cumulative point-to-point distance along the optimal warping path that aligns the two sequences
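A compact O(mn) dynamic-programming sketch of DTW (illustrative; real applications typically use an optimised library and a warping window):

import numpy as np

def dtw_distance(x, y):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])          # local distance d(xi, yj)
            D[i, j] = cost + min(D[i - 1, j],        # repeat an element of y
                                 D[i, j - 1],        # repeat an element of x
                                 D[i - 1, j - 1])    # advance both
    return D[n, m]

a = np.sin(np.linspace(0, 2*np.pi, 60))
b = np.sin(np.linspace(0, 2*np.pi, 80) - 0.5)        # similar shape, out of phase
print(dtw_distance(a, b))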

Reference 41
Motifs, anomalies, discords, matrix profiles

42
43
44
Time series representations

● Refer to section 4 of this reference

45
References and additional reading
● Book chapter on Time Series Mining
● Prophet by Meta

46
Data Mining:
Concepts and Techniques
(3rd ed.)

— Chapter 10 —

Jiawei Han, Micheline Kamber, and Jian Pei


University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
1
Chapter 10. Cluster Analysis: Basic Concepts and
Methods

 Cluster Analysis: Basic Concepts

 Partitioning Methods

 Hierarchical Methods

 Density-Based Methods

 Grid-Based Methods

 Evaluation of Clustering

 Summary
2
What is Cluster Analysis?
 Cluster: A collection of data objects
 similar (or related) to one another within the same group

 dissimilar (or unrelated) to the objects in other groups

 Cluster analysis (or clustering, data segmentation, …)


 Finding similarities between data according to the

characteristics found in the data and grouping similar


data objects into clusters
 Unsupervised learning: no predefined classes (i.e., learning
by observations vs. learning by examples: supervised)
 Typical applications
 As a stand-alone tool to get insight into data distribution

 As a preprocessing step for other algorithms

3
Clustering for Data Understanding and
Applications
 Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
 Information retrieval: document clustering
 Land use: Identification of areas of similar land use in an earth
observation database
 Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
 City-planning: Identifying groups of houses according to their house
type, value, and geographical location
 Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults
 Climate: understanding earth climate, find patterns of atmospheric
and ocean
 Economic Science: market research
4
Clustering as a Preprocessing Tool (Utility)

 Summarization:
 Preprocessing for regression, PCA, classification, and
association analysis
 Compression:
 Image processing: vector quantization
 Finding K-nearest Neighbors
 Localizing search to one or a small number of clusters
 Outlier detection
 Outliers are often viewed as those “far away” from any
cluster

5
Quality: What Is Good Clustering?

 A good clustering method will produce high quality


clusters
 high intra-class similarity: cohesive within clusters
 low inter-class similarity: distinctive between clusters
 The quality of a clustering method depends on
 the similarity measure used by the method
 its implementation, and
 Its ability to discover some or all of the hidden patterns

6
Measure the Quality of Clustering
 Dissimilarity/Similarity metric
 Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
 The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical,
ordinal ratio, and vector variables
 Weights should be associated with different variables
based on applications and data semantics
 Quality of clustering:
 There is usually a separate “quality” function that
measures the “goodness” of a cluster.
 It is hard to define “similar enough” or “good enough”
 The answer is typically highly subjective
7
Considerations for Cluster Analysis
 Partitioning criteria
 Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
 Separation of clusters
 Exclusive (e.g., one customer belongs to only one region) vs. non-
exclusive (e.g., one document may belong to more than one
class)
 Similarity measure
 Distance-based (e.g., Euclidian, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
 Clustering space
 Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
8
Requirements and Challenges
 Scalability
 Clustering all the data instead of only on samples

 Ability to deal with different types of attributes


 Numerical, binary, categorical, ordinal, linked, and mixture of

these
 Constraint-based clustering
 User may give inputs on constraints
 Use domain knowledge to determine input parameters
 Interpretability and usability
 Others
 Discovery of clusters with arbitrary shape

 Ability to deal with noisy data

 Incremental clustering and insensitivity to input order

 High dimensionality

9
Major Clustering Approaches (I)

 Partitioning approach:
 Construct various partitions and then evaluate them by some

criterion, e.g., minimizing the sum of square errors


 Typical methods: k-means, k-medoids, CLARANS

 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects)

using some criterion


 Typical methods: Diana, Agnes, BIRCH, CAMELEON

 Density-based approach:
 Based on connectivity and density functions

 Typical methods: DBSCAN, OPTICS, DenClue

 Grid-based approach:
 based on a multiple-level granularity structure

 Typical methods: STING, WaveCluster, CLIQUE

10
Major Clustering Approaches (II)
 Model-based:
 A model is hypothesized for each of the clusters and tries to find

the best fit of that model to each other


 Typical methods: EM, SOM, COBWEB

 Frequent pattern-based:
 Based on the analysis of frequent patterns

 Typical methods: p-Cluster

 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific

constraints
 Typical methods: COD (obstacles), constrained clustering

 Link-based clustering:
 Objects are often linked together in various ways

 Massive links can be used to cluster objects: SimRank, LinkClus

11
Chapter 10. Cluster Analysis: Basic Concepts and
Methods

 Cluster Analysis: Basic Concepts

 Partitioning Methods

 Hierarchical Methods

 Density-Based Methods

 Grid-Based Methods

 Evaluation of Clustering

 Summary
12
Partitioning Algorithms: Basic Concept

 Partitioning method: Partitioning a database D of n objects into a set of


k clusters, such that the sum of squared distances is minimized (where
ci is the centroid or medoid of cluster Ci)

E = Σ_{i=1..k} Σ_{p∈Ci} (p − ci)²
 Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented
by the center of the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
13
The K-Means Clustering Method

 Given k, the k-means algorithm is implemented in four


steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the
clusters of the current partitioning (the centroid is
the center, i.e., mean point, of the cluster)
 Assign each object to the cluster with the nearest
seed point
 Go back to Step 2, stop when the assignment does
not change
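A brief scikit-learn sketch of these four steps (illustrative; data and parameters are arbitrary):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(100, 2)) for loc in ((0, 0), (5, 5), (0, 5))])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # final centroids (seed points)
print(km.labels_[:10])       # cluster assignment of each object
print(km.inertia_)           # sum of squared distances to the nearest centroid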

14
An Example of K-Means Clustering

K = 2

[Figure: the initial data set is arbitrarily partitioned into k groups; the cluster centroids are updated and objects reassigned, looping if needed]

 Partition objects into k nonempty subsets
 Repeat
   Compute the centroid (i.e., mean point) of each partition
   Assign each object to the cluster of its nearest centroid
 Until no change
15
Comments on the K-Means Method

 Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is


# iterations. Normally, k, t << n.
 Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
 Comment: Often terminates at a local optimal.
 Weakness
 Applicable only to objects in a continuous n-dimensional space
 Using the k-modes method for categorical data
 In comparison, k-medoids can be applied to a wide range of
data
 Need to specify k, the number of clusters, in advance (there are
ways to automatically determine the best k (see Hastie et al., 2009)
 Sensitive to noisy data and outliers
 Not suitable to discover clusters with non-convex shapes
16
Variations of the K-Means Method

 Most of the variants of the k-means which differ in

 Selection of the initial k means

 Dissimilarity calculations

 Strategies to calculate cluster means

 Handling categorical data: k-modes

 Replacing means of clusters with modes

 Using new dissimilarity measures to deal with categorical objects

 Using a frequency-based method to update modes of clusters

 A mixture of categorical and numerical data: k-prototype method

17
What Is the Problem of the K-Means Method?

 The k-means algorithm is sensitive to outliers !

 Since an object with an extremely large value may substantially


distort the distribution of the data

 K-Medoids: Instead of taking the mean value of the object in a cluster


as a reference point, medoids can be used, which is the most
centrally located object in a cluster


18
PAM: A Typical K-Medoids Algorithm
[Figure: K = 2. Arbitrarily choose k objects as initial medoids and assign each remaining object to the nearest medoid (total cost = 20). Then randomly select a non-medoid object Orandom, compute the total cost of swapping a medoid with Orandom (total cost = 26), and perform the swap if the quality is improved; repeat the loop until no change.]
19
The K-Medoid Clustering Method

 K-Medoids Clustering: Find representative objects (medoids) in clusters

 PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)

 Starts from an initial set of medoids and iteratively replaces one


of the medoids by one of the non-medoids if it improves the total
distance of the resulting clustering

 PAM works effectively for small data sets, but does not scale
well for large data sets (due to the computational complexity)

 Efficiency improvement on PAM

 CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples

 CLARANS (Ng & Han, 1994): Randomized re-sampling

20
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary

21
Hierarchical Clustering
 Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input, but
needs a termination condition
[Figure: over Steps 0 to 4, agglomerative clustering (AGNES) merges a, b, c, d, e step by step (a, b → ab; d, e → de; c, de → cde; ab, cde → abcde), while divisive clustering (DIANA) proceeds in the reverse order, Step 4 back to Step 0.]
22
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical packages, e.g., Splus
 Use the single-link method and the dissimilarity matrix
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster

23
Dendrogram: Shows How Clusters are Merged

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
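A small SciPy sketch of agglomerative clustering and cutting the dendrogram (illustrative data and linkage choice):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal((0, 0), 0.4, (20, 2)), rng.normal((4, 4), 0.4, (20, 2))])

Z = linkage(X, method='single')                   # AGNES-style, single-link merging
dendrogram(Z)                                     # the tree of nested clusters
plt.show()

labels = fcluster(Z, t=2, criterion='maxclust')   # cut the dendrogram into 2 clusters
print(labels)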

24
DIANA (Divisive Analysis)

 Introduced in Kaufmann and Rousseeuw (1990)


 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own


25
Distance between Clusters

 Single link: smallest distance between an element in one cluster


and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)

 Complete link: largest distance between an element in one cluster


and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)

 Average: avg distance between an element in one cluster and an


element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)

 Centroid: distance between the centroids of two clusters, i.e.,


dist(Ki, Kj) = dist(Ci, Cj)

 Medoid: distance between the medoids of two clusters, i.e., dist(Ki,


Kj) = dist(Mi, Mj)
 Medoid: a chosen, centrally located object in the cluster
26
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
 Centroid: the “middle” of a cluster
   Cm = ( Σi=1..N tip ) / N

 Radius: square root of average distance from any point of the cluster to its centroid
   Rm = sqrt( Σi=1..N (tip − cm)² / N )

 Diameter: square root of average mean squared distance between all pairs of points in the cluster
   Dm = sqrt( Σi=1..N Σj=1..N (tip − tjq)² / (N(N−1)) )

27
Extensions to Hierarchical Clustering
 Major weakness of agglomerative clustering methods

 Can never undo what was done previously

 Do not scale well: time complexity of at least O(n2),


where n is the number of total objects

 Integration of hierarchical & distance-based clustering

 BIRCH (1996): uses CF-tree and incrementally adjusts


the quality of sub-clusters

 CHAMELEON (1999): hierarchical clustering using


dynamic modeling
28
BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
 Zhang, Ramakrishnan & Livny, SIGMOD’96
 Incrementally construct a CF (Clustering Feature) tree, a hierarchical
data structure for multiphase clustering
 Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
 Phase 2: use an arbitrary clustering algorithm to cluster the leaf
nodes of the CF-tree
 Scales linearly: finds a good clustering with a single scan and improves
the quality with a few additional scans
 Weakness: handles only numeric data, and sensitive to the order of the
data record

29
Clustering Feature Vector in BIRCH

Clustering Feature (CF): CF = (N, LS, SS)

N: number of data points
LS: linear sum of the N points = Σi=1..N Xi
SS: square sum of the N points = Σi=1..N Xi²

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8):
CF = (5, (16,30), (54,190))

30
CF-Tree in BIRCH
 Clustering feature:
 Summary of the statistics for a given subcluster: the 0-th, 1st,

and 2nd moments of the subcluster from the statistical point


of view
 Registers crucial measurements for computing cluster and

utilizes storage efficiently


A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
 A nonleaf node in a tree has descendants or “children”

 The nonleaf nodes store sums of the CFs of their children

 A CF tree has two parameters


 Branching factor: max # of children

 Threshold: max diameter of sub-clusters stored at the leaf

nodes 31
The CF Tree Structure
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root and non-leaf nodes hold entries CF1, CF2, … with pointers to child nodes; leaf nodes hold CF entries and are chained with prev/next pointers.]

32
The Birch Algorithm
 Cluster Diameter: D = sqrt( (1 / (n(n−1))) · Σ (xi − xj)² )

 For each point in the input


 Find closest leaf entry

 Add point to leaf entry and update CF

 If entry diameter > max_diameter, then split leaf, and possibly

parents
 Algorithm is O(n)
 Concerns
 Sensitive to insertion order of data points

 Since we fix the size of leaf nodes, so clusters may not be so natural

 Clusters tend to be spherical given the radius and diameter

measures
33
CHAMELEON: Hierarchical Clustering Using
Dynamic Modeling (1999)
 CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
 Measures the similarity based on a dynamic model
 Two clusters are merged only if the interconnectivity
and closeness (proximity) between two clusters are
high relative to the internal interconnectivity of the
clusters and closeness of items within the clusters
 Graph-based, and a two-phase algorithm
1. Use a graph-partitioning algorithm: cluster objects into
a large number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm:
find the genuine clusters by repeatedly combining
these sub-clusters
34
Overall Framework of CHAMELEON

[Figure: Data Set → construct a sparse k-NN graph → partition the graph → merge partitions → final clusters]

K-NN graph: p and q are connected if q is among the top k closest neighbors of p
Relative interconnectivity: connectivity of c1 and c2 over internal connectivity
Relative closeness: closeness of c1 and c2 over internal closeness
35
CHAMELEON (Clustering Complex Objects)

36
Probabilistic Hierarchical Clustering
 Algorithmic hierarchical clustering
 Nontrivial to choose a good distance measure
 Hard to handle missing attribute values
 Optimization goal not clear: heuristic, local search
 Probabilistic hierarchical clustering
 Use probabilistic models to measure distances between clusters
 Generative model: Regard the set of data objects to be clustered as
a sample of the underlying data generation mechanism to be
analyzed
 Easy to understand, same efficiency as algorithmic agglomerative
clustering method, can handle partially observed data
 In practice, assume the generative models adopt common distributions
functions, e.g., Gaussian distribution or Bernoulli distribution, governed
by parameters
37
Generative Model
 Given a set of 1-D points X = {x1, …, xn} for clustering
analysis & assuming they are generated by a
Gaussian distribution:

 The probability that a point xi ∈ X is generated by the


model

 The likelihood that X is generated by the model:

 The task of learning the generative model: find the


parameters μ and σ2 such that the maximum likelihood

38
A Probabilistic Hierarchical Clustering Algorithm

 For a set of objects partitioned into m clusters C1, . . . ,Cm, the quality
can be measured by,

where P() is the maximum likelihood


 Distance between clusters C1 and C2:
 Algorithm: Progressively merge points and clusters
Input: D = {o1, ..., on}: a data set containing n objects
Output: A hierarchy of clusters
Method
Create a cluster for each object Ci = {oi}, 1 ≤ i ≤ n;
For i = 1 to n {
Find pair of clusters Ci and Cj such that
Ci,Cj = argmaxi ≠ j {log (P(Ci∪Cj )/(P(Ci)P(Cj ))};
If log (P(Ci∪Cj )/(P(Ci)P(Cj )) > 0 then merge Ci and Cj }
39
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary

40
Density-Based Clustering Methods

 Clustering based on density (local cluster criterion), such


as density-connected points
 Major features:
 Discover clusters of arbitrary shape

 Handle noise

 One scan

 Need density parameters as termination condition

 Several interesting studies:


 DBSCAN: Ester, et al. (KDD’96)

 OPTICS: Ankerst, et al (SIGMOD’99).

 DENCLUE: Hinneburg & D. Keim (KDD’98)

 CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)

41
Density-Based Clustering: Basic Concepts
 Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-
neighbourhood of that point
 NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
 Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if
 p belongs to NEps(q)
 core point condition: |NEps(q)| ≥ MinPts
[Figure: p is directly density-reachable from q, with MinPts = 5 and Eps = 1 cm]

42
Density-Reachable and Density-Connected

 Density-reachable:
   A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi
 Density-connected:
   A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
43
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
 Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
 Discovers clusters of arbitrary shape in spatial databases
with noise

Outlier

Border
Eps = 1cm
Core MinPts = 5

44
DBSCAN: The Algorithm
 Arbitrary select a point p

 Retrieve all points density-reachable from p w.r.t. Eps


and MinPts

 If p is a core point, a cluster is formed

 If p is a border point, no points are density-reachable


from p and DBSCAN visits the next point of the database

 Continue the process until all of the points have been


processed
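A short scikit-learn sketch (illustrative Eps and MinPts values; the library handles the point-by-point expansion internally):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # non-convex shapes

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print(set(labels))   # cluster ids; -1 marks noise/outlier points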

45
DBSCAN: Sensitive to Parameters

46
OPTICS: A Cluster-Ordering Method (1999)

 OPTICS: Ordering Points To Identify the Clustering


Structure
 Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)

 Produces a special order of the database wrt its

density-based clustering structure


 This cluster-ordering contains info equiv to the density-

based clusterings corresponding to a broad range of


parameter settings
 Good for both automatic and interactive cluster analysis,

including finding intrinsic clustering structure


 Can be represented graphically or using visualization

techniques
47
OPTICS: Some Extension from DBSCAN
 Index-based:
   k = number of dimensions, N = 20, p = 75%, M = N(1 − p) = 5
   Complexity: O(N log N)
 Core Distance:
   min eps s.t. the point is a core point
 Reachability Distance:
   max(core-distance(o), d(o, p))
[Figure: with MinPts = 5 and ε = 3 cm, r(p1, o) = 2.8 cm and r(p2, o) = 4 cm]
48
[Figure: reachability plot, reachability-distance (undefined, ε, ε′) plotted against the cluster-order of the objects]
49
Density-Based Clustering: OPTICS & Its Applications

50
DENCLUE: Using Statistical Density Functions

 DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)


 Using statistical density functions:
   Influence of y on x: fGaussian(x, y) = e^( −d(x, y)² / 2σ² )
   Total influence on x: f_D_Gaussian(x) = Σi=1..N e^( −d(x, xi)² / 2σ² )
   Gradient of x in the direction of xi: ∇f_D_Gaussian(x, xi) = Σi=1..N (xi − x) · e^( −d(x, xi)² / 2σ² )
 Major features
 Solid mathematical foundation
 Good for data sets with large amounts of noise
 Allows a compact mathematical description of arbitrarily shaped
clusters in high-dimensional data sets
 Significant faster than existing algorithm (e.g., DBSCAN)
 But needs a large number of parameters
51
Denclue: Technical Essence
 Uses grid cells but only keeps information about grid cells that do
actually contain data points and manages these cells in a tree-based
access structure
 Influence function: describes the impact of a data point within its
neighborhood
 Overall density of the data space can be calculated as the sum of the
influence function of all data points
 Clusters can be determined mathematically by identifying density
attractors
 Density attractors are local maximal of the overall density function
 Center defined clusters: assign to each density attractor the points
density attracted to it
 Arbitrary shaped cluster: merge density attractors that are connected
through paths of high density (> threshold)

52
Density Attractor

53
Center-Defined and Arbitrary

54
Chapter 10. Cluster Analysis: Basic Concepts and
Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary

55
Grid-Based Clustering Method

 Using multi-resolution grid data structure


 Several interesting methods
 STING (a STatistical INformation Grid approach) by

Wang, Yang and Muntz (1997)


 WaveCluster by Sheikholeslami, Chatterjee, and
Zhang (VLDB’98)
 A multi-resolution clustering approach using
wavelet method
 CLIQUE: Agrawal, et al. (SIGMOD’98)
 Both grid-based and subspace clustering

56
STING: A Statistical Information Grid Approach

 Wang, Yang and Muntz (VLDB’97)


 The spatial area is divided into rectangular cells
 There are several levels of cells corresponding to different
levels of resolution
1st layer

(i-1)st layer

i-th layer

57
The STING Clustering Method
 Each cell at a high level is partitioned into a number of
smaller cells in the next lower level
 Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
 Parameters of higher level cells can be easily calculated
from parameters of lower level cell
 count, mean, s, min, max

 type of distribution—normal, uniform, etc.

 Use a top-down approach to answer spatial data queries


 Start from a pre-selected layer—typically with a small
number of cells
 For each cell in the current level compute the confidence
interval
58
STING Algorithm and Its Analysis

 Remove the irrelevant cells from further consideration


 When finish examining the current layer, proceed to the
next lower level
 Repeat this process until the bottom layer is reached
 Advantages:
 Query-independent, easy to parallelize, incremental
update
 O(K), where K is the number of grid cells at the lowest
level
 Disadvantages:
 All the cluster boundaries are either horizontal or
vertical, and no diagonal boundary is detected

59
CLIQUE (Clustering In QUEst)

 Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)


 Automatically identifying subspaces of a high dimensional data space
that allow better clustering than original space
 CLIQUE can be considered as both density-based and grid-based
 It partitions each dimension into the same number of equal length
interval
 It partitions an m-dimensional data space into non-overlapping
rectangular units
 A unit is dense if the fraction of total data points contained in the unit
exceeds the input model parameter
 A cluster is a maximal set of connected dense units within a
subspace
60
CLIQUE: The Major Steps

 Partition the data space and find the number of points that
lie inside each cell of the partition.
 Identify the subspaces that contain clusters using the
Apriori principle
 Identify clusters
 Determine dense units in all subspaces of interests
 Determine connected dense units in all subspaces of
interests.
 Generate minimal description for the clusters
 Determine maximal regions that cover a cluster of
connected dense units for each cluster
 Determination of minimal cover for each cluster

61
[Figure: CLIQUE example with τ = 3, showing grids over salary (×10,000) vs. age and vacation (weeks) vs. age used to locate dense units]
62
Strength and Weakness of CLIQUE

 Strength
 automatically finds subspaces of the highest

dimensionality such that high density clusters exist in


those subspaces
 insensitive to the order of records in input and does not

presume some canonical data distribution


 scales linearly with the size of input and has good

scalability as the number of dimensions in the data


increases
 Weakness
 The accuracy of the clustering result may be degraded

at the expense of simplicity of the method


63
Chapter 10. Cluster Analysis: Basic Concepts and
Methods

 Cluster Analysis: Basic Concepts

 Partitioning Methods

 Hierarchical Methods

 Density-Based Methods

 Grid-Based Methods

 Evaluation of Clustering

 Summary
64
Assessing Clustering Tendency
 Assess if non-random structure exists in the data by measuring the
probability that the data is generated by a uniform data distribution
 Test spatial randomness by statistic test: Hopkins Static
 Given a dataset D regarded as a sample of a random variable o,

determine how far away o is from being uniformly distributed in


the data space
 Sample n points, p1, …, pn, uniformly from D. For each pi, find its

nearest neighbor in D: xi = min{dist (pi, v)} where v in D


 Sample n points, q1, …, qn, uniformly from D. For each qi, find its

nearest neighbor in D – {qi}: yi = min{dist (qi, v)} where v in D and


v ≠ qi
 Calculate the Hopkins Statistic: H = Σi=1..n yi / ( Σi=1..n xi + Σi=1..n yi )
 If D is uniformly distributed, Σ xi and Σ yi will be close to each other and H is close to 0.5. If D is highly skewed, H is close to 0
65
Determine the Number of Clusters
 Empirical method
 # of clusters ≈ √(n/2) for a dataset of n points

 Elbow method
 Use the turning point in the curve of sum of within cluster variance

w.r.t the # of clusters


 Cross validation method
 Divide a given data set into m parts

 Use m – 1 parts to obtain a clustering model

 Use the remaining part to test the quality of the clustering

 E.g., For each point in the test set, find the closest centroid, and

use the sum of squared distance between all points in the test set
and the closest centroids to measure how well the model fits the
test set
 For any k > 0, repeat it m times, compare the overall quality measure

w.r.t. different k’s, and find # of clusters that fits the data the best
66
Measuring Clustering Quality

 Two methods: extrinsic vs. intrinsic


 Extrinsic: supervised, i.e., the ground truth is available
 Compare a clustering against the ground truth using
certain clustering quality measure
 Ex. BCubed precision and recall metrics
 Intrinsic: unsupervised, i.e., the ground truth is unavailable
 Evaluate the goodness of a clustering by considering
how well the clusters are separated, and how compact
the clusters are
 Ex. Silhouette coefficient
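A minimal sketch of the intrinsic route with scikit-learn's silhouette coefficient (illustrative data and values of k):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
X = np.vstack([rng.normal((0, 0), 0.5, (50, 2)), rng.normal((6, 6), 0.5, (50, 2))])

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))   # closer to 1 means compact and well separated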

67
Measuring Clustering Quality: Extrinsic Methods

 Clustering quality measure: Q(C, Cg), for a clustering C


given the ground truth Cg.
 Q is good if it satisfies the following 4 essential criteria
 Cluster homogeneity: the purer, the better

 Cluster completeness: should assign objects belong to

the same category in the ground truth to the same


cluster
 Rag bag: putting a heterogeneous object into a pure

cluster should be penalized more than putting it into a


rag bag (i.e., “miscellaneous” or “other” category)
 Small cluster preservation: splitting a small category

into pieces is more harmful than splitting a large


category into pieces
68
Chapter 10. Cluster Analysis: Basic Concepts and
Methods

 Cluster Analysis: Basic Concepts

 Partitioning Methods

 Hierarchical Methods

 Density-Based Methods

 Grid-Based Methods

 Evaluation of Clustering

 Summary
69
Summary
 Cluster analysis groups objects based on their similarity and has
wide applications
 Measure of similarity can be computed for various types of data
 Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
 K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
 Birch and Chameleon are interesting hierarchical clustering
algorithms, and there are also probabilistic hierarchical clustering
algorithms
 DBSCAN, OPTICS, and DENCLU are interesting density-based
algorithms
 STING and CLIQUE are grid-based methods, where CLIQUE is also
a subspace clustering algorithm
 Quality of clustering results can be evaluated in various ways
70
CS512-Spring 2011: An Introduction
 Coverage
 Cluster Analysis: Chapter 11
 Outlier Detection: Chapter 12
 Mining Sequence Data: BK2: Chapter 8
 Mining Graphs Data: BK2: Chapter 9
 Social and Information Network Analysis
 BK2: Chapter 9
 Partial coverage: Mark Newman: “Networks: An Introduction”, Oxford U., 2010
 Scattered coverage: Easley and Kleinberg, “Networks, Crowds, and Markets:
Reasoning About a Highly Connected World”, Cambridge U., 2010
 Recent research papers
 Mining Data Streams: BK2: Chapter 8
 Requirements
 One research project
 One class presentation (15 minutes)
 Two homeworks (no programming assignment)
 Two midterm exams (no final exam)
71
References (1)
 R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace
clustering of high dimensional data for data mining applications. SIGMOD'98
 M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
 M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points
to identify the clustering structure, SIGMOD’99.
 Beil F., Ester M., Xu X.: "Frequent Term-Based Text Clustering", KDD'02
 M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based
Local Outliers. SIGMOD 2000.
 M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for
discovering clusters in large spatial databases. KDD'96.
 M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial
databases: Focusing techniques for efficient class identification. SSD'95.
 D. Fisher. Knowledge acquisition via incremental conceptual clustering.
Machine Learning, 2:139-172, 1987.
 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. VLDB’98.
 V. Ganti, J. Gehrke, R. Ramakrishan. CACTUS Clustering Categorical Data
Using Summaries. KDD'99.
72
References (2)
 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. In Proc. VLDB’98.
 S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for
large databases. SIGMOD'98.
 S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for
categorical attributes. In ICDE'99, pp. 512-521, Sydney, Australia, March
1999.
 A. Hinneburg, D.l A. Keim: An Efficient Approach to Clustering in Large
Multimedia Databases with Noise. KDD’98.
 A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Printice Hall,
1988.
 G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical
Clustering Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 68-75,
1999.
 L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to
Cluster Analysis. John Wiley & Sons, 1990.
 E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large
datasets. VLDB’98.

73
References (3)
 G. J. McLachlan and K.E. Bkasford. Mixture Models: Inference and Applications to
Clustering. John Wiley and Sons, 1988.
 R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
 L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional Data: A
Review, SIGKDD Explorations, 6(1), June 2004
 E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large
data sets. Proc. 1996 Int. Conf. on Pattern Recognition
 G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution
clustering approach for very large spatial databases. VLDB’98.
 A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based Clustering
in Large Databases, ICDT'01.
 A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles,
ICDE'01
 H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data
sets, SIGMOD’02
 W. Wang, Yang, R. Muntz, STING: A Statistical Information grid Approach to Spatial
Data Mining, VLDB’97
 T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : An efficient data clustering method
for very large databases. SIGMOD'96
 X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient Clustering via Heterogeneous
Semantic Links”, VLDB'06

74
Slides unused in class

75
A Typical K-Medoids Algorithm (PAM)
[Figure: K = 2. Arbitrarily choose k objects as initial medoids and assign each remaining object to the nearest medoid (total cost = 20). Then randomly select a non-medoid object Orandom, compute the total cost of swapping a medoid with Orandom (total cost = 26), and perform the swap if the quality is improved; repeat the loop until no change.]
76
PAM (Partitioning Around Medoids) (1987)

 PAM (Kaufman and Rousseeuw, 1987), built in Splus


 Use real object to represent the cluster
 Select k representative objects arbitrarily
 For each pair of non-selected object h and selected
object i, calculate the total swapping cost TCih
 For each pair of i and h,
 If TCih < 0, i is replaced by h
 Then assign each non-selected object to the most
similar representative object
 repeat steps 2-3 until there is no change
77
PAM Clustering: Finding the Best Cluster Center

 Case 1: p currently belongs to oj. If oj is replaced by orandom as a


representative object and p is the closest to one of the other
representative object oi, then p is reassigned to oi

78
What Is the Problem with PAM?

 Pam is more robust than k-means in the presence of


noise and outliers because a medoid is less influenced
by outliers or other extreme values than a mean
 Pam works efficiently for small data sets but does not
scale well for large data sets.
 O(k(n-k)2 ) for each iteration
where n is # of data,k is # of clusters
Sampling-based method
CLARA(Clustering LARge Applications)

79
CLARA (Clustering Large Applications)
(1990)

 CLARA (Kaufmann and Rousseeuw in 1990)


 Built in statistical analysis packages, such as SPlus
 It draws multiple samples of the data set, applies
PAM on each sample, and gives the best clustering
as the output
 Strength: deals with larger data sets than PAM
 Weakness:
 Efficiency depends on the sample size
 A good clustering based on samples will not
necessarily represent a good clustering of the whole
data set if the sample is biased
80
CLARANS (“Randomized” CLARA) (1994)
 CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han’94)
 Draws sample of neighbors dynamically

 The clustering process can be presented as searching a

graph where every node is a potential solution, that is, a


set of k medoids
 If the local optimum is found, it starts with new randomly

selected node in search for a new local optimum


 Advantages: More efficient and scalable than both PAM
and CLARA
 Further improvement: Focusing techniques and spatial
access structures (Ester et al.’95)

81
ROCK: Clustering Categorical Data

 ROCK: RObust Clustering using linKs


 S. Guha, R. Rastogi & K. Shim, ICDE’99

 Major ideas
 Use links to measure similarity/proximity

 Not distance-based

 Algorithm: sampling-based clustering


 Draw random sample

 Cluster with links

 Label data in disk

 Experiments
 Congressional voting, mushroom data

82
Similarity Measure in ROCK
 Traditional measures for categorical data may not work well, e.g.,
Jaccard coefficient
 Example: Two groups (clusters) of transactions
 C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e},
{a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
 C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
 Jaccard co-efficient may lead to wrong clustering result
 C1: 0.2 ({a, b, c}, {b, d, e}} to 0.5 ({a, b, c}, {a, b, d})
 C1 & C2: could be as high as 0.5 ({a, b, c}, {a, b, f})
 Jaccard co-efficient-based similarity function: Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
 Ex. Let T1 = {a, b, c}, T2 = {c, d, e}
   Sim(T1, T2) = |{c}| / |{a, b, c, d, e}| = 1/5 = 0.2

83
Link Measure in ROCK
 Clusters
 C1:<a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e},
{b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
 C2: <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
 Neighbors
 Two transactions are neighbors if sim(T1,T2) > threshold
 Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}
 T1 connected to: {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {b,c,d}, {b,c,e},
{a,b,f}, {a,b,g}
 T2 connected to: {a,c,d}, {a,c,e}, {a,d,e}, {b,c,e}, {b,d,e}, {b,c,d}

 T3 connected to: {a,b,c}, {a,b,d}, {a,b,e}, {a,b,g}, {a,f,g}, {b,f,g}

 Link Similarity
 Link similarity between two transactions is the # of common neighbors
 link(T1, T2) = 4, since they have 4 common neighbors
 {a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
 link(T1, T3) = 3, since they have 3 common neighbors
 {a, b, d}, {a, b, e}, {a, b, g}

84
Aggregation-Based Similarity Computation
[Figure: nodes n4 and n5 in SimTree ST2 with s(n4, n5) = 0.2; leaf nodes n10–n14 (linked from a and b in ST1) connect to n4 and n5 with weights 0.9, 1.0, 0.8, 0.9, 1.0]

For each node nk ∈ {n10, n11, n12} and nl ∈ {n13, n14}, their path-based similarity simp(nk, nl) = s(nk, n4)·s(n4, n5)·s(n5, nl).

sim(na, nb) = ( Σk=10..12 s(nk, n4) / 3 ) · s(n4, n5) · ( Σl=13..14 s(nl, n5) / 2 ) = 0.171, which takes O(3+2) time.

After aggregation, we reduce quadratic time computation to linear time computation.

86
Computing Similarity with Aggregation
[Figure: aggregated (average similarity, total weight) pairs: a: (0.9, 3), b: (0.95, 2); s(n4, n5) = 0.2]

sim(na, nb) can be computed from the aggregated similarities:

sim(na, nb) = avg_sim(na, n4) × s(n4, n5) × avg_sim(nb, n5) = 0.9 × 0.2 × 0.95 = 0.171
To compute sim(na,nb):
 Find all pairs of sibling nodes ni and nj, so that na linked with ni and nb
with nj.
 Calculate similarity (and weight) between na and nb w.r.t. ni and nj.
 Calculate weighted average similarity between na and nb w.r.t. all such
pairs.
87
Chapter 10. Cluster Analysis: Basic Concepts and
Methods

 Cluster Analysis: Basic Concepts


 Overview of Clustering Methods
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Summary
88
Link-Based Clustering: Calculate Similarities
Based On Links
[Figure: a linked graph of Authors (Tom, Mike, Cathy, John, Mary), Proceedings (sigmod03–05, vldb03–05, aaai04–05) and Conferences (sigmod, vldb, aaai)]

 The similarity between two objects x and y is defined as the average similarity between objects linked with x and those with y:

   sim(a, b) = ( C / (|I(a)|·|I(b)|) ) · Σi=1..|I(a)| Σj=1..|I(b)| sim(Ii(a), Ij(b))

Jeh & Widom, KDD’2002: SimRank. Two objects are similar if they are linked with the same or similar objects.

 Issue: Expensive to compute: for a dataset of N objects and M links, it takes O(N²) space and O(M²) time to compute all similarities.
objects compute all similarities.

89
Observation 1: Hierarchical Structures

 Hierarchical structures often exist naturally among objects (e.g., taxonomy of animals)

[Figure: a hierarchical structure of products in Walmart (All → grocery, electronics → TV, DVD, camera, apparel) and relationships between articles and words (Chakrabarti, Papadimitriou, Modha, Faloutsos, 2004)]
90
Observation 2: Distribution of Similarity
[Figure: distribution of SimRank similarities among DBLP authors, portion of entries vs. similarity value]

 Power law distribution exists in similarities


 56% of similarity entries are in [0.005, 0.015]

 1.4% of similarity entries are larger than 0.1

 Can we design a data structure that stores the significant

similarities and compresses insignificant ones?


91
A Novel Data Structure: SimTree
Each leaf node Each non-leaf node
represents an object represents a group
of similar lower-level
nodes
Similarities between
siblings are stored

Canon A40
digital camera
Digital
Sony V3 digital Cameras
Consumer Apparels
camera
electronics
TVs

92
Similarity Defined by SimTree
[Figure: a three-level SimTree with nodes n1–n9; similarities between sibling nodes are stored (e.g., s(n1, n2) = 0.2, s(n4, n5) = 0.3), and an adjustment ratio is associated with node n7]

 Path-based node similarity
   simp(n7, n8) = s(n7, n4) × s(n4, n5) × s(n5, n8)
 Similarity between two nodes is the average similarity between objects linked with them in other SimTrees
 Adjustment ratio for x = (average similarity between x and all other nodes) / (average similarity between x’s parent and all other nodes)

93
LinkClus: Efficient Clustering via
Heterogeneous Semantic Links
Method
 Initialize a SimTree for objects of each type

 Repeat until stable

 For each SimTree, update the similarities between its

nodes using similarities in other SimTrees


 Similarity between two nodes x and y is the average

similarity between objects linked with them


 Adjust the structure of each SimTree

 Assign each node to the parent node that it is most

similar to
For details: X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient
Clustering via Heterogeneous Semantic Links”, VLDB'06
94
Initialization of SimTrees
 Initializing a SimTree
 Repeatedly find groups of tightly related nodes, which

are merged into a higher-level node


 Tightness of a group of nodes
 For a group of nodes {n1, …, nk}, its tightness is

defined as the number of leaf nodes in other SimTrees


that are connected to all of {n1, …, nk}
Nodes Leaf nodes in
another SimTree
1
n1
2
3 The tightness of {n1, n2} is 3
n2 4
5

95
Finding Tight Groups by Freq. Pattern Mining
 Finding tight groups Frequent pattern mining
Reduced to
Transactions
n1 1 {n1}
The tightness of a g1 2 {n1, n2}
group of nodes is the n2 3 {n2}
4 {n1, n2}
support of a frequent 5 {n1, n2}
pattern n3 6 {n2, n3, n4}
g2 7 {n4}
n4 8 {n3, n4}
9 {n3, n4}
 Procedure of initializing a tree
 Start from leaf nodes (level-0)

 At each level l, find non-overlapping groups of similar


nodes with frequent pattern mining
96
Adjusting SimTree Structures

n1 n2 n3

0.9
n4 n5 n6
0.8

n7 n7 n8 n9

 After similarity changes, the tree structure also needs to be


changed
 If a node is more similar to its parent’s sibling, then move

it to be a child of that sibling


 Try to move each node to its parent’s sibling that it is

most similar to, under the constraint that each parent


node can have at most c children
97
Complexity

For two types of objects, N in each, and M linkages between them.

                            Time          Space
Updating similarities       O(M(logN)²)   O(M+N)
Adjusting tree structures   O(N)          O(N)
LinkClus                    O(M(logN)²)   O(M+N)
SimRank                     O(M²)         O(N²)

98
Experiment: Email Dataset
 F. Nielsen. Email dataset. Approach Accuracy time (s)
www.imm.dtu.dk/~rem/data/Email-1431.zip
 370 emails on conferences, 272 on jobs, LinkClus 0.8026 1579.6
and 789 spam emails SimRank 0.7965 39160
 Accuracy: measured by manually labeled ReCom 0.5711 74.6
data
F-SimRank 0.3688 479.7
 Accuracy of clustering: % of pairs of objects
in the same cluster that share common label CLARANS 0.4768 8.55

 Approaches compared:
 SimRank (Jeh & Widom, KDD 2002): Computing pair-wise similarities
 SimRank with FingerPrints (F-SimRank): Fogaras & R´acz, WWW 2005
 pre-computes a large sample of random paths from each object and uses
samples of two objects to estimate SimRank similarity
 ReCom (Wang et al. SIGIR 2003)
 Iteratively clustering objects using cluster labels of linked objects

99
WaveCluster: Clustering by Wavelet Analysis (1998)

 Sheikholeslami, Chatterjee, and Zhang (VLDB’98)


 A multi-resolution clustering approach which applies wavelet transform
to the feature space; both grid-based and density-based
 Wavelet transform: A signal processing technique that decomposes a
signal into different frequency sub-band
 Data are transformed to preserve relative distance between objects

at different levels of resolution


 Allows natural clusters to become more distinguishable

100
The WaveCluster Algorithm
 How to apply the wavelet transform to find clusters
    Summarize the data by imposing a multidimensional grid structure onto
      the data space
    These multidimensional spatial data objects are represented in an
      n-dimensional feature space
    Apply the wavelet transform on the feature space to find the dense
      regions in the feature space
    Apply the wavelet transform multiple times, which results in clusters
      at different scales, from fine to coarse (see the sketch below)
 Major features:
    Complexity O(N)
    Detects arbitrarily shaped clusters at different scales
    Not sensitive to noise, not sensitive to input order
    Only applicable to low-dimensional data
101
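The grid-plus-wavelet idea can be illustrated in a few lines. The sketch below is not the original WaveCluster implementation: it quantizes toy 2-D points onto a grid of counts and applies one level of a 2-D Haar transform, whose low-frequency (approximation) sub-band is a smoothed, coarser grid in which dense regions stand out. It assumes the PyWavelets package (pywt) is installed, and the final connected-component labeling of dense cells is omitted.

import numpy as np
import pywt  # PyWavelets, assumed available (pip install PyWavelets)

# Toy 2-D data: two dense blobs plus uniform background noise
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal((2, 2), 0.3, size=(300, 2)),
    rng.normal((7, 7), 0.3, size=(300, 2)),
    rng.uniform(0, 10, size=(100, 2)),
])

# Step 1: quantize the feature space into an m x m grid of point counts
m = 64
grid, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=m, range=[[0, 10], [0, 10]])

# Step 2: one level of the 2-D wavelet transform; the approximation sub-band
# is the grid at a coarser scale where dense regions remain prominent
approx, (horiz, vert, diag) = pywt.dwt2(grid, "haar")

# Step 3 (crude): mark unusually dense cells in the coarse grid; connected
# dense cells would then be reported as clusters
dense_cells = approx > approx.mean() + 2 * approx.std()
print("dense cells at the coarser scale:", int(dense_cells.sum()))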
Quantization & Transformation
 Quantize data into an m-D grid structure, then apply the wavelet transform
    a) scale 1: high resolution
    b) scale 2: medium resolution
    c) scale 3: low resolution
102
Question 15 (Complete, Mark 1.00 out of 1.00)
Consider two posting lists X and Y with lengths x and y. If the largest Doc IDs in X and Y
are n1 and n2 respectively, what is the worst-case time complexity that the merge takes?

Select one:
a. O(n1+n2) operations
b. O(y·n2) operations
c. O(x·n1) operations
d. O(x+y) operations
e. None of the above

The correct answer is: O(x+y) operations


Question 14 (Complete, Mark 1.00 out of 1.00)
Posting lists for Brutus & Caesar are as follows. How many iterations do you need in the
intersect (merge) step when processing the boolean query: Brutus AND Caesar?
Brutus = 2 4 8 16 32 64 128
Caesar = 1 2 3 5 8 13 21 34

Answer: 11

The correct answer is: 11 (a worked sketch of the merge follows below)
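Both of the last two questions come down to the linear two-pointer merge of sorted postings lists. The sketch below counts the loop iterations; on the Brutus and Caesar lists above it performs 11 comparisons, and in general it needs at most x + y iterations, hence O(x+y).

def intersect(p1, p2):
    """Two-pointer intersection of two sorted postings lists.
    Returns (common doc IDs, number of loop iterations)."""
    i = j = iterations = 0
    answer = []
    while i < len(p1) and j < len(p2):
        iterations += 1
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer, iterations

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # ([2, 8], 11)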



Question 13 (Complete, Mark 1.00 out of 1.00)
What are the search functions provisioned by the Boolean retrieval model?

Select one or more:
a. Ranked Search
b. Proximity Search
c. Phrase Search
d. Zone search in documents
e. None of the above

The correct answer is: None of the above


Question 12 (Complete, Mark 0.00 out of 1.00)
The content of a research paper can be identified as:

Select one:
a. Structured data
b. Semi-structured data
c. Unstructured data
d. None of the above

The correct answer is: Structured data


Question 11 (Complete, Mark 1.00 out of 1.00)
Consider 1M documents where each document contains an average of 100 unique terms. If the
total unique term count of the corpus is 500,000, what is the average percentage of 1's in the
binary term-document incidence matrix?
(Give the answer to 2 decimal places. Don't input the % mark.)

Answer: 0.02

The correct answer is: 0.02 (the arithmetic is spelled out below)
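The 0.02 comes from a simple density calculation, spelled out in the short sketch below.

docs = 1_000_000          # documents in the corpus
terms = 500_000           # unique terms in the corpus
avg_unique_per_doc = 100  # average number of unique terms per document

ones = docs * avg_unique_per_doc     # non-zero entries: 10^8
cells = docs * terms                 # matrix cells: 5 * 10^11
print(round(100 * ones / cells, 2))  # 0.02 (percent of 1's)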



Question 10 (Complete, Mark 1.00 out of 1.00)
A confusion matrix is given below. What is the recall of the retrieval system? (Give the
answer to three decimal places.)

                                           Identified as documents   Identified as documents not
                                           containing Brutus         containing Brutus
  Documents actually containing Brutus     50                        10
  Documents not actually containing         5                       100
  Brutus

Answer: 0.833

The correct answer is: 0.833 (precision and recall for this matrix are worked out below)
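From this confusion matrix, recall and precision follow directly; the same matrix is reused for precision in Question 8 below.

tp, fn = 50, 10   # documents containing Brutus: correctly retrieved vs missed
fp, tn = 5, 100   # documents not containing Brutus: wrongly retrieved vs correctly ignored

recall = tp / (tp + fn)      # 50 / 60 = 0.833...
precision = tp / (tp + fp)   # 50 / 55 = 0.909...
print(round(recall, 3), round(precision, 3))   # 0.833 0.909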



Question 9 (Complete, Mark 1.00 out of 1.00)
Decide the query processing order for
(tangerine OR marmalade) AND (trees OR skies) AND (kaleidoscope OR eyes)

  Term           Postings size
  eyes           213312
  kaleidoscope    87009
  marmalade      107913
  skies          271658
  tangerine       46653
  trees          316812

Select one:
a. (marmalade OR skies) AND (kaleidoscope OR eyes) AND (tangerine OR trees)
b. (tangerine OR trees) AND (kaleidoscope OR eyes) AND (marmalade OR skies)
c. (kaleidoscope OR eyes) AND (marmalade OR skies) AND (tangerine OR trees)
d. (kaleidoscope OR eyes) AND (tangerine OR trees) AND (marmalade OR skies)

The correct answer is: (kaleidoscope OR eyes) AND (tangerine OR trees) AND (marmalade OR skies)
(the ordering heuristic is sketched below)
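The heuristic is to execute the AND of the OR-groups in increasing order of their estimated result sizes, where an OR-group's size is conservatively estimated by the sum of its terms' document frequencies. The sketch below ranks the OR-groups that appear in the answer options using the postings sizes from the table.

df = {"eyes": 213312, "kaleidoscope": 87009, "marmalade": 107913,
      "skies": 271658, "tangerine": 46653, "trees": 316812}

# Upper bound on |A OR B| is df(A) + df(B); process the smallest group first
groups = [("kaleidoscope", "eyes"), ("tangerine", "trees"), ("marmalade", "skies")]
for a, b in sorted(groups, key=lambda g: df[g[0]] + df[g[1]]):
    print(f"({a} OR {b})", df[a] + df[b])
# (kaleidoscope OR eyes) 300321
# (tangerine OR trees) 363465
# (marmalade OR skies) 379571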



Question 8 (Complete, Mark 1.00 out of 1.00)
A confusion matrix is given below. What is the precision of the retrieval system? (Give the
answer to three decimal places.)

                                           Identified as documents   Identified as documents not
                                           containing Brutus         containing Brutus
  Documents actually containing Brutus     50                        10
  Documents not actually containing         5                       100
  Brutus

Answer: 0.909

The correct answer is: 0.909


Question 7 (Complete, Mark 1.00 out of 1.00)
Match the terms with the most relevant description.

  Measure the relevance of each document to a user query              -> Ranking
  Group a set of documents based on the content                       -> Clustering
  Given a set of topics, decide the most relevant topic a document    -> Classification
  belongs to

The correct answer is: Measure the relevance of each document to a user query -> Ranking;
Group a set of documents based on the content -> Clustering; Given a set of topics, decide the
most relevant topic a document belongs to -> Classification


Question 6 (Complete, Mark 0.50 out of 1.00)
Why is finding and recording document (term) frequencies important in an information retrieval
system?

Select one or more:
a. To improve the efficiency of the search engine at query time.
b. To improve the storage efficiency of the search engine.
c. To use in ranked retrieval systems.
d. None of the above

The correct answers are: to improve the efficiency of the search engine at query time; to use
in ranked retrieval systems.


Question 5 (Complete, Mark 1.00 out of 1.00)
What can be considered as information retrieval?

Select one or more:
a. Coreference resolution
b. Automatic annotation and content identification from images, audio and video.
c. Find titles according to the relevance with a given query.
d. Find relationships between named entities.
e. Find named entities in a text document.
f. Find diagrams related to the given query.

The correct answers are: Find titles according to the relevance with a given query; Find
diagrams related to the given query.


Question 4 (Complete, Mark 0.00 out of 1.00)
What can not be considered as a major step in inverted index construction?

Select one or more:
a. Index the documents that each term occurs in.
b. Tokenize the text
c. Perform linguistic preprocessing of tokens
d. None of the above.
e. Collect the documents to be indexed

The correct answers are: Collect the documents to be indexed, Tokenize the text, Perform
linguistic preprocessing of tokens, Index the documents that each term occurs in.
(a minimal index-construction sketch follows below)
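The four steps listed in this question (collect, tokenize, linguistically preprocess, index) are the classic inverted-index construction pipeline. A minimal sketch, with lower-casing standing in for real linguistic preprocessing and made-up toy documents, is shown below.

from collections import defaultdict

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}

def tokenize(text):
    return text.split()

def preprocess(token):
    return token.lower()   # stand-in for stemming, stop-word removal, etc.

inverted_index = defaultdict(list)     # term -> sorted list of doc IDs
for doc_id in sorted(docs):            # step 1: collect the documents
    seen = set()
    for tok in tokenize(docs[doc_id]):        # step 2: tokenize the text
        term = preprocess(tok)                # step 3: linguistic preprocessing
        if term not in seen:                  # step 4: index the documents per term
            inverted_index[term].append(doc_id)
            seen.add(term)

print(inverted_index["sales"])   # [1, 2, 3]
print(inverted_index["july"])    # [2, 3]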



Question 3 (Complete, Mark 1.00 out of 1.00)
What statistic can be used in query optimization?

Select one:
a. Size of the dictionary
b. Document (term) frequency
c. Number of words in the user query
d. Number of ANDs and ORs in the user query
e. None of the above

The correct answer is: Document (term) frequency


Question 2 (Complete, Mark 1.00 out of 1.00)
Match the various IR systems with the most relevant description.

  Search and retrieve patient records    -> Institutional IR system
  Search a user profile in Facebook      -> Web search and social media information retrieval
  Search in an email client program      -> Personal IR system

The correct answer is: Search and retrieve patient records -> Institutional IR system; Search a
user profile in Facebook -> Web search and social media information retrieval; Search in an
email client program -> Personal IR system


Question 1 (Complete, Mark 1.00 out of 1.00)
In real-world scenarios, a binary term-document incidence matrix would tend to be a ________
matrix.

The correct answer is: sparse


Question 15 (Complete, Mark 0.00 out of 1.00)
Select the statement which is not correct.

Select one:
a. The new trend is to keep stop words in posting lists
b. Stop words do not have good semantic power
c. Removing stop words reduces memory overhead.
d. The importance of stop words is context dependent.

The correct answer is: The new trend is to keep stop words in posting lists


Question 14 (Complete, Mark 0.00 out of 1.00)
If you were to create a spider to crawl the web, which of the following actions should you not
perform?

Select one:
a. Implementing a fast-crawling robot to make sure the process finishes within a reasonable
   amount of time.
b. Announcing your intentions and using the HTTP user-agent to identify your robot
c. Checking available crawled data from other robots, so you may not need to implement your
   own robot
d. Keeping your crawler's raw data and sharing the results publicly.

The correct answer is: Implementing a fast-crawling robot to make sure the process finishes
within a reasonable amount of time.


Question 13 (Complete, Mark 1.00 out of 1.00)
Using Unicode for encoding Asian languages is popular nowadays. Select the correct statements
about Unicode encoding.

Select one or more:
a. In Unicode, there can be scenarios where more than one code point is used to represent one
   character.
b. Since some languages don't use spaces between tokens, a unique set of tokens can't always
   be guaranteed.
c. UTF-8 is more memory efficient than UTF-16
d. Using Unicode could expose systems to security vulnerabilities.

The correct answers are: In Unicode, there can be scenarios where more than one code point is
used to represent one character; Using Unicode could expose systems to security
vulnerabilities; Since some languages don't use spaces between tokens, a unique set of tokens
can't always be guaranteed.


Question 12 (Complete, Mark 1.00 out of 1.00)
What are the features of the Apache Solr platform?

Select one:
a. It is a scalable platform that offers maximum performance with real-time indexing
   facilities.
b. It is based on standardized open interfaces like JSON, XML and HTTP.
c. Flexible faceting: advanced, adaptable search behavior that can be customized based on your
   needs. Faceting is the arrangement of search results based on real-time indexing of document
   fields.
d. All of these

The correct answer is: All of these


Question 11 (Complete, Mark 0.00 out of 1.00)
A crude heuristic process that chops off the ends of words to reduce inflectional forms of
words and reduce the size of the vocabulary is called:

Select one:
a. Lemmatization
b. Stemming
c. Case Folding
d. True casing

The correct answer is: Stemming


Question 10 (Complete, Mark 1.00 out of 1.00)
What are the different types of query parameters in Apache Solr?

Select one:
a. q
b. fq
c. start
d. rows
e. All of these

The correct answer is: All of these


Question 9 (Complete, Mark 1.00 out of 1.00)
Assume you want to run a range query. You are given a B-tree to store indices. What is the
time complexity?

Select one:
a. O(log(n))
b. O(n)
c. O(n log(n))
d. O(log(n^(1/2)))

The correct answer is: O(log(n))


Question 8 (Complete, Mark 1.00 out of 1.00)
A user wants to query 'Moratuwa University Katubedda'. Usage of a bi-word index for querying
would cause an increase in:

Select one or more:
a. Size of the index table
b. True Positives
c. False Negatives
d. True Negatives
e. False Positives

The correct answers are: Size of the index table, False Positives


Question 7 (Complete, Mark 1.00 out of 1.00)
The Porter Stemmer algorithm consists of several steps. What are the interim changes and the
final output for the word: differently

Select one:
a. differently -> differentli -> different
b. differently -> different
c. differently -> different -> differ
d. differently -> differentl -> different

The correct answer is: differently -> differentli -> different


Question 6 (Complete, Mark 1.00 out of 1.00)
The most relevant statement regarding the Porter Algorithm is:

Select one:
a. A simple and efficient way to do stemming, stripping off affixes.
b. It is an algorithm for word tokenization.
c. It is an algorithm for sentence tokenization.
d. None of the above

The correct answer is: A simple and efficient way to do stemming, stripping off affixes.


Question 5 (Complete, Mark 1.00 out of 1.00)
The process of matching USA and U.S.A. into one class is called

Answer: Normalization

The correct answer is: Normalization


Question 4 (Complete, Mark 0.00 out of 1.00)
A data structure that maps terms back to the parts of a document in which they occur is
called an:

Select one:
a. Dictionary
b. Inverted Index
c. Incidence Matrix
d. Postings list

The correct answer is: Inverted Index


Question 3 (Complete, Mark 1.00 out of 1.00)
How many types are in the following sentence?
"However, this strategy depends on user training, since if you query using either of the other
two forms, you get no generalization."

Answer: 21

The correct answer is: 21 (a quick check is sketched below)
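Types are the distinct tokens of the sentence. A quick check (whitespace tokenization, punctuation stripped, case kept) is sketched below: the sentence has 22 tokens and 'you' occurs twice, giving 21 types; other tokenization conventions could give a slightly different count.

import string

sentence = ("However, this strategy depends on user training, since if you query "
            "using either of the other two forms, you get no generalization.")

tokens = [t.strip(string.punctuation) for t in sentence.split()]
print(len(tokens), len(set(tokens)))   # 22 21  ('you' is the only repeated token)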



Question 2 (Complete, Mark 1.00 out of 1.00)
An index that includes sequences of words or terms of variable length that have been extracted
from a source document is called a:

Select one:
a. Bi-word index
b. Phrase Index
c. Positional index
d. Inverted Index

The correct answer is: Phrase Index


Question 1 (Complete, Mark 0.00 out of 1.00)
Assume you want to run a range query. You are given a Hashmap to store indices. What is the
time complexity?

Select one:
a. O(log(n))
b. O(n)
c. O(1)
d. O(n log(n))

The correct answer is: O(n)


Question 10 (Complete, Mark 1.00 out of 1.00)
You are given a document and a query as follows. You are asked to fill the hidden values of the
following table. What are the normalized values for the document?

N (number of documents) = 1,000,000
Document = football needs eleven football players
Query = best football players

  Term       Query tf-raw   Query tf-wt   df       Document tf-raw
  best       1              1             2500     0
  eleven     0              0             1000     1
  football   1              1             10000    2
  needs      0              0             7500     1
  players    1              1             2500     1
  (the idf, wt and normalized columns for the query, the tf-wt, wt and normalized columns for
  the document, and the Prod column are the hidden values to be filled in)

Select one:
a. best => 0.2, eleven => 0, football => 0.50, need => 0, players => 0.36
b. best => 0.2, eleven => 0.36, football => 0.50, need => 0.36, players => 0.36
c. best => 0, eleven => 0.46, football => 0.60, need => 0.46, players => 0.46
d. best => 0, eleven => 0.36, football => 0.50, need => 0.36, players => 0.36


Question 9 (Complete, Mark 1.00 out of 1.00)
You are given a document and a query as follows. You are asked to fill the hidden values of the
following table (the same partially filled table as in Question 10). What is the similarity
score between the document and the query?

N (number of documents) = 1,000,000
Document = football needs eleven football players
Query = best football players

Select one:
a. 0.78
b. 0.50
c. 0.48
d. 0.58


Question 8 (Complete, Mark 1.00 out of 1.00)
You are given a document and a query as follows. You are asked to fill the hidden values of the
following table (the same partially filled table as in Question 10; Questions 8, 9 and 10 are
based on the given document and query). What are the normalized values for the query?

N (number of documents) = 1,000,000
Document = football needs eleven football players
Query = best football players

Select one:
a. best => 0.52, eleven => 0, football => 0, need => 0.2, players => 0.62
b. best => 0.52, eleven => 0, football => 0.58, need => 0, players => 0.54
c. best => 0.52, eleven => 0, football => 0.38, need => 0, players => 0.52
d. best => 0.62, eleven => 0, football => 0.48, need => 0, players => 0.62
(a worked tf-idf computation for Questions 8-10 is sketched below)
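The scan does not show which options were marked correct for Questions 8-10, and the weighting scheme is not stated, so the sketch below is only one plausible reading: it assumes the common lnc.ltc SMART scheme (document: log tf, no idf, cosine normalization; query: log tf, idf, cosine normalization) with base-10 logarithms. A different log base or weighting variant would change the numbers.

import math

N = 1_000_000
df     = {"best": 2500, "eleven": 1000, "football": 10000, "needs": 7500, "players": 2500}
doc_tf = {"best": 0,    "eleven": 1,    "football": 2,     "needs": 1,    "players": 1}
qry_tf = {"best": 1,    "eleven": 0,    "football": 1,     "needs": 0,    "players": 1}

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def cosine_normalize(weights):
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: (w / norm if norm else 0.0) for t, w in weights.items()}

doc_w = cosine_normalize({t: log_tf(tf) for t, tf in doc_tf.items()})                          # lnc
qry_w = cosine_normalize({t: log_tf(tf) * math.log10(N / df[t]) for t, tf in qry_tf.items()})  # ltc

print({t: round(w, 2) for t, w in doc_w.items()})
# {'best': 0.0, 'eleven': 0.46, 'football': 0.6, 'needs': 0.46, 'players': 0.46}
print({t: round(w, 2) for t, w in qry_w.items()})
# {'best': 0.62, 'eleven': 0.0, 'football': 0.48, 'needs': 0.0, 'players': 0.62}
print(round(sum(doc_w[t] * qry_w[t] for t in df), 2))   # cosine similarity ~ 0.57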



Question 7 (Complete, Mark 1.00 out of 1.00)
The tf-idf weight is lower when a term t occurs many times in a document or occurs in
relatively few documents.

Select one:
True
False


Question 5 (Complete, Mark 1.00 out of 1.00)
The tf-idf weight is highest when a term t occurs many times within a small number of
documents.

Select one:
True
False

Question 6 (Complete, Mark 1.00 out of 1.00)
The tf-idf weight is a metric derived by taking the log of N divided by the document frequency,
where N is the total number of documents in a collection.

Select one:
True
False
(the defining formulas are illustrated below)
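For reference behind these true/false statements, the usual definitions are idf_t = log10(N / df_t) and, in one common variant, tf-idf = (1 + log10(tf)) * idf_t, so the weight grows with the term frequency in the document and shrinks as the term appears in more documents. A tiny numeric illustration with made-up values:

import math

def idf(N, df):
    return math.log10(N / df)

def tf_idf(tf, N, df):
    return (1 + math.log10(tf)) * idf(N, df) if tf > 0 else 0.0

# Illustrative numbers only: a term occurring in 2,500 of 1,000,000 documents
print(round(idf(1_000_000, 2500), 3))        # 2.602
print(round(tf_idf(3, 1_000_000, 2500), 3))  # 3.844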



Question 4 (Complete, Mark 1.00 out of 1.00)
Calculate and find the most relevant two documents with respect to cosine similarity.

  term        WH   PaP   SaS
  affection   80    60    40
  jealous     30    30    10
  gossip       0    10    10
  wuthering    0     0    20

Select one:
a. PaP and WH
b. SaS and WH
c. PaP and SaS
d. The information is not enough to calculate cosine similarity
(the pairwise cosines are computed below)
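The marked answer is not visible in the scan; a quick check is to compute the three pairwise cosines directly from the raw count vectors, as sketched below (an exercise may instead expect log-weighted term frequencies, which can change the numbers).

import math
from itertools import combinations

vectors = {
    "WH":  [80, 30, 0, 0],      # affection, jealous, gossip, wuthering
    "PaP": [60, 30, 10, 0],
    "SaS": [40, 10, 10, 20],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

for d1, d2 in combinations(vectors, 2):
    print(d1, d2, round(cosine(vectors[d1], vectors[d2]), 3))
# WH PaP 0.984
# WH SaS 0.873
# PaP SaS 0.88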



Question 3 (Complete, Mark 1.00 out of 1.00)
A measure of similarity between two vectors which is determined by measuring the angle between
them is called:

Select one or more:
a. Vector similarity
b. Vector scoring
c. Cosine similarity
d. Sine similarity


Question 1 (Complete, Mark 0.50 out of 1.00)
Pick the correct statement(s).

Select one or more:
a. Vector representation considers the ordering of words in a document.
b. Inverse document frequency affects the ranking of term queries.
c. The Jaccard coefficient doesn't consider term frequency.
d. For a given term t, collection frequency >= document frequency.

Question 2 (Complete, Mark 1.00 out of 1.00)
Calculate the Jaccard coefficient between the following query and the document.
Query: idea of march
Document: caesar died in march

Select one:
a. 0.154
b. 0.143
c. 0.167
d. 0.2


[Continuation of Question 8; the question setup appears below]
What is the precision of the system for the given query?
Provide the answer to one decimal place.

Answer: 0.4

Question 9 (Correct, Mark 1.00 out of 1.00)
What is the recall of the system for the given query?
Provide the answer to one decimal place.

Answer: 0.5

Question 10 (Correct, Mark 1.00 out of 1.00)
Which of the following will help to improve the recall of this query?

Select one:
a. Stemming ✓
b. Spell Correction
c. Skip List Postings
d. Language Detection


Question 8 (Correct, Mark 1.00 out of 1.00)
Questions 8 to 10 are based on the given library IR system:
A user searched a library IR system with the following query: "information systems"
The IR system responded with the following list of results:
1. Introduction to Information Systems
2. Information on Signals & Systems
3. Information Systems: Theory & Practice
4. Operating Systems for Information Era
5. Management Systems

The corpus of this library system consists of the following 10 books:
1. Introduction to Information Systems
2. Information System Applications
3. Information on Signals & Systems
4. Information Systems: Theory & Practice
5. Operating Systems for Information Era
6. Management Systems
7. Algorithms
8. Databases & SQL
9. Linux is the Next Best
10. Information System


Question 6 (Correct, Mark 1.00 out of 1.00)
Stemming should be invoked at indexing time but not while processing a query. True or False?

Select one:
True
False ✓

Question 7 (Partially correct, Mark 0.80 out of 1.00)
Query: ides of march
Document: the long windy march
Calculate the Jaccard coefficient between the query and the document.

Answer: 0.167 (the set-based computation is sketched below)
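Treating the query and the document as sets of tokens, the Jaccard coefficient is |intersection| / |union|. The sketch below reproduces the recorded 0.167 (= 1/6) for this pair and, under the same set-based definition, gives the same value for the earlier 'idea of march' question.

def jaccard(a, b):
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

print(round(jaccard("ides of march", "the long windy march"), 3))   # 0.167  (1/6)
print(round(jaccard("idea of march", "caesar died in march"), 3))   # 0.167  (1/6)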



Question 4 (Incorrect, Mark 0.00 out of 1.00)
The process of matching vehicle and automobile into one class is called

Answer: Synonyms ✗

Question 5 (Correct, Mark 1.00 out of 1.00)
In a Boolean retrieval system, stemming never lowers precision. True or False?

Select one:
True
False ✓


Question 3 (Correct, Mark 1.00 out of 1.00)
The following data structures are used for the dictionary in an inverted index:
a) B-Tree
b) Binary Tree
c) HashMap

Which of the following orders the data structures by their average lookup time in increasing
order?

Select one:
1. a, b, c
2. c, b, a
3. c, a, b ✓
4. None of the above


Question 1 (Correct, Mark 1.00 out of 1.00)
Select the one correct answer.
In the vector space model, the dimensionality of the query vector is:

Select one:
a. the number of documents in the collection
b. the number of terms in the index ✓
c. the number of words in the collection
d. the number of unique words in the query

Question 2 (Partially correct, Mark 0.33 out of 1.00)
Order the indexes listed below by increasing recall.

  Term index with stemming       -> highest recall ✓
  Term index without stemming    -> 3rd highest recall ✗
  Biword index without stemming  -> 2nd highest recall ✗


Question 9 (Correct, Mark 1.00 out of 1.00)
The best implementation approach for dynamic indexing is:

Select one:
a. Using logarithmic merge ✓
b. Using an invalidation bit-vector for deleted docs
c. Periodic re-indexing
d. None

Question 10 (Correct, Mark 1.00 out of 1.00)
Permuterm indices are used for solving:

Select one:
a. Wildcard queries ✓
b. Boolean queries
c. None
d. Phrase queries


Question 7 (Correct, Mark 1.00 out of 1.00)
Any string of terms of the following form is called an extended biword:

Select one:
a. NNX*
c. NXNN
d. *NNX

Question 8 (Correct, Mark 1.00 out of 1.00)
Variable-size postings lists are used when:

Select one:
a. Less seek time is desired and the corpus is static
b. Less seek time is desired and the corpus is dynamic
c. More seek time is desired and the corpus is dynamic ✓
d. More seek time is desired and the corpus is dynamic


Question 5 (Correct, Mark 1.00 out of 1.00)
Long documents can have higher ranks because of their size, if no normalization is adopted by
the IR system.

Select one:
True ✓
False

Question 6 (Correct, Mark 1.00 out of 1.00)
Which is a good idea for using skip pointers?

Select one:
a. Fewer skips, larger skip spans
b. None
c. Depends upon the number of comparisons needed ✓
d. More skips, shorter skip spans
(a sketch of intersection with skip pointers follows below)
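Skip pointers let the merge jump over stretches of one list at the cost of extra pointer checks, so their benefit depends on how many comparisons they actually save. A minimal sketch with evenly spaced skips (a common heuristic is roughly sqrt(L) skips for a list of length L) is shown below.

import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists using evenly spaced skip pointers."""
    s1 = max(1, int(math.sqrt(len(p1))))   # skip span on p1
    s2 = max(1, int(math.sqrt(len(p2))))   # skip span on p2
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # follow skip pointers on p1 while they do not overshoot p2[j]
            if i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                while i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                    i += s1
            else:
                i += 1
        else:
            # symmetric case: follow skip pointers on p2
            if j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                while j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                    j += s2
            else:
                j += 1
    return answer

print(intersect_with_skips(list(range(0, 200, 2)), [3, 5, 89, 95, 97, 99, 100, 180]))
# [100, 180]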



Question 3 (Correct, Mark 1.00 out of 1.00)
What is a type of Tokenizer that can be found in Apache Solr?

Select one:
a. Porter Stem
b. Lower Case ✓
c. Phonetic
d. Synonym

Question 4 (Partially correct, Mark 0.67 out of 1.00)
What are the filters that can be found in Apache Solr?

Select one or more:
a. Edge N-Gram Filter ✓
b. Lower Case Filter ✓
c. Whitespace Filter ✗
d. Synonym Filter


Question 1 (Correct, Mark 1.00 out of 1.00)
What is the correct query optimization step while executing the intersection of two postings?

Select one:
a. Process in the order of increasing document frequency ✓
b. Process in any order
c. Process in the order of decreasing document frequency
d. None of the above

Question 2 (Correct, Mark 1.00 out of 1.00)
A data structure that maps terms back to the parts of a document in which they occur is called
a/an:

Select one:
a. Postings list
b. Incidence Matrix
c. Dictionary
d. Inverted Index ✓
