Part 2
Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
© Prentice Hall 2
Classification Outline
Goal: Provide an overview of the classification
problem and introduce some of the basic
algorithms
Classification Problem Overview
Classification Techniques
– Regression
– Distance
– Decision Trees
– Rules
– Neural Networks
© Prentice Hall 3
Classification Problem
Given a database D={t1,t2,…,tn} and a set
of classes C={C1,…,Cm}, the
Classification Problem is to define a
mapping f: D → C where each ti is assigned
to one class.
Actually divides D into equivalence
classes.
Prediction is similar, but may be viewed
as having infinite number of classes.
© Prentice Hall 4
Classification Examples
Teachers classify students’ grades as
A, B, C, D, or F.
Identify mushrooms as poisonous or
edible.
Predict when a river will flood.
Identify individuals with credit risks.
Speech recognition
Pattern recognition
© Prentice Hall 5
Classification Ex: Grading
If x >= 90 then grade = A.
If 80 <= x < 90 then grade = B.
If 70 <= x < 80 then grade = C.
If 60 <= x < 70 then grade = D.
If x < 60 then grade = F.
[Figure: the same rules drawn as a decision tree that repeatedly splits on x at 90, 80, 70, and 60.]
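As code, these rules are just a chain of comparisons; a minimal Python sketch (function name illustrative):

def grade(x):
    # Assign a letter grade from a numeric score, following the rules above.
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"

print(grade(85))  # B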
© Prentice Hall 6
Classification Ex: Letter
Recognition
View letters as constructed from 5 components:
[Figure: letters A through F, each drawn from the five components.]
© Prentice Hall 7
Classification Techniques
Approach:
1. Create specific model by evaluating
training data (or using domain
experts’ knowledge).
2. Apply model developed to new data.
Classes must be predefined
Most common techniques use DTs,
NNs, or are based on distances or
statistical methods.
© Prentice Hall 8
Defining Classes
Distance Based
Partitioning Based
© Prentice Hall 9
Issues in Classification
Missing Data
– Ignore
– Replace with assumed value
Measuring Performance
– Classification accuracy on test data
– Confusion matrix
– OC Curve
© Prentice Hall 10
Height Example Data
Name Gender Height Output1 Output2
Kristina F 1.6m Short Medium
Jim M 2m Tall Medium
Maggie F 1.9m Medium Tall
Martha F 1.88m Medium Tall
Stephanie F 1.7m Short Medium
Bob M 1.85m Medium Medium
Kathy F 1.6m Short Medium
Dave M 1.7m Short Medium
Worth M 2.2m Tall Tall
Steven M 2.1m Tall Tall
Debbie F 1.8m Medium Medium
Todd M 1.95m Medium Medium
Kim F 1.9m Medium Tall
Amy F 1.8m Medium Medium
Wynette F 1.75m Medium Medium
© Prentice Hall 11
Classification Performance
© Prentice Hall 12
Confusion Matrix Example
Actual membership   Assigned Short   Assigned Medium   Assigned Tall
Short               0                4                 0
Medium              0                5                 3
Tall                0                1                 2
© Prentice Hall 13
Operating Characteristic Curve
© Prentice Hall 14
Regression
Assume data fits a predefined function
Determine best values for regression
coefficients c0,c1,…,cn.
Assume an error: y = c0 + c1x1 + … + cnxn + ε
Estimate the error using the mean squared error over the
training set:
MSE = (1/n) Σi (yi − (c0 + c1xi1 + … + cnxin))²
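A minimal sketch of fitting such a function by least squares, assuming NumPy and an input matrix X of predictor values (names illustrative):

import numpy as np

def fit_linear_regression(X, y):
    # X: (n_samples, n_features) predictors, y: (n_samples,) targets.
    Xb = np.column_stack([np.ones(len(X)), X])       # prepend 1s so c0 is the intercept
    coeffs, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # best [c0, c1, ..., cn]
    mse = np.mean((Xb @ coeffs - y) ** 2)            # mean squared error on the training set
    return coeffs, mse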
© Prentice Hall 15
Linear Regression Poor Fit
© Prentice Hall 16
Classification Using Regression
Division: Use regression function to
divide area into regions.
Prediction: Use regression function to
predict a class membership function.
Input includes desired class.
© Prentice Hall 17
Division
© Prentice Hall 18
Prediction
© Prentice Hall 19
Classification Using Distance
Place items in class to which they are
“closest”.
Must determine distance between an item
and a class.
Classes represented by
– Centroid: Central value.
– Medoid: Representative point.
– Individual points
Algorithm: KNN
© Prentice Hall 20
K Nearest Neighbor (KNN):
Training set includes classes.
Examine K items near item to be
classified.
New item placed in the class with the greatest
number of close items.
O(q) for each tuple to be classified.
(Here q is the size of the training set.)
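A minimal KNN sketch, assuming the training set is a list of (tuple, class) pairs and dist is any distance function (names illustrative):

from collections import Counter

def knn_classify(train, item, k, dist):
    neighbors = sorted(train, key=lambda tc: dist(tc[0], item))[:k]  # K closest training items
    votes = Counter(c for _, c in neighbors)                         # tally their classes
    return votes.most_common(1)[0][0]                                # majority class wins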
© Prentice Hall 21
KNN
© Prentice Hall 22
KNN Algorithm
© Prentice Hall 23
Classification Using Decision
Trees
Partitioning based: Divide search
space into rectangular regions.
Tuple placed into class based on the
region within which it falls.
DT approaches differ in how the tree is
built: DT Induction
Internal nodes associated with attribute
and arcs with values for that attribute.
Algorithms: ID3, C4.5, CART
© Prentice Hall 24
Decision Tree
Given:
– D = {t1, …, tn} where ti=<ti1, …, tih>
– Database schema contains {A1, A2, …, Ah}
– Classes C={C1, …., Cm}
Decision or Classification Tree is a tree associated
with D such that
– Each internal node is labeled with an attribute, Ai
– Each arc is labeled with a predicate which can be
applied to the attribute at the parent
– Each leaf node is labeled with a class, Cj
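A minimal sketch of this structure, with the arc predicates stored on each internal node (class and function names illustrative):

class DTNode:
    def __init__(self, attribute=None, branches=None, label=None):
        self.attribute = attribute      # Ai tested at an internal node
        self.branches = branches or []  # list of (predicate, child) pairs, one per arc
        self.label = label              # class Cj at a leaf

def classify(node, t):
    # Walk from the root, following the arc whose predicate holds for t's attribute value.
    if node.label is not None:
        return node.label
    value = t[node.attribute]
    for predicate, child in node.branches:
        if predicate(value):
            return classify(child, t)
    return None  # no arc applies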
© Prentice Hall 25
DT Induction
© Prentice Hall 26
DT Splits Area
[Figure: the Gender/Height space split into rectangular regions, M vs. F on one axis and height ranges on the other.]
© Prentice Hall 27
Comparing DTs
Balanced
Deep
© Prentice Hall 28
DT Issues
Choosing Splitting Attributes
Ordering of Splitting Attributes
Splits
Tree Structure
Stopping Criteria
Training Data
Pruning
© Prentice Hall 29
Decision Tree Induction is often based on
Information Theory
© Prentice Hall 30
Information
© Prentice Hall 31
DT Induction
When all the marbles in the bowl are
mixed up, little information is given.
When the marbles in the bowl are all
from one class and those in the other
two classes are on either side, more
information is given.
© Prentice Hall 33
Entropy
© Prentice Hall 34
ID3
Creates tree using information theory
concepts and tries to reduce the expected
number of comparisons.
ID3 chooses the split attribute with the highest
information gain:
Gain(D, S) = H(D) − Σi P(Di) H(Di), where H is entropy and D1, …, Ds are the subsets of D produced by split S.
© Prentice Hall 35
ID3 Example (Output1)
Starting state entropy:
4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384
Gain using gender:
– Female: 3/9 log(9/3)+6/9 log(9/6)=0.2764
– Male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) =
0.4392
– Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) =
0.34152
– Gain: 0.4384 – 0.34152 = 0.09688
Gain using height:
0.4384 – (2/15)(0.301) = 0.3983
Choose height as first splitting attribute
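The same computation as code; the slides use base-10 logarithms, which is why the starting entropy comes out as 0.4384 (helper names illustrative):

import math

def entropy(counts):
    total = sum(counts)
    return sum((c / total) * math.log10(total / c) for c in counts if c > 0)

def gain(parent_counts, subsets):
    # subsets: one class-count list per value of the candidate split attribute
    total = sum(parent_counts)
    weighted = sum((sum(s) / total) * entropy(s) for s in subsets)
    return entropy(parent_counts) - weighted

print(entropy([4, 8, 3]))                       # starting state: ~0.4384
print(gain([4, 8, 3], [[3, 6, 0], [1, 2, 3]]))  # split on gender: ~0.0969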
© Prentice Hall 36
C4.5
ID3 favors attributes with a large number of
divisions
Improved version of ID3:
– Missing Data
– Continuous Data
– Pruning
– Rules
– GainRatio: GainRatio(D, S) = Gain(D, S) / H(|D1|/|D|, …, |Ds|/|D|)
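A self-contained sketch of GainRatio under the same base-10 convention as the ID3 sketch above (names illustrative):

import math

def entropy(counts):
    total = sum(counts)
    return sum((c / total) * math.log10(total / c) for c in counts if c > 0)

def gain_ratio(parent_counts, subsets):
    total = sum(parent_counts)
    g = entropy(parent_counts) - sum((sum(s) / total) * entropy(s) for s in subsets)
    split_info = entropy([sum(s) for s in subsets])  # H(|D1|/|D|, ..., |Ds|/|D|)
    return g / split_info if split_info > 0 else 0.0

print(gain_ratio([4, 8, 3], [[3, 6, 0], [1, 2, 3]]))  # gender split on Output1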
© Prentice Hall 37
CART
Create Binary Tree
Uses entropy
Formula to choose split point, s, for node t:
Φ(s/t) = 2 PL PR Σj |P(Cj | tL) − P(Cj | tR)|, where PL and PR are the fractions of tuples sent to the left and right subtrees and the sum runs over the classes.
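A minimal sketch of this measure, taking the per-class counts on each side of a candidate split (helper name illustrative):

def phi(left_counts, right_counts):
    # left_counts[j], right_counts[j]: number of class-j tuples sent left / right by split s
    nL, nR = sum(left_counts), sum(right_counts)
    if nL == 0 or nR == 0:
        return 0.0
    n = nL + nR
    spread = sum(abs(l / nL - r / nR) for l, r in zip(left_counts, right_counts))
    return 2 * (nL / n) * (nR / n) * spread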
© Prentice Hall 38
CART Example
At the start, there are six choices for
split point (right branch on equality):
– P(Gender)=2(6/15)(9/15)(2/15 + 4/15 + 3/15)=0.224
– P(1.6) = 0
– P(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169
– P(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385
– P(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256
– P(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32
Split at 1.8
© Prentice Hall 39
Classification Using Neural
Networks
Typical NN structure for classification:
– One output node per class
– Output value is class membership function value
Supervised learning
For each tuple in training set, propagate it
through NN. Adjust weights on edges to
improve future classification.
Algorithms: Propagation, Backpropagation,
Gradient Descent
© Prentice Hall 40
NN Issues
Number of source nodes
Number of hidden layers
Training data
Number of sinks
Interconnections
Weights
Activation Functions
Learning Technique
When to stop learning
© Prentice Hall 41
Decision Tree vs. Neural
Network
© Prentice Hall 42
Propagation
[Figure: a tuple's attribute values fed to the input nodes and propagated forward to the output.]
© Prentice Hall 43
NN Propagation Algorithm
© Prentice Hall 44
Example Propagation
© Prentice Hall 45
NN Learning
Adjust weights to perform better with the
associated test data.
Supervised: Use feedback from
knowledge of correct classification.
Unsupervised: No knowledge of
correct classification needed.
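A minimal sketch of one supervised adjustment for a single sigmoid node: nudge each weight against the error gradient (learning rate and names illustrative):

import math

def update_weights(weights, inputs, desired, rate=0.1):
    y = 1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(weights, inputs))))  # sigmoid output
    delta = (desired - y) * y * (1.0 - y)   # gradient term for squared error
    return [w + rate * delta * x for w, x in zip(weights, inputs)]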
© Prentice Hall 46
NN Supervised Learning
© Prentice Hall 47
Supervised Learning
Possible error values assuming output from
node i is yi but should be di:
© Prentice Hall 49
Backpropagation
Error
© Prentice Hall 50
Backpropagation Algorithm
© Prentice Hall 51
Gradient Descent
© Prentice Hall 52
Gradient Descent Algorithm
© Prentice Hall 53
Output Layer Learning
© Prentice Hall 54
Hidden Layer Learning
© Prentice Hall 55
Types of NNs
Different NN structures used for
different problems.
Perceptron
Self Organizing Feature Map
Radial Basis Function Network
© Prentice Hall 56
Perceptron
© Prentice Hall 57
Perceptron Example
Suppose:
– Summation: S=3x1+2x2-6
– Activation: if S>0 then 1 else 0
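The same perceptron as code:

def perceptron(x1, x2):
    S = 3 * x1 + 2 * x2 - 6        # summation
    return 1 if S > 0 else 0       # step activation

print(perceptron(1, 1))  # S = -1 -> 0
print(perceptron(2, 1))  # S =  2 -> 1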
© Prentice Hall 58
Self Organizing Feature Map
(SOFM)
Competitive Unsupervised Learning
Observe how neurons work in brain:
– Firing impacts firing of those near
– Neurons far apart inhibit each other
– Neurons have specific nonoverlapping
tasks
Ex: Kohonen Network
© Prentice Hall 59
Kohonen Network
© Prentice Hall 60
Kohonen Network
Competitive Layer – viewed as 2D grid
Similarity between competitive nodes and
input nodes:
– Input: X = <x1, …, xh>
– Weights: <w1i, … , whi>
– Similarity defined based on dot product
Competitive node most similar to input “wins”
Winning node weights (as well as surrounding
node weights) increased.
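A minimal sketch of one training step under these rules, assuming NumPy, a small 2D grid of weight vectors, and illustrative rate/radius values:

import numpy as np

def kohonen_step(weights, x, rate=0.1, radius=1):
    # weights: (rows, cols, h) grid of node weight vectors; x: length-h input tuple
    scores = np.tensordot(weights, x, axes=([2], [0]))           # dot-product similarity
    wr, wc = np.unravel_index(np.argmax(scores), scores.shape)   # winning node
    for r in range(weights.shape[0]):
        for c in range(weights.shape[1]):
            if abs(r - wr) <= radius and abs(c - wc) <= radius:  # winner and its neighbours
                weights[r, c] += rate * (x - weights[r, c])      # pull toward the input
    return weights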
© Prentice Hall 61
Radial Basis Function Network
RBF function has Gaussian shape
RBF Networks
– Three Layers
– Hidden layer – Gaussian activation
function
– Output layer – Linear activation function
© Prentice Hall 62
Radial Basis Function Network
© Prentice Hall 63
Classification Using Rules
Perform classification using If-Then
rules
Classification Rule: r = <a, c>, where a is the
antecedent and c is the consequent
May generate rules from other techniques
(DT, NN) or generate them directly.
Algorithms: Gen, RX, 1R, PRISM
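A minimal sketch of rules as <antecedent, consequent> pairs, with the antecedent a list of attribute tests (thresholds and names purely illustrative):

def rule_applies(antecedent, t):
    return all(pred(t[attr]) for attr, pred in antecedent)

def classify_with_rules(rules, t, default=None):
    for antecedent, consequent in rules:   # first matching rule wins
        if rule_applies(antecedent, t):
            return consequent
    return default

rules = [([("Height", lambda h: h >= 2.0)], "Tall"),
         ([("Height", lambda h: h < 1.7)], "Short")]
print(classify_with_rules(rules, {"Height": 1.95}, default="Medium"))  # Medium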
© Prentice Hall 64
Generating Rules from DTs
© Prentice Hall 65
Generating Rules Example
© Prentice Hall 66
Generating Rules from NNs
© Prentice Hall 67
1R Algorithm
© Prentice Hall 68
1R Example
© Prentice Hall 69
PRISM Algorithm
© Prentice Hall 70
PRISM Example
© Prentice Hall 71
Decision Tree vs. Rules
Decision Tree: has an implied order in which the splitting is performed, and the tree is created by looking at all classes.
Rules: have no ordering of predicates, and only one class need be examined to generate its rules.
© Prentice Hall 72
Clustering Outline
Goal: Provide an overview of the clustering
problem and introduce some of the basic
algorithms
© Prentice Hall 73
Clustering Examples
Segment customer database based on
similar buying patterns.
Group houses in a town into
neighborhoods based on similar
features.
Identify new plant species
Identify similar Web usage patterns
© Prentice Hall 74
Clustering Example
© Prentice Hall 75
Clustering Houses
[Figure: the same houses clustered two ways, Size Based vs. Geographic Distance Based.]
© Prentice Hall 76
Clustering vs. Classification
No prior knowledge
– Number of clusters
– Meaning of clusters
Unsupervised learning
© Prentice Hall 77
Clustering Issues
Outlier handling
Dynamic data
Interpreting results
Evaluating results
Number of clusters
Data to be used
Scalability
© Prentice Hall 78
Impact of Outliers on
Clustering
© Prentice Hall 79
Clustering Problem
Given a database D={t1,t2,…,tn} of tuples
and an integer value k, the Clustering
Problem is to define a mapping
f: D → {1, …, k} where each ti is assigned to
one cluster Kj, 1 <= j <= k.
A Cluster, Kj, contains precisely those
tuples mapped to it.
Unlike classification problem, clusters
are not known a priori.
© Prentice Hall 80
Types of Clustering
Hierarchical – Nested set of clusters
created.
Partitional – One set of clusters
created.
Incremental – Each element handled
one at a time.
Simultaneous – All elements handled
together.
Overlapping/Non-overlapping
© Prentice Hall 81
Clustering Approaches
© Prentice Hall 82
Cluster Parameters
© Prentice Hall 83
Distance Between Clusters
Single Link: smallest distance between points
Complete Link: largest distance between points
Average Link: average distance between points
Centroid: distance between centroids
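Minimal sketches of the four measures, for clusters given as lists of points and any point-to-point distance function dist (names illustrative):

def single_link(K1, K2, dist):
    return min(dist(a, b) for a in K1 for b in K2)

def complete_link(K1, K2, dist):
    return max(dist(a, b) for a in K1 for b in K2)

def average_link(K1, K2, dist):
    return sum(dist(a, b) for a in K1 for b in K2) / (len(K1) * len(K2))

def centroid_distance(K1, K2, dist):
    c1 = [sum(xs) / len(K1) for xs in zip(*K1)]  # component-wise centroid of K1
    c2 = [sum(xs) / len(K2) for xs in zip(*K2)]
    return dist(c1, c2)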
© Prentice Hall 84
Hierarchical Clustering
Clusters are created in levels, yielding a
set of clusters at each level.
Agglomerative
– Initially each item in its own cluster
– Iteratively clusters are merged together
– Bottom Up
Divisive
– Initially all items in one cluster
– Large clusters are successively divided
– Top Down
© Prentice Hall 85
Hierarchical Algorithms
Single Link
MST Single Link
Complete Link
Average Link
© Prentice Hall 86
Dendrogram
Dendrogram: a tree data
structure which illustrates
hierarchical clustering
techniques.
Each level shows clusters
for that level.
– Leaf – individual clusters
– Root – one cluster
A cluster at level i is the
union of its child clusters
at level i+1.
© Prentice Hall 87
Levels of Clustering
© Prentice Hall 88
Agglomerative Example
Distance matrix:
    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0
[Figure: the items A–E drawn as a graph, with the resulting dendrogram over A, B, C, D, E at distance thresholds 1 through 5.]
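The same example can be reproduced with SciPy (shown only for illustration): single-link agglomerative clustering on the distance matrix above.

from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = [[0, 1, 2, 2, 3],
     [1, 0, 2, 4, 3],
     [2, 2, 0, 1, 5],
     [2, 4, 1, 0, 3],
     [3, 3, 5, 3, 0]]
merges = linkage(squareform(D), method="single")  # each row: clusters merged, merge distance, new size
print(merges)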
© Prentice Hall 89
MST Example
    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0
[Figure: a minimum spanning tree over the items A–E.]
© Prentice Hall 90
Agglomerative Algorithm
© Prentice Hall 91
Single Link
View all items with links (distances)
between them.
Finds maximal connected components
in this graph.
Two clusters are merged if there is at
least one edge which connects them.
Uses threshold distances at each level.
Could be agglomerative or divisive.
© Prentice Hall 92
MST Single Link Algorithm
© Prentice Hall 93
Single Link Clustering
© Prentice Hall 94
Partitional Clustering
Nonhierarchical
Creates clusters in one step as opposed
to several steps.
Since only one set of clusters is output,
the user normally has to input the
desired number of clusters, k.
Usually deals with static sets.
© Prentice Hall 95
Partitional Algorithms
MST
Squared Error
K-Means
Nearest Neighbor
PAM
BEA
GA
© Prentice Hall 96
MST Algorithm
© Prentice Hall 97
Squared Error
Minimize the squared error between each tuple and the center of its cluster:
seK = Σj Σ(ti in Kj) ||ti − Cj||², where Cj is the center of cluster Kj
© Prentice Hall 98
Squared Error Algorithm
© Prentice Hall 99
K-Means
Initial set of clusters randomly chosen.
Iteratively, items are moved among sets
of clusters until the desired set is
reached.
High degree of similarity among
elements in a cluster is obtained.
Given a cluster Ki={ti1,ti2,…,tim}, the
cluster mean is mi = (1/m)(ti1 + … + tim)
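A minimal K-means sketch for one-dimensional items such as the heights above (iteration count and names illustrative):

import random

def kmeans(items, k, iterations=100):
    means = random.sample(items, k)                  # initial cluster means chosen at random
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for t in items:
            j = min(range(k), key=lambda i: abs(t - means[i]))  # nearest mean
            clusters[j].append(t)
        means = [sum(c) / len(c) if c else means[i]  # recompute mi = (1/m)(ti1 + ... + tim)
                 for i, c in enumerate(clusters)]
    return clusters, means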
Apriori Algorithm
1. C1 = Itemsets of size one in I;
2. Determine all large itemsets of size 1, L1;
3. i = 1;
4. Repeat
5. i = i + 1;
6. Ci = Apriori-Gen(Li-1);
7. Count Ci to determine Li;
8. until no more large itemsets found;
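A minimal sketch of this loop, with Apriori-Gen done as a join of Li-1 with itself followed by subset pruning; transactions are assumed to be sets of items (names illustrative):

from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)
    items = {i for t in transactions for i in t}
    L = [frozenset([i]) for i in items
         if sum(i in t for t in transactions) / n >= min_support]          # L1
    large = list(L)
    while L:
        C = {a | b for a in L for b in L if len(a | b) == len(a) + 1}      # join step
        C = {c for c in C                                                  # prune step
             if all(frozenset(s) in L for s in combinations(c, len(c) - 1))}
        L = [c for c in C
             if sum(c <= t for t in transactions) / n >= min_support]      # count Ci -> Li
        large.extend(L)
    return large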
© Prentice Hall 146
Sampling Algorithm
1. Ds = sample of Database D;
2. PL = Large itemsets in Ds using smalls;
3. C = PL ∪ BD⁻(PL);
4. Count C in Database using s;
5. ML = large itemsets in BD⁻(PL);
6. If ML = ∅ then done
7. else C = repeated application of BD⁻;
8. Count C in Database;
© Prentice Hall 147
Sampling Example
Find AR assuming s = 20%
Ds = { t1,t2}
Smalls = 10%
PL = {{Bread}, {Jelly}, {PeanutButter},
{Bread,Jelly}, {Bread,PeanutButter}, {Jelly,
PeanutButter}, {Bread,Jelly,PeanutButter}}
BD⁻(PL) = {{Beer}, {Milk}}
ML = {{Beer}, {Milk}}
Repeated application of BD⁻ generates all
remaining itemsets
Partitioning Example
Partition D2, with s = 10%:
L2 = {{Bread}, {Milk}, {PeanutButter}, {Bread,Milk}, {Bread,PeanutButter}, {Milk,PeanutButter}, {Bread,Milk,PeanutButter}, {Beer}, {Beer,Bread}, {Beer,Milk}}
© Prentice Hall 152
Partitioning Adv/Disadv
Advantages:
– Adapts to available main memory
– Easily parallelized
– Maximum number of database scans is
two.
Disadvantages:
– May have many candidates during second
scan.