
Artificial Intelligence & Data Mining

IE4483

WEN Bihan (Asst Prof)


Homepage: https://personal.ntu.edu.sg/bihan.wen/

1
Weekly Plan

• Week 6 - Unsupervised Learning - Clustering & Regression

• Week 11 - Regularization and Optimization for Deep Models

• Week 12 - Bayesian Reasoning & Dimensionality Reduction

• Week 13 - Low-Dimensionality - NOT Examined

2
Mini Project

3
Mini Project

• Submission is due on Nov 15 (Friday), 11:59pm

• Work in groups of THREE to prepare one report.

• Clearly specify the team members.


• State the respective contribution of each member to the project.

• Grouping

4
Mini Project

• Option 1: Sentiments of Product Reviews

• Application type:
Natural Language Processing

• Training data:

• Given user reviews of products, as raw text, with sentiment labels


• 0 represents negative; 1 represents positive

• Testing data:

• Predict the sentiment: binary classification

5
Mini Project

• Option 2: Dogs vs. Cats

• Application type: Computer Vision

• Training data:
• Images in dog / cat folders:
|— dog
|— cat

• Testing data:

• Predict the image type: binary classification

6
Mini Project

• Submit your report and the “submission.csv” file with your results.

• Fill in your predicted results in the “submission.csv” file.

• If you couldn’t obtain any meaningful results, describe what you have done, with screenshots or code.

• Deadline: Nov 15 (Friday), 11:59pm

7
Clustering

8
Outline

• Concept of Clustering

• Distance Metrics

• K-Means

• Hierarchical Agglomerative Clustering (HAC)

• Examples

9
Carry-on Questions

• What is a cluster? What is clustering?

• What is the difference between clustering and classification?

• What are the limitations of K-Means algorithm?

• What are the limitations of HAC algorithm?

10
Recap: Classification

• The computer recognizes whether there is a cat in the image:

11
Recap: Classification

• The computer recognizes whether there is a cat in the image:

• A training process is needed to teach the computer.


• Images labeled as Cat.
12
From classification to clustering

• What if we do NOT have the labels? Which pixels form the flower?

• Intrinsic data structure / similarity within the same group.

13
From classification to clustering

• Clustering is (typically) unsupervised learning:

• Unsupervised Learning: self-organized learning that helps find unknown patterns in a data set without pre-existing labels.

• Clustering: Given a collection of data samples, the goal is to group / organize the data such that the data in the same group are more similar to each other than to those in other groups.

• Cluster: a set of data which are similar to each other.

14
From classification to clustering

• Unsupervised Learning: self-organized learning that helps find unknown patterns in a data set without pre-existing labels.

No labels or class information provided

15
From classification to clustering

• Clustering: Given a collection of data samples, the goal is to group / organize the data such that the data in the same group are more similar to each other than to those in other groups.

Background Pixels
Flower Pixels
16
From classification to clustering

• Clustering: Given a collection of data samples, the goal is to group / organize the data such that the data in the same group are more similar to each other than to those in other groups.

17
From classification to clustering

• Cluster: a set of data which are similar to each other.

Cluster 2

Cluster 1

18
From classification to clustering

• Classification is supervised:

• Class labels are provided in training.

• Learn a classifier to predict the class labels of unseen data.

• Clustering is unsupervised:

• No pre-existing label is given.

• Understand the structure / organization of your underlying data.

19
Clustering

• Clustering:

• Unsupervised Method.

• Basic Idea: group together the similar data points.

• Input: a group of data points, without any training label.

• Output: the “membership” of each data point.

• How do we define “similarity” here?

20
Distance Measures / Metrics

• Given a set of N data samples / points x_1, x_2, …, x_i, …, x_N that we would like to cluster.

• Each data sample is assumed to be a d-dimensional vector that we write as a column vector:

  x = [x_1, …, x_d]^T

• We define the distance between any two data samples x_i and x_j as

  d(x_i, x_j) → R (a real scalar)

21
Distance Measures / Metrics

• We define the distance between any two data samples x_i and x_j as
  d(x_i, x_j) → R (a real scalar)

• A distance metric is a function (R^d × R^d) → R that satisfies:

1. d(x_i, x_j) ≥ 0, and d(x_i, x_j) = 0 if and only if x_i = x_j.   (non-negativity)
22
Distance Measures / Metrics

• We define the distance between any two data samples x_i and x_j as
  d(x_i, x_j) → R (a real scalar)

• A distance metric is a function (R^d × R^d) → R that satisfies:

1. d(x_i, x_j) ≥ 0, and d(x_i, x_j) = 0 if and only if x_i = x_j.   (non-negativity)

2. d(x_i, x_j) + d(x_j, x_l) ≥ d(x_i, x_l)   (triangle inequality)
23
Distance Measures / Metrics

• We define the distance between any two data samples x_i and x_j as
  d(x_i, x_j) → R (a real scalar)

• A distance metric is a function (R^d × R^d) → R that satisfies:

1. d(x_i, x_j) ≥ 0, and d(x_i, x_j) = 0 if and only if x_i = x_j.   (non-negativity)

2. d(x_i, x_j) + d(x_j, x_l) ≥ d(x_i, x_l)   (triangle inequality)

3. d(x_i, x_j) = d(x_j, x_i)   (symmetry)
24
Distance Measures / Metrics

• Example of distances:

1. Euclidean Distance:

  d(x, y) = sqrt( Σ_{j=1}^{d} (x_j − y_j)² ) = ‖x − y‖₂

• We denote ‖x − y‖₂ as the ℓ₂-norm of (x − y).
25
Distance Measures / Metrics

• Example of distances:

1. Euclidean Distance:

  d(x, y) = sqrt( Σ_{j=1}^{d} (x_j − y_j)² ) = ‖x − y‖₂

2. Manhattan Distance:

  d(x, y) = Σ_{j=1}^{d} |x_j − y_j| = ‖x − y‖₁

26
Distance Measures / Metrics

• Example of distances:

1. Euclidean Distance:

  d(x, y) = sqrt( Σ_{j=1}^{d} (x_j − y_j)² ) = ‖x − y‖₂

2. Manhattan Distance:

  d(x, y) = Σ_{j=1}^{d} |x_j − y_j| = ‖x − y‖₁

3. Infinity (Sup) Distance:

  d(x, y) = max_{1≤j≤d} |x_j − y_j|
27
Distance Measures / Metrics

• Example of distances:
1. Euclidean Distance: d(x, y) = ‖x − y‖₂
2. Manhattan Distance: d(x, y) = ‖x − y‖₁
3. Infinity (Sup) Distance: d(x, y) = max_{1≤j≤d} |x_j − y_j|

28
Distance Measures / Metrics

• Example of distances (Euclidean distance is the one used in this course):

1. Euclidean Distance: d(x, y) = ‖x − y‖₂
2. Manhattan Distance: d(x, y) = ‖x − y‖₁
3. Infinity (Sup) Distance: d(x, y) = max_{1≤j≤d} |x_j − y_j|

29
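As a quick illustration, here is a minimal NumPy sketch of the three distances (function names and the example vectors are illustrative, not from the slides):

```python
import numpy as np

def euclidean(x, y):
    # l2-norm of (x - y): square root of the sum of squared differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # l1-norm of (x - y): sum of absolute differences
    return np.sum(np.abs(x - y))

def sup_distance(x, y):
    # infinity (sup) norm: largest absolute coordinate difference
    return np.max(np.abs(x - y))

x = np.array([2.0, 10.0])
y = np.array([5.0, 8.0])
print(euclidean(x, y), manhattan(x, y), sup_distance(x, y))  # ~3.61, 5.0, 3.0
```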
Clustering Algorithms

• Partition Algorithms

• K-Means

• Mixture of Gaussian

• Spectral Clustering

• Hierarchical Algorithms

• Agglomerative

• Divisive

30
Clustering Algorithms

• Partition Algorithm

Cluster 1

Cluster 2

 Cluster the data samples into non-overlapping subsets (clusters).

Each data sample is in exactly one cluster.

31
Clustering Algorithms

• Hierarchical Algorithm

 A set of nested clusters organized as a hierarchical tree.

32
Clustering Algorithms

• Partition Algorithms

• K-Means

• Mixture of Gaussian

• Spectral Clustering

• Hierarchical Algorithms

• Agglomerative

• Divisive

33
Clustering Algorithms

• Partition Algorithms

• K-Means

• Mixture of Gaussian

• Spectral Clustering

• Hierarchical Algorithms

• Agglomerative

• Divisive

34
K-Means

• Given a set of d-dimensional data points x_1, x_2, …, x_i, …, x_N, and a distance metric d(x, y)

• Clustering Goal:

1. Split the data points x_1, x_2, …, x_i, …, x_N into K clusters.

2. Each cluster has a d-dimensional centroid / center μ_k.

3. The sum of distances between each x_i and its centroid μ_k is minimized.

35
K-Means

• Distance metric: Typically, we use Euclidean distance

• Cluster Center / Centroid: μ_k = the average of the data points belonging to this cluster.

• Split the data points:


Each data point belongs to only one cluster.

36
K-Means

• Initialize: pick K random points as the cluster centers μ_k.

• Iterate between the following step 1 and 2:

1. Assign every data point x_i to its closest cluster center, according to the given distance metric, i.e., find the μ_k such that d(x_i, μ_k) is minimized.

2. Update the cluster center μ_k to be the average of its assigned data points.

• Stopping Criterion: when no points’ assignments change.
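A minimal NumPy sketch of this procedure (illustrative only; the function and parameter names are mine, Euclidean distance is assumed, and no cluster is assumed to become empty):

```python
import numpy as np

def kmeans(X, K, init_idx, max_iter=100):
    """X: (N, d) data matrix; init_idx: indices of the K initial centers."""
    centers = X[init_idx].astype(float)
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 1: assign each point to its closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (N, K)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):   # stopping criterion: no change
            break
        assign = new_assign
        # Step 2: move each center to the average of its assigned points
        for k in range(K):
            centers[k] = X[assign == k].mean(axis=0)
    return centers, assign
```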

37
K-Means Example 1

• Suppose our task is to cluster the following eight points in 2D space


into K = 3 clusters:
A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9).

• The distance function is Euclidean distance.

• Suppose initially we assign A1, B1 and C1 as the center of each


cluster, respectively.

• Apply K-means to estimate the final three clusters.

38
K-Means Example 1

• A1=(2, 10); A2=(2,5); A3=(8,4); B1=(5,8); B2=(7,5); B3=(6,4); C1=(1,2); C2=(4,9)

39
K-Means Example 1

• Given the three initial cluster centers A1, B1, and C1.
• Step 1: Determine which data point belongs to which cluster by calculating their distances to the centers (the highlighted columns of the distance table shown on the slide).

40
K-Means Example 1

• Cluster 1={A1}, Cluster 2={B1, B2, B3, A3, C2}, Cluster 3={A2, C1}

41
K-Means Example 1

• Cluster 1={A1}, Cluster 2={B1, B2, B3, A3, C2}, Cluster 3={A2, C1}

• Step 2: The cluster centers after the first round of iteration can be obtained by computing the mean of all the data points belonging to each cluster:
C1 = (2, 10); C2 = (6, 6); C3 = (1.5, 3.5)

42
K-Means Example 1

• Step 2: The cluster centers after the first round of iteration can be obtained by computing the mean of all the data points belonging to each cluster:
C1 = (2, 10); C2 = (6, 6); C3 = (1.5, 3.5)

• Repeat Step 1: Determine which data point belongs to which cluster,


by calculating their distances to the new centers C1, C2 and C3.

• ……

• Stopping Criterion: when no points’ assignments change
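The full run can be reproduced with the illustrative `kmeans` sketch given after the algorithm slide; with this initialization the assignments should stop changing after a few more rounds, ending with clusters {A1, B1, C2}, {A3, B2, B3} and {A2, C1}:

```python
import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
names = ["A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2"]

# Initial centers: A1 (index 0), B1 (index 3), C1 (index 6)
centers, assign = kmeans(points, K=3, init_idx=[0, 3, 6])

for k in range(3):
    print(k, [n for n, a in zip(names, assign) if a == k], centers[k])
```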

43
K-Means Example 2
(Figure: data samples with two centroids, Centroid 1 and Centroid 2)

44
K-Means Example 3

45
K-Means Example 3

• Cluster the pixels’ gray-scale intensity: a 1-dimensional feature

(Figure: segmentation results with K=2 and K=3)
46
K-Means Example 3

• Cluster the pixels’ RGB color: a 3-dimensional feature

(Figure: segmentation result with K=3)
47
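As an illustrative sketch (not the exact code behind these figures), pixel colors can be clustered with the `kmeans` function sketched earlier by treating each pixel's RGB value as a 3-dimensional feature; `photo.png` is a placeholder filename:

```python
import numpy as np
from PIL import Image

img = np.asarray(Image.open("photo.png").convert("RGB"), dtype=float)  # (H, W, 3)
H, W, _ = img.shape
pixels = img.reshape(-1, 3)                      # one 3-D feature per pixel

K = 3
rng = np.random.default_rng(0)
centers, assign = kmeans(pixels, K, init_idx=rng.choice(len(pixels), K, replace=False))

# Replace every pixel by its cluster center to visualize the segmentation
segmented = centers[assign].reshape(H, W, 3).astype(np.uint8)
Image.fromarray(segmented).save("segmented.png")
```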
K-Means

• What cost function does K-Means optimize?

  min_{μ_k} min_{r_ik}  (1/2) Σ_{i=1}^{N} Σ_{k=1}^{K} r_ik ‖x_i − μ_k‖₂²

• Calculation of the centers: μ_k = ( Σ_{i=1}^{N} r_ik x_i ) / ( Σ_{i=1}^{N} r_ik )   ∀k

• Constraint on the assignment: r_ik ∈ {0, 1}   ∀i, k

• Normalization: Σ_{k=1}^{K} r_ik = 1   ∀i
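For concreteness, a small illustrative sketch that evaluates this cost when the assignment is stored as one cluster index per point (i.e., r_ik = 1 exactly when assign[i] == k):

```python
import numpy as np

def kmeans_cost(X, centers, assign):
    """0.5 * sum of squared Euclidean distances from each point to its assigned center.

    X: (N, d) data, centers: (K, d), assign: (N,) integer cluster indices."""
    diffs = X - centers[assign]      # x_i - mu_{k(i)} for every i
    return 0.5 * np.sum(diffs ** 2)
```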

48
K-Means: Is the algorithm good?

• Pros:

• Simple but effective

• Easy to implement

• Cons:

• Need to choose K.

• Can get stuck at a poor local minimum

• Need an appropriate distance metric

49
K-Means

• Different initializations -> different local minima

50
K-Means

• Poor local minimum

51
K-Means

• Good Initialization

Centroid
52
K-Means

• Good Initialization

53
K-Means

• Bad Initialization

Centroid
54
K-Means

• Bad Initialization

55
K-Means

• Need a better metric

(Left: not linearly separable in the original space. Right: a good metric space for clustering.)

56
Hierarchical Agglomerative Algorithm (HAC)

• Hierarchical Agglomerative Algorithm ( HAC )

• Start with the points as individual clusters

• At each step, merge the closest pair of clusters, until only one cluster (or K clusters) is left. K is a given number.

• How to merge?

• Merge the pair of clusters with the minimum distance.

57
HAC

• Hierarchical Agglomerative Algorithm ( HAC )

Initialization: Each object is a cluster.

Iteration: Merge the two clusters with the minimum distance.

Stopping Criteria: All objects are merged into a single cluster, or only K clusters are left.

(Figure: merging sequence from 5 clusters → 4 → 3 → 2 → 1 cluster)
58
HAC

• HAC can be visualized as a Dendrogram

• A tree-like diagram that records the sequences of merges.

(Figure: dendrogram cut at the level that gives 2 clusters, Cluster 1 and Cluster 2)

59
HAC

• Advantages of HAC

• Do not have to assume / pre-define the number of clusters.

• Any clustering result with the desired number of clusters K can be obtained by “cutting” the dendrogram at the corresponding level.

• The result is independent of the initialization.

60
HAC

1. How to Define the distance between 2 clusters?

Distance?

61
HAC

1. How to Define the distance between 2 clusters?

• MIN / Single Linkage: the minimum distance between any pair of


two data samples from each cluster.

62
HAC

1. How to Define the distance between 2 clusters?

• MAX / Complete Linkage: the maximum distance between any pair


of two data samples from each cluster.

63
HAC

1. How to Define the distance between 2 clusters?

• Average Linkage: the average distance between all pairs of two data
samples from each cluster.

64
HAC

1. How to Define the distance between 2 clusters?

• Centroid Distance: the distance between the means of data samples


(i.e., centroids) from each cluster.

65
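A minimal illustrative sketch of these four cluster-distance definitions, each operating on two arrays of points (one row per sample):

```python
import numpy as np

def pairwise(A, B):
    # all Euclidean distances between points of cluster A and cluster B
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_linkage(A, B):     # MIN: closest pair of points
    return pairwise(A, B).min()

def complete_linkage(A, B):   # MAX: farthest pair of points
    return pairwise(A, B).max()

def average_linkage(A, B):    # average over all pairs of points
    return pairwise(A, B).mean()

def centroid_distance(A, B):  # distance between the cluster means
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
```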
HAC

2. How to determine the pair of clusters with minimum distance?

• Visualize the samples in the space.

66
HAC

2. How to determine the pair of clusters with minimum distance?

• Visualize the samples in the space.

• Merge the two clusters (points)


that are closest to each other.

67
HAC

2. How to determine the pair of clusters with minimum distance?

• Visualize the samples in the space.

• Merge the two clusters (points)


that are closest to each other.

• Merge the next closest clusters.

68
HAC

2. How to determine the pair of clusters with minimum distance?

• Visualize the samples in the space

• Merge the two clusters (points)


that are closest to each other

• Merge the next closest clusters.

• Then the next closest…

69
HAC

2. How to determine the pair of clusters with minimum distance?

• Visualize the samples in the space

• Merge the two clusters (points)


that are closest to each other

• Merge the next closest clusters.

• Then the next closest…

• Until only one cluster left.


Or the given K clusters left.

70
HAC

2. How to determine the pair of clusters with minimum distance?

• Equivalently, we have the resulting dendrogram showing the HAC process

• The y-axis of the dendrogram shows the distance between clusters when merging at each step.

71
Example: HAC
points #1 #2 #3 #4
x 1.9 1.8 2.3 2.3
y 1.0 0.9 1.6 2.1

What distance metric do we use between clusters?

1. Single Linkage (MIN distance)


2. Complete Linkage (MAX distance)
3. Centroid Distance (distance between the centers)
4. Average Linkage (average over all pairs of points from two clusters)

72
Example: HAC – Centroid distance
points #1 #2 #3 #4
x 1.9 1.8 2.3 2.3
y 1.0 0.9 1.6 2.1

Distance = Centroid Distance


      #1     #2     #3     #4
#1    0      0.14   0.72   1.17
#2    0.14   0      0.86   1.3
#3    0.72   0.86   0      0.5
#4    1.17   1.3    0.5    0
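This table can be reproduced with a short NumPy sketch (rounding to two decimals as on the slide):

```python
import numpy as np

pts = np.array([[1.9, 1.0], [1.8, 0.9], [2.3, 1.6], [2.3, 2.1]])
# For singleton clusters the centroid distance is simply the Euclidean distance
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
print(np.round(D, 2))
```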

73
Example: HAC – Centroid distance
points #1 #2 #3 #4
x 1.9 1.8 2.3 2.3
y 1.0 0.9 1.6 2.1

Distance = Centroid Distance


#1 #2 #3 #4

#1 0
#2 0.14 0
#3 0.72 0.86 0
#4 1.17 1.3 0.5 0

74
Example: HAC
points #1 #2 #3 #4
x 1.9 1.8 2.3 2.3
y 1.0 0.9 1.6 2.1

Distance between #1 and #2 = 0.14 - Smallest

• Calculate the new centroid for (#1 and #2):

Cluster        #1+#2   #3    #4
Centroid (x)   1.85    2.3   2.3
Centroid (y)   0.95    1.6   2.1

75
Example: HAC
Cluster        #1+#2   #3    #4
Centroid (x)   1.85    2.3   2.3
Centroid (y)   0.95    1.6   2.1

Update the distance table


#1 + #2 #3 #4

#1 + #2 0
#3 0.79 0
#4 1.23 0.5 0

76
Example: HAC
Cluster        #1+#2   #3    #4
Centroid (x)   1.85    2.3   2.3
Centroid (y)   0.95    1.6   2.1

Distance      #1+#2   #3    #4
#1+#2          0
#3             0.79    0
#4             1.23    0.5   0

Next closest clusters to merge?

• Distance between #3 and #4 = 0.5

Cluster        #1+#2   #3+#4
Centroid (x)   1.85    2.3
Centroid (y)   0.95    1.85

77
Example: HAC
Cluster        #1+#2   #3+#4
Centroid (x)   1.85    2.3
Centroid (y)   0.95    1.85

Merge into 1 cluster

Cluster        #1+#2+#3+#4
Centroid (x)   2.075
Centroid (y)   1.4

78
Example: HAC – Average linkage
points #1 #2 #3 #4
x 1.9 1.8 2.3 2.3
y 1.0 0.9 1.6 2.1

Distance = Average Linkage


#1 #2 #3 #4

#1 0
#2 0.14 0
#3 0.72 0.86 0
#4 1.17 1.3 0.5 0

79
Example: HAC

      #1     #2     #3     #4
#1    0
#2    0.14   0
#3    0.72   0.86   0
#4    1.17   1.3    0.5    0

Merge #1 and #2, and update the distance table:

• D(1+2, 3) = 0.5 * D(1,3) + 0.5 * D(2,3) = 0.36 + 0.43 = 0.79


• D(1+2, 4) = 0.5 * D(1,4) + 0.5 * D(2,4) = 0.59 + 0.65 = 1.24

#1 + #2 #3 #4

#1 + #2 0
#3 0.79 0
#4 1.24 0.5 0
80
Example: HAC

          #1+#2   #3    #4
#1+#2     0
#3        0.79    0
#4        1.24    0.5   0

Merge #3 and #4 next, and update the distance table:

• D(1+2, 3+4) = 0.5 * D(1+2,3) + 0.5 * D(1+2,4) = 0.40 + 0.62 = 1.02

#1 + #2 #3 + #4

#1 + #2 0
#3 + #4 1.02 0

81
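A compact NumPy sketch of HAC (illustrative; it recomputes the cluster distances after every merge, which is fine for small examples). With the centroid linkage defined below, or the average linkage from the earlier sketch, it reproduces the merge order of the example above: #1+#2 first, then #3+#4, then everything:

```python
import numpy as np

def hac(points, linkage, K=1):
    """Merge clusters until K remain; linkage(A, B) gives the distance between two clusters."""
    clusters = [[i] for i in range(len(points))]           # start: every point is a cluster
    while len(clusters) > K:
        best = None
        for a in range(len(clusters)):                      # find the closest pair of clusters
            for b in range(a + 1, len(clusters)):
                d = linkage(points[clusters[a]], points[clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        print(f"merge {clusters[a]} + {clusters[b]} at distance {d:.2f}")
        clusters[a] = clusters[a] + clusters[b]             # merge the closest pair ...
        del clusters[b]                                     # ... and drop the absorbed cluster
    return clusters

def centroid_distance(A, B):
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

pts = np.array([[1.9, 1.0], [1.8, 0.9], [2.3, 1.6], [2.3, 2.1]])
hac(pts, centroid_distance)   # indices are 0-based: {0,1}, then {2,3}, then all four
```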
K-Means vs. HAC

• K-Means

✓ Simple and cheap algorithm

✘ Results are sensitive to the initialization


✘ Number of clusters needs to be pre-defined

• HAC

✓ Deterministic algorithm, i.e., no randomness.


✓ Shows a range of clustering results for different choices of K.

✘ More memory- and computationally-intensive than K-Means

82
What we learn

• What is clustering?

• What is distance metric?

o Euclidean distance, centroid distance, etc.

• K-Means

o Goal, algorithm, optimization, examples

• HAC

o Algorithm, dendrogram, cluster selection, examples


83
Carry-on Questions
• What is a cluster? What is clustering?

• Cluster: a set of data which are similar to each other.


• Clustering: group / organize the data such that the data in the same group are
more similar to each other, than to those in other groups.

• What is the difference between clustering and classification?

• Clustering is unsupervised, while classification is supervised.

• What are the limitations of the K-Means algorithm?

• Need to choose K. Can get stuck at a poor local minimum. Need a good metric.

• What are the limitations of HAC algorithm?

• Memory- and computationally-intensive.

84
Regression

85
Outline

• Concept of Regression

• Linear Regression

• Examples

• Derivation of Linear Regression (Not Examined)

86
Carry-on Questions

• What is the difference between regression and classification?

• What is the loss function for training linear regressor?

87
Recap: Classification

• The computer decides whether there is a cat or not:

Input Data → Discrete Labels
88
From Classification to Regression

• The computer predicts the likelihood that there is a cat: e.g., 0.9213

Input Data → Continuous quantity
89
From Classification to Regression

• Classification is to predict a discrete class label:

• It may output a continuous value, in the form of the probability for a discrete
class label.

• Accuracy = percentage of correctly classified examples out of all predictions.

• Regression is to predict a continuous quantity:

• It may predict a discrete value, in the form of an integer quantity.

• Performance is measured by the root mean squared error (RMSE), rather than accuracy.

90
Linear Regression

• Regressor: predict y ∈ R (output, a scalar) from x ∈ R^d (data point, a vector or scalar)

• Linear regression: y can be determined by w^T x + b

• w ∈ R^d and w^T denotes its transpose (a row vector).

• Training: to find the best w, based on training data.

• The training dataset contains N data points x_1, …, x_i, …, x_N, and their ground truth y_1, …, y_i, …, y_N.

91
Linear Regression

• Training: to find the best w, based on training data.

• Given a set of N data points x_1, …, x_i, …, x_N, and their y_1, …, y_i, …, y_N.

• Find f_{w,b}(x) = w^T x + b that minimizes the l2 loss

  min_{w,b} L̂(f_{w,b}) = min_{w,b} (1/N) Σ_{i=1}^{N} (w^T x_i + b − y_i)²

• Loss function: mean squared error between w^T x_i + b and y_i.

92
Linear Regression - Example

• Consider a simple 1-dimension data regression:

• Plot of the data points on the x–y plane:

93
Linear Regression - Example

• Consider a simple 1-dimension data regression:

• The predicted w (line) and the squared errors:

Minimizing the vertical offsets between w^T x_i + b and y_i!

94
Linear Regression – Derivation

• Regressor: predict y ∈ R (response) from x ∈ R^d (data point).

• Linear regression: y can be determined by w^T x + b

• w ∈ R^d and w^T denotes its transpose (a row vector).

  w^T x + b = [w^T | b] [x ; 1]

• To simplify the notation,
  w^T ← [w^T | b],   x ← [x ; 1]

• The new vectors: x ∈ R^{d+1} and w ∈ R^{d+1}

• Thus, y can be fully determined by w^T x.

95
Linear Regression - Derivation

• Write the problem in matrix form:

  L̂(f_w) = (1/N) Σ_{i=1}^{N} (w^T x_i − y_i)² = (1/N) ‖Xw − y‖₂²

• Concatenate the rows: the i-th row of X is x_i^T, and the i-th entry of y is y_i.

• Matrix X ∈ R^{N×(d+1)}

• Vector y ∈ R^N

• Vector w ∈ R^{d+1}

96
Linear Regression - Derivation

• Write the problem in matrix form:

  L̂(f_w) = (1/N) Σ_{i=1}^{N} (w^T x_i − y_i)² = (1/N) ‖Xw − y‖₂²

• Find the gradient w.r.t. w (dropping the constant term y^T y):

  ∇_w ‖Xw − y‖₂² = ∇_w (Xw − y)^T (Xw − y)
                 = ∇_w ( w^T X^T X w − 2 w^T X^T y )
                 = 2 X^T X w − 2 X^T y

• Set the gradient to zero to get the minimizer:

  X^T X w = X^T y

  w = (X^T X)^{-1} X^T y

• Note: here we assume (X^T X)^{-1} exists unless otherwise specified.
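A minimal NumPy sketch of this closed-form fit (illustrative; it solves the normal equations with `np.linalg.solve` rather than forming an explicit inverse, and appends the column of ones from the earlier augmentation trick; the function name and example data are mine):

```python
import numpy as np

def fit_linear_regression(X_raw, y):
    """X_raw: (N, d) data matrix, y: (N,) targets. Returns the augmented w in R^(d+1)."""
    N = X_raw.shape[0]
    X = np.hstack([X_raw, np.ones((N, 1))])   # x <- [x; 1] so that b is absorbed into w
    # Solve the normal equations X^T X w = X^T y (assumes X^T X is invertible)
    w = np.linalg.solve(X.T @ X, X.T @ y)
    return w                                  # last entry of w plays the role of the bias b

# Tiny 1-D example: points roughly on y = 2x + 1
X_raw = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.1, 2.9, 5.2, 6.9])
print(fit_linear_regression(X_raw, y))        # approximately [1.97, 1.07]
```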


97
Linear Regression - Derivation

• As we are minimizing the mean squared error, we call this solution the least squares (LS) estimator:

  w = (X^T X)^{-1} X^T y

• Sometimes linear is not good enough…

98
Linear Regression - Derivation

• As we are minimizing the mean squared error, we call this solution the least squares (LS) estimator:

  w = (X^T X)^{-1} X^T y

• Polynomial fit?
• No, thanks….

• Solution:

• Replace x_i with a better feature ϕ(x_i)

• Feature Learning
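Whatever ϕ is (hand-crafted or learned), the same least-squares machinery applies to ϕ(x_i) instead of x_i. A small illustrative sketch, reusing the `fit_linear_regression` function from the sketch above with a simple quadratic feature map chosen only for illustration:

```python
import numpy as np

def phi(x):
    # Illustrative hand-crafted feature map for 1-D inputs: [x, x^2]
    return np.hstack([x, x ** 2])

X_raw = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0]])
y = np.array([4.1, 0.9, 0.2, 1.1, 3.9])          # roughly y = x^2

w = fit_linear_regression(phi(X_raw), y)          # fit on phi(x) instead of x
print(w)   # weights for [x, x^2, 1]; the x^2 weight should be close to 1
```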

99
Carry-on Questions

• What is the difference between regression and classification?

• Classification is to predict discrete class labels.

• Regression is to predict continuous quantities.

• What is the loss function for training linear regressor?

• Loss function: mean squared error between w^T x_i + b and y_i.

• Minimize the vertical offsets.

100
Thank you! Any questions?

101
