
Artificial Intelligence & Data Mining

IE4483

WEN Bihan (Asst Prof)


Homepage: https://personal.ntu.edu.sg/bihan.wen/

1
Weekly Plan

• Week 6 - Unsupervised Learning - Clustering & Regression

• Week 11 - Regularization and Optimization for Deep Models

• Week 12 - Bayesian Reasoning & Dimensionality Reduction

• Week 13 - Low-Dimensionality - NOT Examined

2
Mini Project

3
Mini Project

• Submission is due on Nov 15 (Friday), 11:59pm

• Work in groups of THREE to prepare one report.

• Clearly specify the team members.


• State the respective contribution of each member to the project.

• Grouping

4
Mini Project

• Option 1: Sentiments of Product Reviews

• Application type:
Natural Language Processing

• Training data:

• Given user reviews of products, as raw text, with sentiment labels


• 0 represents negative; 1 represents positive

• Testing data:

• Predict the sentiment: binary classification

5
Mini Project

• Option 2: Dogs vs. Cats

• Application type: Computer Vision

• Training data:
• Images in dog / cat folders:
|— dog
|— cat

• Testing data:

• Predict the image type: binary classification

6
Mini Project

• Submit your report and the “submission.csv” file with your results.

• Fill in your predicted results in the “submission.csv” file.

• If you couldn’t obtain any meaningful results, describe what you have done, with screenshots or code.

• Deadline: Nov 15 (Friday), 11:59pm

7
Clustering

8
Outline

• Concept of Clustering

• Distance Metrics

• K-Means

• Hierarchical Agglomerative Clustering (HAC)

• Examples

9
Carry-on Questions

• What is a cluster? What is clustering?

• What is the difference between clustering and classification?

• What are the limitations of K-Means algorithm?

• What are the limitations of HAC algorithm?

10
Recap: Classification

• The computer recognizes whether there is a cat in the image:

11
Recap: Classification

• The computer recognizes whether there is a cat in the image:

• A training process is needed to teach the computer.


• Images labeled as Cat.
12
From classification to clustering

• What if we do NOT have the labels? Which pixels form the flower?

• Intrinsic data structure / similarity within the same group.

13
From classification to clustering

• Clustering is (typically) unsupervised learning:

• Unsupervised Learning: self-organized learning that helps find unknown patterns in a data set without pre-existing labels.

• Clustering: Given a collection of data samples, the goal is to group / organize the data such that the data in the same group are more similar to each other than to those in other groups.

• Cluster: a set of data which are similar to each other.

14
From classification to clustering

• Unsupervised Learning: self-organized learning that helps find unknown patterns in a data set without pre-existing labels.

No labels or class information provided

15
From classification to clustering

• Clustering: Given a collection of data samples, the goal is to group / organize the data such that the data in the same group are more similar to each other than to those in other groups.

Background Pixels
Flower Pixels
16
From classification to clustering

• Clustering: Given a collection of data samples, the goal is to group / organize the data such that the data in the same group are more similar to each other than to those in other groups.

17
From classification to clustering

• Cluster: a set of data which are similar to each other.

Cluster 2

Cluster 1

18
From classification to clustering

• Classification is supervised:

• Class labels are provided in training.

• Learn a classifier to predict the class labels of unseen data.

• Clustering is unsupervised:

• No pre-existing label is given.

• Understand the structure / organization of your underlying data.

19
Clustering

• Clustering:

• Unsupervised Method.

• Basic Idea: group together the similar data points.

• Input: a group of data points, without any training label.

• Output: the “membership” of each data point.

• How do we define “similarity” here?

20
Distance Measures / Metrics

• Given a set of N data samples / points x_1, x_2, …, x_i, …, x_N that we would like to cluster.

• Each data sample is assumed to be a d-dimensional vector that we write as a column vector:

  x = [x_1, …, x_d]^T

• We define the distance between any two data samples x_i and x_j as

  d(x_i, x_j) → R (a real scalar)

21
Distance Measures / Metrics

• We define the distance between any two data samples x_i and x_j as
  d(x_i, x_j) → R (a real scalar)

• A distance metric is a function (R^d × R^d) → R that satisfies:

1. d(x_i, x_j) ≥ 0, and d(x_i, x_j) = 0 if and only if x_i = x_j.   (non-negativity)
22
Distance Measures / Metrics

• We define the distance between any two data samples x_i and x_j as
  d(x_i, x_j) → R (a real scalar)

• A distance metric is a function (R^d × R^d) → R that satisfies:

1. d(x_i, x_j) ≥ 0, and d(x_i, x_j) = 0 if and only if x_i = x_j.   (non-negativity)

2. d(x_i, x_j) + d(x_j, x_l) ≥ d(x_i, x_l)   (triangle inequality)
23
Distance Measures / Metrics

• We define the distance between any two data samples x_i and x_j as
  d(x_i, x_j) → R (a real scalar)

• A distance metric is a function (R^d × R^d) → R that satisfies:

1. d(x_i, x_j) ≥ 0, and d(x_i, x_j) = 0 if and only if x_i = x_j.   (non-negativity)

2. d(x_i, x_j) + d(x_j, x_l) ≥ d(x_i, x_l)   (triangle inequality)

3. d(x_i, x_j) = d(x_j, x_i)   (symmetry)
24
Distance Measures / Metrics

• Example of distances:

1. Euclidean Distance:

  d(x, y) = sqrt( Σ_{j=1}^{d} (x_j − y_j)² ) = ‖x − y‖₂

• We denote ‖x − y‖₂ as the ℓ₂-norm of (x − y).
25
Distance Measures / Metrics

• Example of distances:

1. Euclidean Distance:

  d(x, y) = sqrt( Σ_{j=1}^{d} (x_j − y_j)² ) = ‖x − y‖₂

2. Manhattan Distance:

  d(x, y) = Σ_{j=1}^{d} |x_j − y_j| = ‖x − y‖₁

26
Distance Measures / Metrics

• Example of distances:

1. Euclidean Distance:

  d(x, y) = sqrt( Σ_{j=1}^{d} (x_j − y_j)² ) = ‖x − y‖₂

2. Manhattan Distance:

  d(x, y) = Σ_{j=1}^{d} |x_j − y_j| = ‖x − y‖₁

3. Infinity (Sup) Distance:

  d(x, y) = max_{1≤j≤d} |x_j − y_j|
27
Distance Measures / Metrics

• Example of distances:
1. Euclidean Distance: d(x, y) = ‖x − y‖₂
2. Manhattan Distance: d(x, y) = ‖x − y‖₁
3. Infinity (Sup) Distance: d(x, y) = max_{1≤j≤d} |x_j − y_j|

28
Distance Measures / Metrics

• Example of distances (Euclidean distance is the one used in this course):

1. Euclidean Distance: d(x, y) = ‖x − y‖₂
2. Manhattan Distance: d(x, y) = ‖x − y‖₁
3. Infinity (Sup) Distance: d(x, y) = max_{1≤j≤d} |x_j − y_j|

29
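As a quick illustration, here is a minimal NumPy sketch of the three distances (function names and the example vectors are illustrative, not from the slides):

```python
import numpy as np

def euclidean(x, y):
    # l2-norm of (x - y): square root of the sum of squared differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # l1-norm of (x - y): sum of absolute differences
    return np.sum(np.abs(x - y))

def sup_distance(x, y):
    # infinity (sup) norm: largest absolute coordinate difference
    return np.max(np.abs(x - y))

x = np.array([2.0, 10.0])
y = np.array([5.0, 8.0])
print(euclidean(x, y), manhattan(x, y), sup_distance(x, y))  # ~3.61, 5.0, 3.0
```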
Clustering Algorithms

• Partition Algorithms

• K-Means

• Mixture of Gaussian

• Spectral Clustering

• Hierarchical Algorithms

• Agglomerative

• Divisive

30
Clustering Algorithms

• Partition Algorithm

Cluster 1

Cluster 2

 Cluster the data samples into non-overlapping subsets (clusters).

Each data sample is in exactly one cluster.

31
Clustering Algorithms

• Hierarchical Algorithm

 A set of nested clusters organized as a hierarchical tree.

32
Clustering Algorithms

• Partition Algorithms

• K-Means

• Mixture of Gaussian

• Spectral Clustering

• Hierarchical Algorithms

• Agglomerative

• Divisive

33
Clustering Algorithms

• Partition Algorithms

• K-Means

• Mixture of Gaussian

• Spectral Clustering

• Hierarchical Algorithms

• Agglomerative

• Divisive

34
K-Means

• Given a set of d-dimensional data points x_1, x_2, …, x_i, …, x_N, and a distance metric d(x, y)

• Clustering Goal:

1. Split the data points x_1, x_2, …, x_i, …, x_N into K clusters.

2. Each cluster has a d-dimensional centroid / center μ_k.

3. The sum of distances between each x_i and its centroid μ_k is minimized.

35
K-Means

• Distance metric: Typically, we use Euclidean distance

• Cluster Center / Centroid: μ_k = the average of the data points belonging to this cluster.

• Split the data points:


Each data point belongs to only one cluster.

36
K-Means

• Initialize: pick K random points as the cluster centers μ_k.

• Iterate between the following step 1 and 2:

1. Assign every data point x_i to its closest cluster center, according to the given distance metric, i.e., find the μ_k such that d(x_i, μ_k) is minimized.

2. Update the cluster center μ_k to be the average of its assigned data points.

• Stopping Criterion: when no points’ assignments change.
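A minimal NumPy sketch of this procedure (illustrative only; the function and parameter names are mine, Euclidean distance is assumed, and no cluster is assumed to become empty):

```python
import numpy as np

def kmeans(X, K, init_idx, max_iter=100):
    """X: (N, d) data matrix; init_idx: indices of the K initial centers."""
    centers = X[init_idx].astype(float)
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 1: assign each point to its closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (N, K)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):   # stopping criterion: no change
            break
        assign = new_assign
        # Step 2: move each center to the average of its assigned points
        for k in range(K):
            centers[k] = X[assign == k].mean(axis=0)
    return centers, assign
```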

37
K-Means Example 1

• Suppose our task is to cluster the following eight points in 2D space


into K = 3 clusters:
A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9).

• The distance function is Euclidean distance.

• Suppose initially we assign A1, B1 and C1 as the center of each


cluster, respectively.

• Apply K-means to estimate the final three clusters.

38
K-Means Example 1

• A1=(2, 10); A2=(2,5); A3=(8,4); B1=(5,8); B2=(7,5); B3=(6,4); C1=(1,2); C2=(4,9)

39
K-Means Example 1

• Given the three initial cluster centers A1, B1, and C1.
• Step 1: Determine which data point belongs to which cluster by calculating their distances to the centers (the highlighted columns of the distance table shown on the slide).

40
K-Means Example 1

• Cluster 1={A1}, Cluster 2={B1, B2, B3, A3, C2}, Cluster 3={A2, C1}

41
K-Means Example 1

• Cluster 1={A1}, Cluster 2={B1, B2, B3, A3, C2}, Cluster 3={A2, C1}

• Step 2: The cluster centers after the first round of iteration can be obtained by computing the mean of all the data points belonging to each cluster:
C1 = (2, 10); C2 = (6, 6); C3 = (1.5, 3.5)

42
K-Means Example 1

• Step 2: The cluster centers after the first round of iteration can be obtained by computing the mean of all the data points belonging to each cluster:
C1 = (2, 10); C2 = (6, 6); C3 = (1.5, 3.5)

• Repeat Step 1: Determine which data point belongs to which cluster,


by calculating their distances to the new centers C1, C2 and C3.

• ……

• Stopping Criterion: when no points’ assignments change
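The full run can be reproduced with the illustrative `kmeans` sketch given after the algorithm slide; with this initialization the assignments should stop changing after a few more rounds, ending with clusters {A1, B1, C2}, {A3, B2, B3} and {A2, C1}:

```python
import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
names = ["A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2"]

# Initial centers: A1 (index 0), B1 (index 3), C1 (index 6)
centers, assign = kmeans(points, K=3, init_idx=[0, 3, 6])

for k in range(3):
    print(k, [n for n, a in zip(names, assign) if a == k], centers[k])
```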

43
K-Means Example 2
(Figure: data samples with two centroids, Centroid 1 and Centroid 2)

44
K-Means Example 3

45
K-Means Example 3

• Cluster the pixels’ gray-scale intensity: a 1-dimensional feature

(Figure: segmentation results with K=2 and K=3)
46
K-Means Example 3

• Cluster the pixels’ RGB color: a 3-dimensional feature

(Figure: segmentation result with K=3)
47
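As an illustrative sketch (not the exact code behind these figures), pixel colors can be clustered with the `kmeans` function sketched earlier by treating each pixel's RGB value as a 3-dimensional feature; `photo.png` is a placeholder filename:

```python
import numpy as np
from PIL import Image

img = np.asarray(Image.open("photo.png").convert("RGB"), dtype=float)  # (H, W, 3)
H, W, _ = img.shape
pixels = img.reshape(-1, 3)                      # one 3-D feature per pixel

K = 3
rng = np.random.default_rng(0)
centers, assign = kmeans(pixels, K, init_idx=rng.choice(len(pixels), K, replace=False))

# Replace every pixel by its cluster center to visualize the segmentation
segmented = centers[assign].reshape(H, W, 3).astype(np.uint8)
Image.fromarray(segmented).save("segmented.png")
```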
K-Means

• What cost function does K-Means optimize?

  min_{μ_k} min_{r_ik}  (1/2) Σ_{i=1}^{N} Σ_{k=1}^{K} r_ik ‖x_i − μ_k‖₂²

• Calculation of the centers: μ_k = ( Σ_{i=1}^{N} r_ik x_i ) / ( Σ_{i=1}^{N} r_ik )   ∀k

• Constraint on the assignment: r_ik ∈ {0, 1}   ∀i, k

• Normalization: Σ_{k=1}^{K} r_ik = 1   ∀i
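For concreteness, a small illustrative sketch that evaluates this cost when the assignment is stored as one cluster index per point (i.e., r_ik = 1 exactly when assign[i] == k):

```python
import numpy as np

def kmeans_cost(X, centers, assign):
    """0.5 * sum of squared Euclidean distances from each point to its assigned center.

    X: (N, d) data, centers: (K, d), assign: (N,) integer cluster indices."""
    diffs = X - centers[assign]      # x_i - mu_{k(i)} for every i
    return 0.5 * np.sum(diffs ** 2)
```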

48
K-Means: Is the algorithm good?

• Pros:

• Simple but effective

• Easy to implement

• Cons:

• Need to choose K.

• Can get stuck at a poor local minimum

• Need an appropriate distance metric

49
K-Means

• Different initializations -> different local minima

50
K-Means

• Poor local minimum

51
K-Means

• Good Initialization

Centroid
52
K-Means

• Good Initialization

53
K-Means

• Bad Initialization

Centroid
54
K-Means

• Bad Initialization

55
K-Means

• Need a better metric

(Left: not linearly separable in the original space. Right: a good metric space for clustering.)

56
Hierarchical Agglomerative Algorithm (HAC)

• Hierarchical Agglomerative Algorithm ( HAC )

• Start with the points as individual clusters

• At each step, merge the closest pair of clusters, until only one cluster (or K clusters) is left. K is a given number.

• How to merge?

• Merge the pair of clusters with the minimum distance.

57
HAC

• Hierarchical Agglomerative Algorithm ( HAC )

Initialization: Each object is a cluster.

Iteration: Merge the two clusters with the minimum distance.

Stopping Criteria: All objects are merged into a single cluster, or only K clusters are left.

(Figure: merging sequence from 5 clusters → 4 → 3 → 2 → 1 cluster)
58
HAC

• HAC can be visualized as a Dendrogram

• A tree-like diagram that records the sequences of merges.

(Figure: dendrogram cut at the level that gives 2 clusters, Cluster 1 and Cluster 2)

59
HAC

• Advantages of HAC

• Do not have to assume / pre-define the number of clusters.

• Any clustering result with the desired number of clusters K can be obtained by “cutting” the dendrogram at the corresponding level.

• The result is independent of the initialization.

60
HAC

1. How to Define the distance between 2 clusters?

Distance?

61
HAC

1. How to Define the distance between 2 clusters?

• MIN / Single Linkage: the minimum distance between any pair of


two data samples from each cluster.

62
HAC

1. How to Define the distance between 2 clusters?

• MAX / Complete Linkage: the maximum distance between any pair


of two data samples from each cluster.

63
HAC

1. How to Define the distance between 2 clusters?

• Average Linkage: the average distance between all pairs of two data
samples from each cluster.

64
HAC

1. How to Define the distance between 2 clusters?

• Centroid Distance: the distance between the means of data samples


(i.e., centroids) from each cluster.

65
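A minimal illustrative sketch of these four cluster-distance definitions, each operating on two arrays of points (one row per sample):

```python
import numpy as np

def pairwise(A, B):
    # all Euclidean distances between points of cluster A and cluster B
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_linkage(A, B):     # MIN: closest pair of points
    return pairwise(A, B).min()

def complete_linkage(A, B):   # MAX: farthest pair of points
    return pairwise(A, B).max()

def average_linkage(A, B):    # average over all pairs of points
    return pairwise(A, B).mean()

def centroid_distance(A, B):  # distance between the cluster means
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
```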
HAC

2. How to determine the pair of clusters with minimum distance?

• Visualize the samples in the space.

66
HAC

2. How to determine the pair of clusters with minimum distance?

• Visualize the samples in the space.

• Merge the two clusters (points)


that are closest to each other.

67
HAC

2. How to determine the pair of clusters with minimum distance?

• Visualize the samples in the space.

• Merge the two clusters (points)


that are closest to each other.

• Merge the next closest clusters.

68
HAC

2. How to determine the pair of clusters with minimum distance?

• Visualize the samples in the space

• Merge the two clusters (points)


that are closest to each other

• Merge the next closest clusters.

• Then the next closest…

69
HAC

2. How to determine the pair of clusters with minimum distance?

• Visualize the samples in the space

• Merge the two clusters (points)


that are closest to each other

• Merge the next closest clusters.

• Then the next closest…

• Until only one cluster left.


Or the given K clusters left.

70
HAC

2. How to determine the pair of clusters with minimum distance?

• Equivalently, we have the resulting dendrogram showing the HAC process

• The y-axis of the dendrogram shows the distance between clusters when merging at each step.

71
Example: HAC
points #1 #2 #3 #4
x 1.9 1.8 2.3 2.3
y 1.0 0.9 1.6 2.1

What distance metric do we use between clusters?

1. Single Linkage (MIN distance)


2. Complete Linkage (MAX distance)
3. Centroid Distance (distance between the centers)
4. Average Linkage (average over all pairs of points from two clusters)

72
Example: HAC – Centroid distance
points #1 #2 #3 #4
x 1.9 1.8 2.3 2.3
y 1.0 0.9 1.6 2.1

Distance = Centroid Distance


      #1     #2     #3     #4
#1    0      0.14   0.72   1.17
#2    0.14   0      0.86   1.3
#3    0.72   0.86   0      0.5
#4    1.17   1.3    0.5    0
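This table can be reproduced with a short NumPy sketch (rounding to two decimals as on the slide):

```python
import numpy as np

pts = np.array([[1.9, 1.0], [1.8, 0.9], [2.3, 1.6], [2.3, 2.1]])
# For singleton clusters the centroid distance is simply the Euclidean distance
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
print(np.round(D, 2))
```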

73
Example: HAC – Centroid distance
points #1 #2 #3 #4
x 1.9 1.8 2.3 2.3
y 1.0 0.9 1.6 2.1

Distance = Centroid Distance


#1 #2 #3 #4

#1 0
#2 0.14 0
#3 0.72 0.86 0
#4 1.17 1.3 0.5 0

74
Example: HAC
points #1 #2 #3 #4
x 1.9 1.8 2.3 2.3
y 1.0 0.9 1.6 2.1

Distance between #1 and #2 = 0.14 - Smallest

• Calculate the new centroid for (#1 and #2):

Cluster        #1+#2   #3    #4
Centroid (x)   1.85    2.3   2.3
Centroid (y)   0.95    1.6   2.1

75
Example: HAC
Cluster        #1+#2   #3    #4
Centroid (x)   1.85    2.3   2.3
Centroid (y)   0.95    1.6   2.1

Update the distance table


#1 + #2 #3 #4

#1 + #2 0
#3 0.79 0
#4 1.23 0.5 0

76
Example: HAC
Cluster        #1+#2   #3    #4
Centroid (x)   1.85    2.3   2.3
Centroid (y)   0.95    1.6   2.1

Distance      #1+#2   #3    #4
#1+#2          0
#3             0.79    0
#4             1.23    0.5   0

Next closest clusters to merge?

• Distance between #3 and #4 = 0.5

Cluster        #1+#2   #3+#4
Centroid (x)   1.85    2.3
Centroid (y)   0.95    1.85

77
Example: HAC
Cluster        #1+#2   #3+#4
Centroid (x)   1.85    2.3
Centroid (y)   0.95    1.85

Merge into 1 cluster

Cluster        #1+#2+#3+#4
Centroid (x)   2.075
Centroid (y)   1.4

78
Example: HAC – Average linkage
points #1 #2 #3 #4
x 1.9 1.8 2.3 2.3
y 1.0 0.9 1.6 2.1

Distance = Average Linkage


#1 #2 #3 #4

#1 0
#2 0.14 0
#3 0.72 0.86 0
#4 1.17 1.3 0.5 0

79
Example: HAC

      #1     #2     #3     #4
#1    0
#2    0.14   0
#3    0.72   0.86   0
#4    1.17   1.3    0.5    0

Merge #1 and #2, and update the distance table:

• D(1+2, 3) = 0.5 * D(1,3) + 0.5 * D(2,3) = 0.36 + 0.43 = 0.79


• D(1+2, 4) = 0.5 * D(1,4) + 0.5 * D(2,4) = 0.59 + 0.65 = 1.24

#1 + #2 #3 #4

#1 + #2 0
#3 0.79 0
#4 1.24 0.5 0
80
Example: HAC

          #1+#2   #3    #4
#1+#2     0
#3        0.79    0
#4        1.24    0.5   0

Merge #3 and #4 next, and update the distance table:

• D(1+2, 3+4) = 0.5 * D(1+2,3) + 0.5 * D(1+2,4) = 0.40 + 0.62 = 1.02

#1 + #2 #3 + #4

#1 + #2 0
#3 + #4 1.02 0

81
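A compact NumPy sketch of HAC (illustrative; it recomputes the cluster distances after every merge, which is fine for small examples). With the centroid linkage defined below, or the average linkage from the earlier sketch, it reproduces the merge order of the example above: #1+#2 first, then #3+#4, then everything:

```python
import numpy as np

def hac(points, linkage, K=1):
    """Merge clusters until K remain; linkage(A, B) gives the distance between two clusters."""
    clusters = [[i] for i in range(len(points))]           # start: every point is a cluster
    while len(clusters) > K:
        best = None
        for a in range(len(clusters)):                      # find the closest pair of clusters
            for b in range(a + 1, len(clusters)):
                d = linkage(points[clusters[a]], points[clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        print(f"merge {clusters[a]} + {clusters[b]} at distance {d:.2f}")
        clusters[a] = clusters[a] + clusters[b]             # merge the closest pair ...
        del clusters[b]                                     # ... and drop the absorbed cluster
    return clusters

def centroid_distance(A, B):
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

pts = np.array([[1.9, 1.0], [1.8, 0.9], [2.3, 1.6], [2.3, 2.1]])
hac(pts, centroid_distance)   # indices are 0-based: {0,1}, then {2,3}, then all four
```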
K-Means vs. HAC

• K-Means

✓ Simple and cheap algorithm

✘ Results are sensitive to the initialization


✘ Number of clusters needs to be pre-defined

• HAC

✓ Deterministic algorithm, i.e., no randomness.


✓ Shows a range of clustering results for different choices of K.

✘ More memory- and computationally-intensive than K-Means

82
What we learn

• What is clustering?

• What is distance metric?

o Euclidean distance, centroid distance, etc.

• K-Means

o Goal, algorithm, optimization, examples

• HAC

o Algorithm, dendrogram, cluster selection, examples


83
Carry-on Questions
• What is a cluster? What is clustering?

• Cluster: a set of data which are similar to each other.


• Clustering: group / organize the data such that the data in the same group are
more similar to each other, than to those in other groups.

• What is the difference between clustering and classification?

• Clustering is unsupervised, while classification is supervised.

• What are the limitations of the K-Means algorithm?

• Need to choose K. Can get stuck at a poor local minimum. Need a good metric.

• What are the limitations of HAC algorithm?

• Memory- and computationally-intensive.

84
Regression

85
Outline

• Concept of Regression

• Linear Regression

• Examples

• Derivation of Linear Regression (Not Examined)

86
Carry-on Questions

• What is the difference between regression and classification?

• What is the loss function for training linear regressor?

87
Recap: Classification

• The computer decides whether there is a cat or not:

Input Data → Discrete Labels
88
From Classification to Regression

• The computer predicts the likelihood that there is a cat: e.g., 0.9213

Input Data → Continuous quantity
89
From Classification to Regression

• Classification is to predict a discrete class label:

• It may output a continuous value, in the form of the probability for a discrete
class label.

• Accuracy = percentage of correctly classified examples out of all predictions.

• Regression is to predict a continuous quantity:

• It may predict a discrete value, in the form of an integer quantity.

• Performance is measured by the root mean squared error (RMSE), rather than accuracy.

90
Linear Regression

• Regressor: predict y ∈ R (output, a scalar) from x ∈ R^d (data point, a vector or scalar)

• Linear regression: y can be determined by w^T x + b

• w ∈ R^d and w^T denotes its transpose (a row vector).

• Training: to find the best w, based on training data.

• The training dataset contains N data points x_1, …, x_i, …, x_N, and their ground truth y_1, …, y_i, …, y_N.

91
Linear Regression

• Training: to find the best w, based on training data.

• Given a set of N data points x_1, …, x_i, …, x_N, and their y_1, …, y_i, …, y_N.

• Find f_{w,b}(x) = w^T x + b that minimizes the l2 loss

  min_{w,b} L̂(f_{w,b}) = min_{w,b} (1/N) Σ_{i=1}^{N} (w^T x_i + b − y_i)²

• Loss function: mean squared error between w^T x_i + b and y_i.

92
Linear Regression - Example

• Consider a simple 1-dimension data regression:

• Plot of the data points on the x–y plane:

93
Linear Regression - Example

• Consider a simple 1-dimension data regression:

• The predicted w (line) and the squared errors:

Minimizing the vertical offsets between w^T x_i + b and y_i!

94
Linear Regression – Derivation

• Regressor: predict y ∈ R (response) from x ∈ R^d (data point).

• Linear regression: y can be determined by w^T x + b

• w ∈ R^d and w^T denotes its transpose (a row vector).

  w^T x + b = [w^T | b] [x ; 1]

• To simplify the notation,
  w^T ← [w^T | b],   x ← [x ; 1]

• The new vectors: x ∈ R^{d+1} and w ∈ R^{d+1}

• Thus, y can be fully determined by w^T x.

95
Linear Regression - Derivation

• Write the problem in matrix form:

  L̂(f_w) = (1/N) Σ_{i=1}^{N} (w^T x_i − y_i)² = (1/N) ‖Xw − y‖₂²

• Concatenate the rows: the i-th row of X is x_i^T, and the i-th entry of y is y_i.

• Matrix X ∈ R^{N×(d+1)}

• Vector y ∈ R^N

• Vector w ∈ R^{d+1}

96
Linear Regression - Derivation

• Write the problem in matrix form:

  L̂(f_w) = (1/N) Σ_{i=1}^{N} (w^T x_i − y_i)² = (1/N) ‖Xw − y‖₂²

• Find the gradient w.r.t. w (dropping the constant term y^T y):

  ∇_w ‖Xw − y‖₂² = ∇_w (Xw − y)^T (Xw − y)
                 = ∇_w ( w^T X^T X w − 2 w^T X^T y )
                 = 2 X^T X w − 2 X^T y

• Set the gradient to zero to get the minimizer:

  X^T X w = X^T y

  w = (X^T X)^{-1} X^T y

• Note: here we assume (X^T X)^{-1} exists unless otherwise specified.
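A minimal NumPy sketch of this closed-form fit (illustrative; it solves the normal equations with `np.linalg.solve` rather than forming an explicit inverse, and appends the column of ones from the earlier augmentation trick; the function name and example data are mine):

```python
import numpy as np

def fit_linear_regression(X_raw, y):
    """X_raw: (N, d) data matrix, y: (N,) targets. Returns the augmented w in R^(d+1)."""
    N = X_raw.shape[0]
    X = np.hstack([X_raw, np.ones((N, 1))])   # x <- [x; 1] so that b is absorbed into w
    # Solve the normal equations X^T X w = X^T y (assumes X^T X is invertible)
    w = np.linalg.solve(X.T @ X, X.T @ y)
    return w                                  # last entry of w plays the role of the bias b

# Tiny 1-D example: points roughly on y = 2x + 1
X_raw = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.1, 2.9, 5.2, 6.9])
print(fit_linear_regression(X_raw, y))        # approximately [1.97, 1.07]
```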


97
Linear Regression - Derivation

• As we are minimizing the mean squared error, we call this solution the least squares (LS) estimator:

  w = (X^T X)^{-1} X^T y

• Sometimes linear is not good enough…

98
Linear Regression - Derivation

• As we are minimizing the mean squared error, we call this solution the least squares (LS) estimator:

  w = (X^T X)^{-1} X^T y

• Polynomial fit?
• No, thanks….

• Solution:

• Replace x_i with a better feature ϕ(x_i)

• Feature Learning
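Whatever ϕ is (hand-crafted or learned), the same least-squares machinery applies to ϕ(x_i) instead of x_i. A small illustrative sketch, reusing the `fit_linear_regression` function from the sketch above with a simple quadratic feature map chosen only for illustration:

```python
import numpy as np

def phi(x):
    # Illustrative hand-crafted feature map for 1-D inputs: [x, x^2]
    return np.hstack([x, x ** 2])

X_raw = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0]])
y = np.array([4.1, 0.9, 0.2, 1.1, 3.9])          # roughly y = x^2

w = fit_linear_regression(phi(X_raw), y)          # fit on phi(x) instead of x
print(w)   # weights for [x, x^2, 1]; the x^2 weight should be close to 1
```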

99
Carry-on Questions

• What is the difference between regression and classification?

• Classification is to predict discrete class labels.

• Regression is to predict continuous quantities.

• What is the loss function for training linear regressor?

• Loss function: mean squared error between w^T x_i + b and y_i.

• Minimize the vertical offsets.

100
Thank you! Any questions?

101
