BIM309 AI Week 13
2024
Outline
§ Supervised Learning (cont.)
§ Perceptron Algorithm
§ Neural Networks
§ Perceptron to MLP
§ The XOR Example
§ Backpropagation Algorithm
§ Backpropagation Rules
§ Unsupervised Learning: Clustering
§ K-Means Clustering
§ Hierarchical Clustering
Biological Neurons
A Simplified Mathematical Model of a Biological Neuron
§ Perceptron is a simplified model of a biological neuron, designed to mimic how neurons work in our brain, but mathematically.
§ Structure:
§ Inputs: Correspond to dendrites in biological neurons. These are the input signals (e.g., numbers) coming from other sources
§ Weights: Each input is multiplied by a weight that determines its importance
§ Summation: The weighted inputs are added together
§ Activation Function: A threshold is applied to the sum. If it exceeds the threshold, the perceptron "fires" (outputs a signal like 1); otherwise, it doesn't (outputs 0)
§ Function: A perceptron is used in AI for decision-making tasks like classification (e.g., determining if an email is spam or not). A minimal sketch follows below.
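A minimal sketch of this structure in Python (the input values, weights, and bias below are made-up numbers for illustration, not taken from the slides):

```python
import numpy as np

def perceptron_output(x, w, w0):
    """Weighted sum of the inputs plus the bias, passed through a step function."""
    z = w0 + np.dot(w, x)          # summation of the weighted inputs
    return 1 if z > 0 else 0       # "fires" (1) only if the sum exceeds the threshold

# Made-up example: 3 input features, arbitrary weights and bias
x = np.array([0.5, 1.0, 0.2])
w = np.array([0.4, -0.6, 1.5])
print(perceptron_output(x, w, w0=-0.1))
```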
Perceptron
§ Belongs to the Neural Network (NN) class of algorithms (which try to mimic how the brain functions)
§ The first algorithm (and the simplest NN) used was the Perceptron (Rosenblatt 1959)
§ The idea is that, given some inputs, the neuron should produce some output
§ Worked extremely well to recognize:
1. Handwritten characters (LeCun et al. 1989)
2. Spoken words (Lang et al. 1990)
3. Faces (Cottrell 1990)
§ NNs were popular in the 90's but then lost some of their popularity
§ Now NNs are back with deep learning, thanks to algorithmic and computational progress
Perceptron
[Slide figure: inputs x_1, ..., x_d with weights w_1, ..., w_d, a bias w_0, a summation node, and a step function producing the output.]
Given n examples and d features, for an example x_i:
f(x_i) = 1 if w_0 + Σ_{j=1}^{d} w_j x_ij > 0, and 0 otherwise (step function)
Perceptron
§ Works perfectly …
§ if data is linearly separable
§ If not, it will not converge
§ Idea: Start with a random hyperplane and adjust it using your training data
by updating the weights
§ Iterative method
§ The Perceptron Convergence Theorem states that if the data is linearly separable, the algorithm converges (learns a separating linear function) in a finite number of iterations
Perceptron
[Slide figure: the perceptron training algorithm in pseudocode; line 6 is the step that adjusts the weights.]
Perceptron
Some observations:
§ The weights w_1, . . . , w_d determine the slope of the decision boundary
§ w_0 determines the offset of the decision boundary (sometimes noted b)
§ Line 6 (#adjust the weights) corresponds to the weight-update rule sketched below:
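The update formula itself is not legible in the extracted slide; the standard perceptron rule, w_j ← w_j + α (y_i − f(x_i)) x_ij (and w_0 ← w_0 + α (y_i − f(x_i))), is what line 6 presumably corresponds to. A minimal sketch on made-up data:

```python
import numpy as np

def train_perceptron(X, y, alpha=0.1, epochs=100):
    """Standard perceptron training: start from a random hyperplane and
    adjust the weights after every misclassified example."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])     # w_1 ... w_d
    w0 = 0.0                            # bias / offset
    for _ in range(epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            pred = 1 if w0 + np.dot(w, x_i) > 0 else 0
            if pred != y_i:             # the "adjust the weights" step
                w += alpha * (y_i - pred) * x_i
                w0 += alpha * (y_i - pred)
                errors += 1
        if errors == 0:                 # converged (data was linearly separable)
            break
    return w, w0

# Made-up linearly separable data
X = np.array([[1, 1], [2, 2], [2, 0], [-1, -1], [-2, -1], [0, -2]], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0])
print(train_perceptron(X, y))
```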
Perceptron: Example
[Slide sequence: a 2-D toy dataset is classified while the perceptron iterates, adjusting the decision boundary after each misclassified example until it finally converges; the learned boundary is then shown with some test data.]
Perceptron
§ The w_j determine the contribution of x_j to the label
§ −w_0 is the quantity that Σ_{j=1}^{d} w_j x_j needs to exceed for the perceptron to output 1
§ Can be used to represent many Boolean functions:
§ AND, OR, NAND, NOR, NOT
§ but not all of them (e.g., XOR); see the sketch below
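As a quick illustration, a single perceptron with hand-picked weights can implement AND, OR, NAND, and NOT, while no single set of weights reproduces XOR. The weight values below are illustrative choices, not taken from the slides:

```python
def perceptron(w0, weights, inputs):
    return 1 if w0 + sum(w * x for w, x in zip(weights, inputs)) > 0 else 0

# Hand-picked weights for some Boolean functions (illustrative values)
AND  = lambda x1, x2: perceptron(-1.5, [1, 1], [x1, x2])
OR   = lambda x1, x2: perceptron(-0.5, [1, 1], [x1, x2])
NAND = lambda x1, x2: perceptron( 1.5, [-1, -1], [x1, x2])
NOT  = lambda x1:     perceptron( 0.5, [-1], [x1])

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, AND(x1, x2), OR(x1, x2), NAND(x1, x2))

# XOR cannot be written this way: its positive and negative examples
# are not linearly separable, so no single (w0, w1, w2) works.
```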
Outline
§ Supervised Learning (cont.)
§ Perceptron Algorithm
§ Neural Networks
§ Perceptron to MLP
§ The XOR Example
§ Backpropagation Algorithm
§ Backpropagation Rules
§ Unsupervised Learning: Clustering
§ K-Means Clustering
§ Hierarchical Clustering
From Perceptron to NN
§ The perceptron works perfectly
§ if data is linearly separable
§ if not, it will not converge
§ Neural networks use the ability of perceptrons to represent elementary functions and combine them in a network of layers of such elementary units
§ However, a cascade of linear functions is still linear!
§ And we want networks that represent highly non-linear functions
From Perceptron to NN
§ Also, the perceptron used a threshold (step) function, which is not differentiable and hence not suitable for gradient descent when the data is not linearly separable
§ We want a smooth, differentiable (and non-linear) activation function
§ One possibility is to use the sigmoid function:
g(z) = e^z / (1 + e^z) = 1 / (1 + e^(−z))
g(z) → 1 when z → +∞, and g(z) → 0 when z → −∞
Sigmoid function
Given n examples and d features, for an example x_i (the i-th row in the matrix of examples):
f(x_i) = 1 / (1 + e^(−(w_0 + Σ_{j=1}^{d} w_j x_ij)))
First observe:
Building an OR gate with a single sigmoid unit: choose w_0 = −10.
For the unit to output 1 when x_1 = 1 (and x_2 = 0) we need g(w_0 + w_1) ≈ 1, e.g., w_0 + w_1 = 10, so −10 + w_1 = 10 and w_1 = 20.
Similarly, for x_2 = 1 we need g(w_0 + w_2) ≈ 1, e.g., w_0 + w_2 = 10, so −10 + w_2 = 10 and w_2 = 20.

x_1  x_2  (x_1 OR x_2)  z = −10 + 20x_1 + 20x_2
 0    0        0                 −10
 0    1        1                  10
 1    0        1                  10
 1    1        1                  30

→ g(−10 + 20x_1 + 20x_2) = 1 / (1 + e^(−(−10 + 20x_1 + 20x_2))) ≈ x_1 OR x_2
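A quick check of the table above, as a minimal sketch using only the weights derived on this slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights from the OR example above: w0 = -10, w1 = w2 = 20
w0, w1, w2 = -10, 20, 20

for x1 in (0, 1):
    for x2 in (0, 1):
        z = w0 + w1 * x1 + w2 * x2
        print(x1, x2, round(sigmoid(z), 4))  # close to 0 only when both inputs are 0
```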
Note how the weights in the NAND unit are the negated (sign-flipped) weights of the AND unit:
(x_1 NAND x_2) = NOT (x_1 AND x_2)
x_1  x_2   (x_1 OR x_2)   (x_1 AND x_2)   (x_1 NAND x_2)   (x_1 OR x_2) AND (x_1 NAND x_2)   (x_1 XOR x_2)
 0    0         0               0                1                          0                      0
 0    1         1               0                1                          1                      1
 1    0         1               0                1                          1                      1
 1    1         1               1                0                          0                      0
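This composition can be checked with three sigmoid units arranged in two layers. The OR weights come from the slides; the AND and NAND weights below are assumed illustrative values (e.g., AND with (−30, 20, 20) and NAND with the signs flipped), not taken from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unit(w0, w1, w2, x1, x2):
    """One sigmoid unit with bias w0 and input weights w1, w2."""
    return sigmoid(w0 + w1 * x1 + w2 * x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        h_or   = unit(-10,  20,  20, x1, x2)      # OR   (weights from the slides)
        h_nand = unit( 30, -20, -20, x1, x2)      # NAND (assumed weights)
        out    = unit(-30,  20,  20, h_or, h_nand)  # AND of the two hidden outputs
        print(x1, x2, round(out))                 # reproduces the XOR truth table
```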
[Slide figure: a multi-layer network mapping the input examples (feature vectors) to the output label.]
Backpropagation Algorithm
§ Note: "Feedforward NNs" (as opposed to "Recurrent Networks") have no connections that loop
§ Backpropagation stands for “backward propagation of errors”
§ Learn the weights for a multilayer network
§ Given a network with a fixed architecture (neurons and inter-connections)
§ Use Gradient descent to minimize the squared error between the network
output value 𝒐 and the ground truth 𝒚
§ We suppose multiple outputs, indexed by k
§ Challenge: search over all possible weight values for all neurons in the network to find the weighting of this architecture that minimizes the squared error on the training data
Feedforward-Backpropagation
[Slide figure: a network in which w_ij is the weight between neurons i and j; a training example e = (x, y) (feature vector and label) is fed forward to produce the outputs o_k, which are compared with the label y.]
Backpropagation Rules
§ We consider k outputs
§ For an example e defined by (x, y), the error on training example e, summed over all output neurons in the network, is:
E_e(w) = ½ Σ_k (y_k − o_k)²
§ Gradient descent iterates through all the training examples one at a time, descending the gradient of the error:
Δw_ij = −α ∂E_e(w) / ∂w_ij
Backpropagation Rules
Δw_ij = −α ∂E_e(w) / ∂w_ij
∂E_e / ∂w_ij = (∂E_e / ∂z_j) (∂z_j / ∂w_ij) = (∂E_e / ∂z_j) x_ij
Δw_ij = −α (∂E_e / ∂z_j) x_ij
We consider two cases in calculating ∂E_e / ∂z_j (let's abandon the index e):
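The two cases work out to the usual delta rules for sigmoid units with squared error; the detailed derivation is on the following slide figures, which are not recoverable here, so this is a sketch of the standard results:

Case 1: j is an output neuron with o_j = g(z_j):
∂E/∂z_j = −(y_j − o_j) o_j (1 − o_j), so Δw_ij = α (y_j − o_j) o_j (1 − o_j) x_ij

Case 2: j is a hidden neuron feeding downstream neurons k:
∂E/∂z_j = o_j (1 − o_j) Σ_k (∂E/∂z_k) w_jk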
Backpropagation Rules
[Slide figures: the detailed derivation of ∂E/∂z_j for the two cases (output neurons and hidden neurons).]
Backpropagation Algorithm
[Slide figure: the complete backpropagation pseudocode.]
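A minimal sketch of the whole procedure on the XOR data, assuming one hidden layer of 3 sigmoid units and a single sigmoid output with per-example gradient descent; the network size, learning rate, and number of epochs are assumptions, not the exact settings from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR training data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 3))   # 2 inputs + bias -> 3 hidden units
W2 = rng.normal(scale=0.5, size=(4,))     # 3 hidden + bias -> 1 output

alpha = 0.5                                # learning rate
for epoch in range(20000):
    for x_i, y_i in zip(X, y):
        # forward pass
        a0 = np.append(x_i, 1.0)           # input plus bias term
        h = sigmoid(a0 @ W1)               # hidden activations
        a1 = np.append(h, 1.0)             # hidden plus bias term
        o = sigmoid(a1 @ W2)               # network output

        # backward pass: delta rules for sigmoid units and squared error
        delta_o = (y_i - o) * o * (1 - o)
        delta_h = h * (1 - h) * (W2[:3] * delta_o)

        # gradient-descent weight updates
        W2 += alpha * delta_o * a1
        W1 += alpha * np.outer(a0, delta_h)

for x_i in X:
    h = sigmoid(np.append(x_i, 1.0) @ W1)
    o = sigmoid(np.append(h, 1.0) @ W2)
    print(x_i, round(float(o), 2))         # should approach 0, 1, 1, 0
```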
Observations
§ Convergence: stop when the changes in the weights become small
§ There are other activation functions → the hyperbolic tangent (tanh) function is often better in practice for NNs, as its outputs range from −1 to 1
Multi-Class Case
§ Nowadays, networks with more than two layers, a.k.a. deep networks, have proven to be very effective in many domains
§ Examples of deep networks: restricted Boltzmann machines, convolutional NNs, autoencoders, etc.
MNIST Database
http://yann.lecun.com/exdb/mnist/
Tensorflow
http://playground.tensorflow.org/
https://www.youtube.com/watch?v=aircAruvnKk
Machine Learning Systems: Principles and Practices of Engineering Artificially Intelligent Systems (https://mlsysbook.ai/)
Outline
§ Supervised Learning (cont.)
§ Perceptron Algorithm
§ Neural Networks
§ Perceptron to MLP
§ The XOR Example
§ Backpropagation Algorithm
§ Backpropagation Rules
§ Unsupervised Learning: Clustering
§ K-Means Clustering
§ Hierarchical Clustering
Unsupervised Learning
Training data: "examples" x
x_1, . . . , x_n,   x_i ∈ X ⊆ ℝ^d
§ Clustering/segmentation:
f : ℝ^d → {C_1, . . . , C_k} (set of clusters)
Unsupervised Learning
[Slide figure: a scatter plot of unlabeled examples with axes length and weight.]
Unsupervised Learning
[Slide figure: the same length/weight scatter plot, grouped into clusters.]
Methods: K-means, Gaussian mixtures, Hierarchical clustering, Spectral clustering, etc.
K-Means: Example
[Slide sequence: a step-by-step illustration of K-Means on a 2-D toy dataset.]
Clustering: K-Means
§ Minimize:
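The expression being minimized is not legible in the extracted slide; the standard K-Means objective (the within-cluster sum of squared distances to the cluster centers) is:

J(C_1, . . . , C_k) = Σ_{c=1}^{k} Σ_{x_i ∈ C_c} ‖x_i − μ_c‖², where μ_c is the mean of the points assigned to cluster C_c.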
Clustering: K-Means
Algorithm K-Means
[Slide figure: the K-Means pseudocode; a sketch of the iterations follows below.]
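A minimal Python sketch of the iterations (Lloyd's algorithm: assign points to the nearest center, recompute the centers, repeat), not the exact pseudocode from the slide:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-Means: assignment step, update step, repeat until stable."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    for _ in range(n_iter):
        # assignment step: index of the nearest center for every point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center becomes the mean of its assigned points
        new_centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                                else centers[c] for c in range(k)])
        if np.allclose(new_centers, centers):                 # converged
            break
        centers = new_centers
    return labels, centers

# Tiny usage example on two obvious blobs
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
labels, centers = kmeans(X, k=2)
print(labels, centers)
```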
BUT...
Cons
- Need to know K
- Suffers from the curse of dimensionality
- No theoretical foundation
K-Means: Questions
K-Means: Question-1
How to set 𝒌 to optimally cluster the data?
G-means algorithm (Hamerly and Elkan, NIPS 2003)
1. Initialize 𝑘 to be a small number
2. Run k-means with those cluster centers, and store the resulting centers as C
3. Assign each point to its nearest cluster
4. Determine if the points in each cluster fit a Gaussian distribution (Anderson-Darling test)
5. For each cluster
§ If the points seem to be normally distributed, keep the cluster center
§ Otherwise, replace it with two cluster centers
6. Repeat this algorithm from step 2 until no more cluster centers are created (a simplified sketch follows below)
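A simplified sketch of these steps in Python, using scipy's Anderson-Darling test and scikit-learn's KMeans. The projection onto the principal direction and the way the two replacement centers are placed are simplifications of the published G-means algorithm, and max_k and the 5% significance level are assumed defaults:

```python
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def principal_direction(points):
    centered = points - points.mean(axis=0)
    return np.linalg.svd(centered, full_matrices=False)[2][0]

def looks_gaussian(points):
    """Anderson-Darling normality test on the 1-D projection of the cluster."""
    if len(points) < 8:
        return True                                   # too few points to test/split
    proj = (points - points.mean(axis=0)) @ principal_direction(points)
    res = anderson(proj, dist='norm')
    return res.statistic < res.critical_values[2]     # 5% significance level

def g_means(X, max_k=20):
    centers = X.mean(axis=0, keepdims=True)           # 1. start with a small k (k = 1)
    while len(centers) <= max_k:
        km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)  # 2.-3.
        new_centers = []
        for c, center in enumerate(km.cluster_centers_):
            pts = X[km.labels_ == c]
            if len(pts) == 0 or looks_gaussian(pts):  # 4.-5. keep the center
                new_centers.append(center)
            else:                                     # 5. replace it with two centers
                offset = principal_direction(pts) * pts.std()
                new_centers.extend([center + offset, center - offset])
        if len(new_centers) == len(centers):          # 6. no new centers: done
            return km
        centers = np.array(new_centers)
    return km
```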
K-Means: Question-2
How to evaluate your model?
§ Not trivial (as compared to counting the number of errors in classification)
§ Internal evaluation: using same data, high intra-cluster similarity
(documents within a cluster are similar) and low inter-cluster similarity
§ E.g., the Davies-Bouldin index takes into account both the distances inside the clusters and the distances between clusters; the lower the value of the index, the wider the separation between different clusters and the more tightly the points within each cluster are packed together
§ External evaluation: use of ground truth or external data
§ E.g., mutual information, entropy, adjusted Rand index, etc.
https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
https://towardsdatascience.com/evaluation-metrics-for-clustering-models-5dde821dd6cd
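A short sketch of both evaluation styles with scikit-learn on made-up blob data (the dataset and the choice of three clusters are illustrative assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (davies_bouldin_score,        # internal evaluation
                             adjusted_rand_score,          # external evaluation
                             normalized_mutual_info_score)

# Toy data with known ground-truth labels
X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Davies-Bouldin (lower is better):", davies_bouldin_score(X, y_pred))
print("Adjusted Rand index:", adjusted_rand_score(y_true, y_pred))
print("Normalized mutual information:", normalized_mutual_info_score(y_true, y_pred))
```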
K-Means: Question-3
How to cluster non-circular shapes?
§ There are other methods that handle other shapes
§ Spectral Clustering
§ DBSCAN
§ BIRCH, etc.
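For instance, a density-based method such as DBSCAN handles the classic two-moons shape that K-Means cannot separate. A minimal scikit-learn sketch (eps and min_samples below are illustrative settings):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: K-Means would split them incorrectly,
# but a density-based method can follow the curved shapes.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))   # two clusters (and possibly -1 for noise points)
```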
K-Means Clustering
[Slide sequence: a step-by-step animation of K-Means alternating assignment and centroid-update steps until it reaches the optimal solution shown on the last slide.]
Hierarchical Clustering
We don't want to specify the number of clusters, … but we want to group the houses…
There is a different way to group
Hierarchical Clustering
Grouping Rule: If two houses are close to each other, they order from the same pizza
restaurant!!
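A minimal sketch of this "merge the closest pair first" idea with scipy's agglomerative (hierarchical) clustering, on made-up house coordinates; the linkage method and cut distance are illustrative choices:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 2-D "house" coordinates
houses = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                   [5.0, 5.0], [5.2, 5.1], [4.9, 5.3]])

# Agglomerative clustering: repeatedly merge the two closest groups
Z = linkage(houses, method='single')              # single linkage = nearest neighbour

# The hierarchy can be cut at any distance, without fixing k in advance
print(fcluster(Z, t=1.0, criterion='distance'))   # groups of houses closer than 1.0
```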
Hierarchical Clustering
[Slide sequence: nearby houses are merged into groups step by step, building up a hierarchy of clusters.]
§ Goal: subdivide a market into distinct subsets of customers where any subset
may conceivably be selected as a market target to be reached with a distinct
marketing mix.
§ Approach:
§ Collect different attributes of customers based on their geographical and
lifestyle related information
§ Find clusters of similar customers
§ Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters (a brief sketch follows below)
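A brief sketch of this approach with hypothetical customer attributes; the feature names and values below are assumptions for illustration, not taken from the slides:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical customer attributes (geographic / lifestyle features are assumed)
customers = pd.DataFrame({
    "age":              [23, 45, 31, 52, 36, 28],
    "annual_income":    [28_000, 95_000, 54_000, 120_000, 61_000, 33_000],
    "visits_per_month": [2, 8, 5, 9, 4, 3],
})

X = StandardScaler().fit_transform(customers)     # put attributes on a common scale
customers["segment"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Inspect each segment, e.g., to compare buying patterns across segments
print(customers.groupby("segment").mean())
```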
§ Goal: To find groups of documents that are similar to each other based on the
important terms appearing in them.
§ Gain: Information Retrieval can utilize the clusters to relate a new document or
search term to clustered documents.
Anomaly Detection
§ Detect significant deviations from normal behavior
§ Applications:
§ Credit Card Fraud Detection
§ Network Intrusion Detection