Week 8
Learning
Nitin Indurkhya
(you can google my name and read about my background)
Acknowledgement
• This set of slides is based on input from A/Prof. Vivek Balachandran,
Prof. Yu Chien Siang and A/Prof. Guo Huaqun!
Course Logistics
• All details are in the module profile in the LMS
• Lab work will involve Python programming
• Work in same groups as in part-1 of the course
• Let's do a quick poll!
Quick Mentimeter Poll…
Assessment of part-2
• Quizzes each week (5%)
– Note that there is no quiz in week 12
– Quiz 11 will have 2 marks, other quizzes are worth 1 mark
– Quizzes will be from 11:30am to 11:35am
– Machine-marking
– Absolutely no makeup; approved MCs will be extrapolated
• Group Coursework (25%)
• Final exam in week-13 (20%)
– Individual assessment
– One-hour exam (MCQs and short answers) using a lockdown
browser at NYP
• Similar to what you did in the first half
Learning Outcomes (for today)
• Define the concept of Machine Learning
• Understand the three types of Machine Learning
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
• Understand the concepts behind supervised learning algorithms and apply them in the
lab assignment and coursework
• K Nearest Neighbors
• Decision Tree
• Random Forest
Machine Learning
• What?
• Like humans: learning from the past
• The science of getting computers to learn without
being explicitly programmed
• Machine Learning is an application of AI wherein
the system gets the ability to automatically learn
and improve based on experience
• Machine Learning Applications
• Top 5 Machine Learning Application For 2022
• https://www.youtube.com/watch?v=aKhz79s-Row
Dimensions of ‘Learning’
• Has one acquired KNOWLEDGE
• Can one UNDERSTAND (meta-knowledge)
• Can one REFINE existing knowledge (background)
• What is the best REPRESENTATION (for specific knowledge)
• What are the different SOURCES (modalities)
• Is PERFORMANCE (of a specific skill) improving
• …many others…
Learning and Intelligence
• Generally accepted that ‘ability to learn’ is CENTRAL to ‘intelligence’
• Hence ‘Machine Learning’ is accepted as CENTRAL to ‘Artificial
Intelligence’ (AI)
• Over time, many so-called ‘Intelligent’ activities were reassessed:
– Game-playing (Chess is the most famous example)
– Information retrieval
• Evolving perspective of ‘Intelligence’
– Whatever CANNOT be done by a computer is ‘intelligence’
– Once an AI-task is ‘solved’ it is no longer considered part of AI
• Hence evolving view of ‘learning’!!!
Machine Learning Definition
• Arthur Samuel (1959). Machine Learning: Field of study that gives
computers the ability to learn without being explicitly programmed.
Traditional Programming:
• Data + Program → Computer → Output
• Example: finding the square root of a number
Machine Learning:
• Input Data + Corresponding Output → Computer → Program
• Example: classification, e.g. 'predicting' the weather today
‘Machine Learning’ as ‘ML’
• Recall the term ‘Dynamic Programming’ in Algorithms
– Just a name, other programming paradigms were not any less ‘dynamic’.
• ML is a VERY SPECIFIC view of the phrase ‘Machine Learning’
• Objective is to build a MODEL and use it for NEW CASES
• Model is built using a spreadsheet of data (mostly numeric)
• Problems posed as CLASSIFICATION or REGRESSION tasks.
• Very constrained classes of models are considered.
• Has very little to do with LEARNING and PREDICTION
Key points of the ML you will learn
• Spreadsheet of data used as input
– May need some transformations
• Predictions are NOT causal, merely correlations
• Connection of data to the real-world is not always clear-cut
• Iterative nature of modeling task
– Tweaking parameters, adjusting inputs, testing incessantly
• Focus is on a mature set of ML methods
– You won’t just be using them as black boxes but will learn HOW they work!
Supervised Learning
• The model learns by using labelled data, mapping labelled inputs to known outputs.
Unsupervised Learning
• Model learns through observation & finds structures in data.
• Model is given a dataset and is left to automatically find patterns and relationships in that dataset by creating clusters.
Reinforcement Learning
• Reinforcement learning
• involves an agent that interacts with its
environment by producing actions &
discovering errors or rewards.
• It is like being stuck on an isolated island,
where you must explore the environment
and learn how to live and adapt to the
living conditions on your own.
• Model learns through the trial-and-error
method
• It learns on the basis of reward or penalty
given for every action it performs
Types of Machine Learning
• Supervised Learning
  – Definition: the machine learns by using labeled data
  – Types of problems: classification or regression
  – Types of data: labelled data
  – Training: external supervision
  – Approach: map labeled input to known output
• Unsupervised Learning
  – Definition: the machine is trained on unlabeled data without any guidance
  – Types of problems: association or clustering
  – Types of data: unlabelled data
  – Training: no supervision
  – Approach: understand patterns and discover output
• Reinforcement Learning
  – Definition: involves an agent that interacts with its environment by producing actions & discovering errors or rewards
  – Types of problems: reward-based
  – Types of data: no pre-defined data
  – Training: no supervision
  – Approach: follow trial-and-error method
Supervised learning process
• 2 Stage process
• Learning (training): Learn a model using the training data
• Testing: Test the model using unseen test data to assess the model
accuracy
[Diagram: two-stage supervised learning pipeline]
• Training: training input → feature extractor → features; the features and their labels are fed to the machine learning algorithm, which produces a classifier model.
• Testing: test input → feature extractor → features → classifier model → predicted label.
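A minimal sketch of this two-stage process in Python. It assumes scikit-learn is available (the labs use Python; the specific library and the built-in iris dataset are illustrative assumptions, not course requirements):

# Stage 1: learn a classifier model from labelled training data
# Stage 2: test it on unseen data to estimate accuracy
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                      # features + labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = KNeighborsClassifier(n_neighbors=3)            # any classifier fits this pipeline
model.fit(X_train, y_train)                            # learning (training) stage
print("Test accuracy:", model.score(X_test, y_test))   # testing stage on unseen data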
Instance-based learning
• Instance-based Learning
• Learning=storing all training instances
• Classification = estimating the target function value for a new
instance
• Referred to as “Lazy” learning
• Model is created at the point of classification
• Supervised learning
• https://www.youtube.com/watch?v=4HKqjENq9OU
• KNN Algorithm - How KNN Algorithm Works With Example | Data
Science For Beginners (27 min)
KNN Algorithm
• Features
• All instances correspond to points in an n-dimensional Euclidean space
• Classification is delayed till a new instance arrives
• Classification done by comparing feature vectors of the different points
• Target function may be discrete or real-valued
• For discrete-valued targets, KNN returns the most common value among the k
nearest training examples.
K-Nearest Neighbor (How it works)
1 Nearest Neighbor
K-Nearest Neighbor (How it works)
3 Nearest Neighbor
KNN Algorithm
• Training algorithm
• For each training example <x,f(x)> add the example to the list
• Classification algorithm
• Given a query instance xq to be classified
• Let x1,..,xk be k instances which are nearest to xq
• f̂(x_q) ← argmax_{v ∈ V} Σ_{i=1..k} δ(v, f(xi))
  – where V is the set of class values and δ(a, b) = 1 if a = b, 0 otherwise
  – i.e. return the most common class value among the k nearest training examples
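As a concrete illustration, a minimal from-scratch sketch of this classification rule (Euclidean distance, majority vote). The function and variable names are illustrative, not from the course materials:

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_q, k=3):
    # distance from the query instance x_q to every stored training example
    dists = np.linalg.norm(X_train - x_q, axis=1)
    # indices of the k nearest training examples
    nearest = np.argsort(dists)[:k]
    # argmax over v of the sum of delta(v, f(xi)) == majority vote among the k neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

# tiny usage example with two classes
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.1], [4.8, 5.3]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_classify(X_train, y_train, np.array([1.1, 0.9]), k=3))   # -> "A"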
What is a good value for k?
• Determined experimentally
• Start with k=1 and use a test set to validate the error rate of the
classifier
• Repeat with k=k+1
• Choose the value of k for which the error rate is minimum
• Is K = 10 or K = 11, better?
• How to test efficacy
• N-fold cross validation!
N-fold Cross Validation
• Split data into N blocks
• Perform the classification N times
• Each round of testing has
• N-1 blocks used for Training
• 1 block used for Testing
• Total of N results are obtained
• Average the error across all the N classifications
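Putting the two ideas together, a sketch of choosing k experimentally with N-fold cross-validation. It assumes scikit-learn; the built-in iris data is only a placeholder for your own dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
N = 10                                   # number of folds (blocks)

scores_by_k = {}
for k in range(1, 21):                   # start with k=1 and keep incrementing
    model = KNeighborsClassifier(n_neighbors=k)
    # each of the N rounds trains on N-1 blocks and tests on the held-out block
    scores_by_k[k] = cross_val_score(model, X, y, cv=N).mean()

best_k = max(scores_by_k, key=scores_by_k.get)   # k with the lowest cross-validated error
print("best k:", best_k, "cross-validated accuracy:", round(scores_by_k[best_k], 3))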
Distance Calculation
• All instances correspond to points in an n-dimensional Euclidean space X = (X1, X2, …, Xn)
• Distance between two instances X and Y: d(X, Y) = sqrt( Σ_{i=1..n} (Xi - Yi)^2 )
• The target function may be discrete or real-valued, e.g. f : ℝ^n → ℝ
• For a real-valued target, KNN returns the mean of the k nearest training values:
  – f̂(x_q) ← ( Σ_{i=1..k} f(xi) ) / k
https://study.com/academy/lesson/discrete-continuous-functions-definition-examples.html
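A small sketch of these formulas: the Euclidean distance between feature vectors and the mean-of-k prediction for a real-valued target. Names and data are purely illustrative:

import numpy as np

def euclidean_distance(x, y):
    # d(x, y) = sqrt( sum_i (x_i - y_i)^2 )
    return np.sqrt(np.sum((x - y) ** 2))

def knn_regress(X_train, f_train, x_q, k=3):
    dists = np.array([euclidean_distance(x, x_q) for x in X_train])
    nearest = np.argsort(dists)[:k]
    # f_hat(x_q) = (1/k) * sum of f(xi) over the k nearest neighbours
    return f_train[nearest].mean()

X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
f_train = np.array([1.1, 1.9, 3.2, 9.8])
print(knn_regress(X_train, f_train, np.array([2.5]), k=3))   # about 2.07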
Curse of Dimensionality
• Imagine instances are described by 20 features
(attributes) but only 3 are relevant to target function
• Curse of dimensionality: nearest neighbor is easily
misled when instance space is high-dimensional
• Dominated by large number of irrelevant features
• https://deepai.org/machine-learning-glossary-and-terms/curse-of-dimensionality
Possible solutions
• Weight features
  – Use cross-validation to automatically choose weights z1,…,zn
• Feature subset selection (a small sketch follows below)
[Figure: two wind turbines that seem very close to each other in two dimensions but separate when viewed in a third dimension; this is the same effect the curse of dimensionality has on data]
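One possible sketch of feature subset selection before KNN, assuming scikit-learn. Irrelevant noise features are appended to the iris data and then filtered out with a univariate score, so distances are computed only on informative features:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 16))])    # add 16 irrelevant features

plain = KNeighborsClassifier(n_neighbors=5)
selected = make_pipeline(SelectKBest(f_classif, k=4),          # keep the 4 best-scoring features
                         KNeighborsClassifier(n_neighbors=5))

print("all features:   ", cross_val_score(plain, X_noisy, y, cv=5).mean())
print("selected subset:", cross_val_score(selected, X_noisy, y, cv=5).mean())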
KNN – When?
• When to use KNN
• Classification problems
• Data has a manageable, well-defined feature space
• Lots of training data
Advantages:
• Training is very fast
• Learn complex target functions
• Do not lose information
Disadvantages:
• Slow at query time (or classification)
• Easily fooled by irrelevant features (attributes)
Why try Distance methods?
• Easy to apply
• No “model” to build usually
• Most packages have polished implementations
• Non-parametric in nature
• Doesn't make assumptions about population
distribution
• Quite competitive (surprisingly!)
• Theoretical result: E(1-NN) <= 2*E(Bayes), i.e. the asymptotic 1-NN error is at most twice the optimal Bayes error
• Will discuss Bayes rule later!!
Decision Tree
• Tree structure
• Inverted tree starting with a root node
[Figure: example decision tree with root node Age (Old / Young); the Old branch splits on Sex (Female / Male), and the leaves are labelled Healthy or Diseased]
Decision Tree dissected
[Figure: decision tree with Outlook at the root and Yes/No leaf nodes]
Building Decision Tree
• Top-down tree construction
• At start, all training examples are at the root.
• Partition the examples recursively by choosing one attribute each time.
https://medium.com/datadriveninvestor/tree-algorithms-id3-c4-5-c5-0-and-cart-413387342164
Which attribute to select?
[Figure: the weather dataset (9 Yes, 5 No) split two candidate ways, by Outlook (Sunny / Overcast / Rain) and by Windy]
• Note: log2(x) = log(x) / log(2)
Information Gain
• Entropy tells how pure or impure one subset is
• How to combine entropy of all subsets?
• Aggregate information from several different subsets
• Average them?
• Not a simple average (Why?)
• Weight on the entropy value for each subset
• Proportional size of the subset
• Information Gain
• Entropy difference before and after the split
– Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} ( |Sv| / |S| ) · Entropy(Sv)
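These two formulas translate directly into a few lines of Python; the helper names below are illustrative only:

import math

def entropy(counts):
    # entropy of a node, given the count of examples in each class
    total = sum(counts)
    return -sum(c / total * math.log(c / total, 2) for c in counts if c > 0)   # log2 x = log x / log 2

def information_gain(parent_counts, subset_counts):
    # Gain(S, A) = Entropy(S) - sum over subsets of (|Sv| / |S|) * Entropy(Sv)
    total = sum(parent_counts)
    expected = sum(sum(sub) / total * entropy(sub) for sub in subset_counts)
    return entropy(parent_counts) - expected

# Outlook example from the next slides: 9 Yes / 5 No at the root,
# Sunny (2 Yes, 3 No), Overcast (4 Yes, 0 No), Rain (3 Yes, 2 No)
print(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # about 0.25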
Find the Gain of Outlook
[Figure: root node with 9 Yes / 5 No, split on Outlook into Sunny (2 Yes, 3 No), Overcast (4 Yes, 0 No) and Rain (3 Yes, 2 No)]
• Entropy of the set of outcomes
  – Before the split happens (at the root node)
• Entropy of Sunny
• Entropy of Overcast
• Entropy of Rain
• Expected information for the attribute subsets
• Information gain
Find the Gain of Outlook
[Figure: root node with 9 Yes / 5 No, split on Outlook into Sunny (2 Yes, 3 No), Overcast (4 Yes, 0 No) and Rain (3 Yes, 2 No)]
• Entropy of the set of outcomes (E)
  – E(9/14, 5/14)  (X)
• Entropy of Sunny (E1)
  – E(2/5, 3/5)
• Entropy of Overcast (E2)
  – E(4/4, 0/4)
• Entropy of Rain (E3)
  – E(3/5, 2/5)
• Expected information for the attribute subsets
  – (5/14)*E1 + (4/14)*E2 + (5/14)*E3  (Y)
• Information gain:
  – X - Y
Find the Gain of Outlook
[Figure: root node with 9 Yes / 5 No, split on Outlook into Sunny (2 Yes, 3 No), Overcast (4 Yes, 0 No) and Rain (3 Yes, 2 No)]
• Entropy of the set of outcomes
  – E(9/14, 5/14) = -9/14 log2(9/14) - 5/14 log2(5/14)
    = -9/14 log(9/14)/log2 - 5/14 log(5/14)/log2 = 0.94
• Entropy of Sunny
  – E(2/5, 3/5) = -2/5 log2(2/5) - 3/5 log2(3/5)
    = -2/5 log(2/5)/log2 - 3/5 log(3/5)/log2 = 0.971
• Entropy of Overcast
  – E(4/4, 0/4) = -4/4 log2(4/4) - 0/4 log2(0/4) = 0   (taking 0 log 0 as 0)
• Entropy of Rain
  – E(3/5, 2/5) = -3/5 log2(3/5) - 2/5 log2(2/5)
    = -3/5 log(3/5)/log2 - 2/5 log(2/5)/log2 = 0.971
• Expected information for the attribute subsets
  – (5/14)*0.971 + (4/14)*0 + (5/14)*0.971 = 0.69
• Information gain:
  – Gain(S, Outlook) = 0.94 - 0.69 = 0.25
Find the Gain of Temperature
[Figure: root node with 9 Yes / 5 No, split on Temperature into Hot (2 Yes, 2 No), Mild (4 Yes, 2 No) and Cold (3 Yes, 1 No)]
• Entropy of the set of outcomes
  – E(9/14, 5/14) = -9/14 log2(9/14) - 5/14 log2(5/14)
    = -9/14 log(9/14)/log2 - 5/14 log(5/14)/log2 = 0.94
• Entropy of Hot
  – E(2/4, 2/4) = -2/4 log2(2/4) - 2/4 log2(2/4) = 1
• Entropy of Mild
  – E(4/6, 2/6) = -4/6 log2(4/6) - 2/6 log2(2/6)
    = -4/6 log(4/6)/log2 - 2/6 log(2/6)/log2 = 0.92
• Entropy of Cold
  – E(3/4, 1/4) = -3/4 log2(3/4) - 1/4 log2(1/4)
    = -3/4 log(3/4)/log2 - 1/4 log(1/4)/log2 = 0.81
• Expected information for the attribute subsets
  – (4/14)*1 + (6/14)*0.92 + (4/14)*0.81 = 0.29 + 0.39 + 0.23 = 0.91
• Information gain:
  – Gain(S, Temperature) = 0.94 - 0.91 = 0.03
Find the Gain of Humidity
[Figure: root node with 9 Yes / 5 No and Entropy(9/14, 5/14), split on Humidity into High (7 examples: 3 Yes, 4 No, Entropy(3/7, 4/7)) and Normal (7 examples: 6 Yes, 1 No, Entropy(6/7, 1/7))]
• Entropy of the set of outcomes
  – E(9/14, 5/14) = -9/14 log2(9/14) - 5/14 log2(5/14)
    = -9/14 log(9/14)/log2 - 5/14 log(5/14)/log2 = 0.94
• High Humidity entropy
  – E(3/7, 4/7) = -3/7 log2(3/7) - 4/7 log2(4/7)
    = -3/7 log(3/7)/log2 - 4/7 log(4/7)/log2 = 0.985
• Normal Humidity entropy
  – E(6/7, 1/7) = -6/7 log2(6/7) - 1/7 log2(1/7)
    = -6/7 log(6/7)/log2 - 1/7 log(1/7)/log2 = 0.592
• Expected information for the attribute subsets
  – (7/14)*0.985 + (7/14)*0.592 = 0.79
• Information gain:
  – Gain(S, Humidity) = 0.94 - 0.79 = 0.15
Find the Gain of Windy
[Figure: root node with 9 Yes / 5 No, split on Windy into True (3 Yes, 3 No) and False (6 Yes, 2 No)]
• Entropy of the set of outcomes
  – Es
• True entropy
  – E1
• False entropy
  – E2
• Expected information for the attribute subsets
• Information gain:
  – Gain(S, Windy) = ?
Computing information gain
• Information gain for each attribute
• Gain("Outlook") = 0.25
• Gain(“Temperature”) = 0.03
• Gain(“Humidity”) = 0.15
• Gain(“Windy”) =
• Find the node with the maximum gain
• The root node is Outlook!
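A small self-contained sketch that reproduces this comparison from the Yes/No counts on the previous slides and picks the attribute with the maximum gain (the entropy and gain helpers are repeated so it runs on its own):

import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log(c / total, 2) for c in counts if c > 0)

def information_gain(parent_counts, subset_counts):
    total = sum(parent_counts)
    expected = sum(sum(sub) / total * entropy(sub) for sub in subset_counts)
    return entropy(parent_counts) - expected

# Yes/No counts per attribute value, taken from the worked slides
splits = {
    "Outlook":     [[2, 3], [4, 0], [3, 2]],   # Sunny, Overcast, Rain
    "Temperature": [[2, 2], [4, 2], [3, 1]],   # Hot, Mild, Cold
    "Humidity":    [[3, 4], [6, 1]],           # High, Normal
    "Windy":       [[3, 3], [6, 2]],           # True, False
}
gains = {name: information_gain([9, 5], subsets) for name, subsets in splits.items()}
print(gains)
print("root attribute:", max(gains, key=gains.get))   # Outlook (~0.25) has the largest gain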
Decision Tree
[Figure: the tree so far, with Outlook at the root; the Sunny branch splits on Humidity (High → No, Normal → Yes), the Overcast branch is the leaf Yes, and the Rain branch splits on Windy into Yes/No leaves]
• Overcast node already ended up having leaf node 'Yes'
• Two subtrees of Sunny and Rain to compute information gain for:
  – Humidity
  – Temperature
  – Windy
Overfitting
• Overfitting: A tree may overfit the training data
• Symptoms: tree too deep and too many branches, some may reflect anomalies due
to noise or outliers
• Keep splitting until each node contains 1 example
• Singleton = pure
• Good accuracy on training data but poor on test data
• Two approaches to avoid overfitting
• Pre-pruning: Halt tree construction early
• Stop splitting when not statistically significant
• Difficult to decide because we do not know what may happen subsequently if we keep
growing the tree.
• Post-pruning: Remove branches or sub-trees from a “fully grown” tree.
• This method is commonly used
• Uses a statistical method to estimate the errors at each node for pruning.
• A validation set may be used for pruning as well.
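A sketch of both approaches with scikit-learn's DecisionTreeClassifier (the library and the built-in dataset are illustrative assumptions): stopping criteria such as max_depth act as pre-pruning, while ccp_alpha applies cost-complexity post-pruning to a fully grown tree:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0)                       # grown until leaves are pure
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5,       # pre-pruning: halt construction early
                             random_state=0)
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)       # post-pruning of the fully grown tree

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    model.fit(X_train, y_train)
    print(name, "train:", round(model.score(X_train, y_train), 3),
                "test:", round(model.score(X_test, y_test), 3))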
An example
Postpruning
• Postpruning waits until the full decision tree has been built and then
prunes back its branches
• Two techniques:
• Subtree Replacement
• Subtree Raising
Subtree Replacement
• Entire subtree is replaced by a single leaf node
[Figure: tree before replacement, showing an internal node C with children 1, 2 and 3, alongside sibling leaves 4 and 5]
Subtree Replacement
• Node 2 replaced the subtree
• Generalizes tree a little more, but may increase accuracy
[Figure: tree after replacement, with the subtree collapsed so the leaves are 2, 4 and 5]
Subtree Raising
• Entire subtree is raised onto another node
[Figure: tree before raising, showing node C with children 1, 2 and 3, alongside sibling leaves 4 and 5]
Subtree Raising
• Entire subtree is raised onto another node
[Figure: tree after raising, with subtree C raised in place of node B and its children 1, 2 and 3 retained]
Random Forest (RF)
• Ensemble Classifier
• Consists of many decision trees
• Created from subsets of data
• Random sampling of subsets
• Classification
• Classify using each of the trees in the random forest
• Each classifier predicts the outcome
• Final decision by voting
• The method combines Breiman's "bagging" idea and the random selection
of features
• https://www.youtube.com/watch?v=eM4uJ6XGnSM
• Random Forest Algorithm - Random Forest Explained (45 min)
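A minimal sketch of a random forest with scikit-learn (assumed library; the dataset is just a built-in example). Many trees are grown on bootstrap samples with a random subset of features considered at each split, and the final prediction is the majority vote of the trees:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100,      # number of decision trees in the ensemble
                                max_features="sqrt",   # random subset of features tried at each split
                                random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
# final decision for one instance = vote across the individual trees
print("predicted class for first test instance:", forest.predict(X_test[:1])[0])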
An example!
[Figure: example random forest of decision trees, each splitting on Age]