Lecture B1 - Overview and Intro
• Toledo: only one course is activated, please register for H0E96A (Beginselen…)
Goals & materials
• Goals of the course:
• 1. provide an overview of machine learning theory and methods, including insight into how and why the different methods work
• 2. enable you to apply machine learning in non-trivial contexts
Background
• Course reserved for Master of CS / Master of AI
• We assume knowledge of
• Calculus
• Linear algebra
• Probability theory
• Statistics
• Programming & Algorithms
• Artificial Intelligence (H06U1A)
as seen in earlier courses in KU Leuven’s bachelor programs
Informatica / Burgerlijk Ingenieur - Computerwetenschappen
Teaching schedule
[Schedule table with columns Week, Wednesday, Thursday, Ex.; sessions marked ✔︎]
The Exam
• Written exam, closed book. Testing both knowledge and insight, both theory and practice.
• Q: “Up to what level of detail should I study this? Do I need to know everything that’s in the
reader?”
• A: You should
• know & understand everything mentioned in the lectures
• be able to solve problems of similar difficulty as those covered in the exercise sessions
• be able to answer questions that require you to reason about the concepts you’ve
seen
• be able to extrapolate: apply a concept in a different context
Reading materials are meant to help you digest the content of the lectures, not provide
additional content (unless mentioned otherwise)
You should be able to…
• Execute (parts of) algorithms we’ve seen on concrete data. E.g.: given some data,
• show the kernel matrix that an SVM learner computes
• perform a backpropagation step in a given neural network
• Reason about the behavior of an algorithm on a more abstract level (based on its
properties rather than on mimicking it)
• E.g.: an SVM has k support vectors out of n instances, and 0 training error: give a
lower bound on the accuracy estimate obtained using leave-one-out cross-validation (see the sketch after this list)
• What would happen if we change … in the algorithm we’ve seen?
• The answer is not given in the course materials, but you can infer it by reasoning
about the concepts that we have seen
• Explain the architecture of a particular type of model, and explain its advantages and
disadvantages
• …
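As an illustration of the SVM question above, here is a minimal sketch in Python, with made-up numbers, of the leave-one-out reasoning: with 0 training error, leaving out a non-support vector leaves the separator unchanged, so only the k support vectors can possibly be misclassified in leave-one-out cross-validation.

def loocv_accuracy_lower_bound(n_instances, n_support_vectors):
    # At most the k support vectors can be misclassified when left out,
    # so at least (n - k) of the n folds are classified correctly.
    return (n_instances - n_support_vectors) / n_instances

# Hypothetical numbers: n = 100 instances, k = 8 support vectors
print(loocv_accuracy_lower_bound(100, 8))  # 0.92, i.e., accuracy >= 92%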
This is a challenging course
• In some exam periods, <50% of students pass this exam. Mean score
often <10. (Max score still often 19 or 20)
• Perceived causes for low grades:
• Not reading the question carefully
• Tendency to reproduce rather than reason
• Tendency to study questions & answers, rather than the course
itself
• Superficial rather than deep understanding of course topics
• Standard investment in a 6-ECTS course: 180 hours (lectures,
exercises, self-study). Aim for 12 hours per week during the semester.
What is Machine
Learning?
Discussion time…
Learn from data
Find patterns in data and generalize
The RoboSail project
• Pieter Adriaans (Univ. of Amsterdam), around 2003: first auto-
pilot for sailing boats (www.robosail.com)
• No suitable mathematical theory existed => the system had to learn how to sail
• Boat full of AI technology (agents, sensors, ...), including
machine learning components
Language learning
• Children learn language by simply hearing sentences being used
in a certain context
• Can a computer do the same: given examples (sentence +
description of context), learn the meaning of words/sentences?
Autonomous cars
• Very first race with fully autonomous cars was in… 2004
• DARPA grand challenge : have autonomous cars race each other
on desert roads
• In 2004, no winner - the “best” car got only about 12 km
• In 2005, five cars made it to the finish (212 km)
The Robot Scientist
• King et al., Nature, 2004
• Scientific research, e.g., in drug discovery, is iterative:
• Determine what experiment to perform
• Perform the experiment
• Feed the results back into the next iteration
Automating manual tasks
• E.g.: nurse rostering in a hospital: need to accommodate non-obvious
constraints (e.g., leave enough time between shifts)
• Hard to automate, unless constraints can be learned from earlier examples
Other applications…
• Recommender systems: e.g., Google, Facebook, Amazon, … try
to show you ads you might like
• Email spam filters: by observing which mails you flag as “spam”,
try to learn your preferences
• Natural language processing: e.g., Sentiment analysis: is this
movie review mostly positive or negative?
• Vision: learn to recognize pedestrians, …
• … and many, many more
Definitions of machine learning?
• Tom Mitchell, 1996: Machine learning is the study of how to make programs
improve their performance on certain tasks from (own) experience
• “performance” = speed, accuracy, …
• “experience” = earlier observations
• “Improve performance” in the most general meaning: this includes learning
from scratch.
• Useful (in principle) for anything that we don’t know how to program -
computer “programs itself”
• Vision: recognizing faces, traffic signs, …
• Game playing, e.g., AlphaGo
• Link to artificial intelligence : computer solves hard problems autonomously
Machine learning vs. other AI
• In machine learning, the key is data
• Examples of questions & their answer
• Observations of earlier attempts to solve some problem
Machine learning and AI
• Many misconceptions about machine learning these days
• Since the mid-2000s, “Deep Learning” has received a lot of
attention: it revolutionized computer vision, speech recognition,
natural language processing
• Avalanche of new researchers drawn to the field, without knowledge
of the broader field of AI, or history of ML (“AI = ML = deep learning”)
• See, e.g., A. Darwiche, https://www.youtube.com/watch?v=UTzCwCic-Do
(also published in Communications of the ACM, October 2018)
Machine learning and AI
• There is still progress on all fronts, deep learning is just one of them
• This course reflects that viewpoint
• (schema below is incomplete, just serves to illustrate complexity of
scientific impact)
[Schema: tasks, techniques, models, and applications, and the links between them]
Basic concepts and
terminology
Machine learning
• ML in its typical form:
• Input = dataset
• Output = some kind of model
• ML in its most general form:
• input = knowledge
• output = a more general form of knowledge
• Learning = inferring general knowledge from more specific knowledge (observations
➔ model) = inductive inference
• Learning methods are often categorized according to the format of the input and
output, and according to the goal of the learning process (but there are many more
dimensions along which they can be categorized)
A typical task
• Given examples of pictures + label (saying what’s on the picture),
infer a procedure that will allow you to correctly label new pictures
• E.g.: learn to classify fish as “salmon” or “sea bass”
A generic approach
• Find informative features (here: lightness, width)
• Find a line/curve/hyperplane/… in this feature space that separates
the classes
Predictive versus descriptive
• Predictive learning : learn a model that can predict a particular property / attribute / variable
from inputs
• Many tasks are special cases of predictive learning
• E.g., face recognition: given a picture of a face, say who it is
• E.g., spam filtering: given an email, say whether it’s spam or not
Name of task: Learns a model that can …
• Concept learning / binary classification: Distinguish instances of class C from other instances
• Classification: Assign a class C (from a given set of classes) to an instance
• Regression: Assign a numerical value to an instance
• Multi-label classification: Assign a set of labels (from a given set) to an instance
• Multivariate regression: Assign a vector of numbers to an instance
• Multi-target prediction: Assign a vector of values (numerical, categorical) to an instance
• Ranking: Assign ≤ or > to a pair of instances
Function learning
• Task : learn a function X →Y that fits the given data (with X and Y sets
of variables that occur in the data)
• Such a function will obviously be useful for predicting Y from X
• May also be descriptive, if we can understand the function
• Often, some family of functions F is given, and we need to estimate
the parameters of the function f in F that best fits the data
• e.g., linear regression : determine a and b such that y = ax + b fits
the data as well as possible
• What does “fit the data” mean? Measured by a so-called loss function
• e.g., quadratic loss: Σ_{(x,y)∈D} (f(x) − y)², with f the learned function and D the dataset
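A minimal sketch of this, assuming the linear family y = ax + b and made-up data: the quadratic loss has a closed-form minimizer (ordinary least squares).

import numpy as np

# Made-up dataset D of (x, y) pairs
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Minimize the quadratic loss sum over D of (a*x + b - y)^2 in closed form:
# stack columns [x, 1] and solve the least-squares system.
X = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)

loss = np.sum((a * x + b - y) ** 2)  # quadratic loss at the optimum
print(a, b, loss)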
Distribution learning
• Task: given a data set drawn from a distribution, estimate this distribution
• A commonly made distinction: parametric vs. non-parametric
• Parametric: a family of distributions is given (e.g., “Gaussian”), we only need to
estimate the parameters of the target distribution
• Non-parametric: no specific family is assumed
• A commonly made distinction: generative vs. discriminative
• Generative: learn the joint probability distribution (JPD) over all variables (once
you have that, you can generate new instances by random sampling from it)
• Discriminative: learn a conditional probability distribution of Y given X, for some
given set of variables X (called input variables) and Y (called target variables)
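A minimal sketch of parametric, generative distribution learning, assuming a 1-D Gaussian family and synthetic data: estimate the parameters, then generate new instances by sampling.

import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=1000)  # pretend this is our dataset

mu_hat = sample.mean()    # estimate of the mean
sigma_hat = sample.std()  # estimate of the standard deviation

# Generative use: sample new instances from the estimated distribution.
new_instances = rng.normal(mu_hat, sigma_hat, size=5)
print(mu_hat, sigma_hat, new_instances)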
These categorizations are somewhat
fuzzy…
• A descriptive pattern may be useful for prediction
• “Bank X always refuses loans to people who earn less than
1200 euros per month” (description)
• Bob earns 1100 euros per month => Bank X will not give him
a loan
• While functions are directly useful for prediction, a probability
distribution can be used just as well
• Given known information X, predict for Y the value with the highest conditional probability given X
Parametric vs. non-parametric
• Parametric: a family of functions (or distributions, or …) is given,
and each function is uniquely defined by the values of a fixed set
of parameters
• e.g. (function learning): linear regression
• e.g. (distribution learning): fitting a Gaussian
• Non-parametric: no specific family of functions is assumed
• Typically, we are searching a space that contains models with
varying structure, rather than just different parameter values
• This often requires searching a discrete space
• E.g.: decision trees, rules, …. (see later)
Link with “explainable AI”
• Explainable AI (XAI) refers to the study of AI systems that can explain their
decisions / whose decisions we can understand
• Two different levels here:
• We understand the (learned) model used for decision making
• We understand the individual decision
• E.g. “I could not get a loan because I earn too little”: we can understand this
decision even if we don’t know the whole decision process the bank uses
• A learned model that is not straightforward to interpret is called a black-box model
• Machine learning poses additional challenges for XAI, as it often learns
black-box models
Responsible AI : challenges
• Privacy-preserving data analysis
• We need lots of data to learn from; this may include personal data
• How can we guarantee that the analysis of these data will not violate
the privacy of the people whose data this is?
• Generally, when data is collected, consent is needed for a specific
purpose, and data must be used solely for that purpose — how can we
guarantee it won’t be abused?
• Learning “safe” models : models that will not violate certain constraints
that are imposed (including constraints on bias, discrimination, privacy, …)
Predictive learning
• A very large part of machine learning focuses on predictive
learning
• In the following, we zoom in on that part
Prediction: task definition
The prediction task, in general:
o Given: a description of some instance
o Predict: some property of interest (the “target”)
Examples:
o classify emails as spam / non-spam
o classify fish as salmon / bass
o forecast tomorrow’s weather based on today’s measurements
Training & prediction sets
• Training set: a set of examples, instance descriptions that
include the target property (a.k.a. labeled instances)
• Prediction set: a set of instance descriptions that do not include
the target property (“unlabeled” instances)
Inductive vs. transductive learning
We can consider as outcome of the learning process, either
• the predictions themselves: transductive learning
• or: a function that can predict the label of any unlabeled
instance: inductive learning
[Illustration, built up over three slides: a labeled training set {(x1,y1), (x2,y2), (x3,y3)} plus unlabeled instances x4 and x5. Transductive learning outputs labels for exactly these instances; inductive learning outputs a function f: X→Y, which is then applied to obtain f(x4) and f(x5).]
Interpretable vs. black-box
The predictive function or model learned from the data may be
represented in a format that we can easily interpret, or not
Overfitting and underfitting
• “Occam’s razor”: among equally accurate models, choose
the simpler one
• Trade-off: explain data vs. simplicity
• Both overfitting (the model also fits noise in the training data and generalizes poorly) and underfitting (the model is too simple to capture the real pattern) are harmful
Levels of supervision
• Supervised learning: learning a (predictive) model from labeled
instances (as in cats & dogs example)
Semi-supervised learning
• How can unlabeled examples help learn a better model?
[Illustration: two classes, called + and -, with instances represented in a 2-dimensional space; an unlabeled query point is marked “?”.]
[Same illustration, built up over two more slides: many unlabeled points (“.”) are added; they reveal cluster structure that suggests which class the “?” point belongs to.]
Unsupervised learning
• Can you see three classes here?
• Even though we don’t know the names of the classes, we still see some structure
(clusters) that we could use to predict which class a new instance belongs to
• Identifying this structure is called clustering
• From a predictive point of view, this is unsupervised learning
[Illustration: unlabeled points forming three visible clusters in a 2-dimensional space.]
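A minimal clustering sketch (k-means, on synthetic 2-D data): alternate between assigning points to their nearest center and moving each center to the mean of its cluster.

import numpy as np

rng = np.random.default_rng(1)
# Synthetic unlabeled data: three blobs in a 2-D feature space.
data = np.vstack([rng.normal(c, 0.3, size=(30, 2))
                  for c in [(0, 0), (3, 0), (1.5, 2.5)]])

k = 3
centers = data[rng.choice(len(data), k, replace=False)]  # random initialization
for _ in range(20):
    # Assign each point to the nearest center ...
    dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    labels = dists.argmin(axis=1)
    # ... then move each center to the mean of its assigned points.
    centers = np.array([data[labels == j].mean(axis=0) for j in range(k)])
print(centers)  # approximately the three blob centers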
PU-learning
• PU-learning is a special case of semi-supervised learning
• PU stands for “positive / unlabeled”
• All the labeled examples belong to one class (called the
“positive” class)
• Example: learning the meaning of “kicking the ball” requires PU-learning, because when Mike kicks the ball this is sometimes mentioned, but when he does not kick the ball, it is never mentioned that he does not.
[Illustration: a few positive (+) examples among many unlabeled (.) instances.]
Weakly supervised learning
• Weakly supervised learning is a generalized form of semi-supervised learning
• Semi-supervised: for a single instance, we either know its label or we do not
• Weakly supervised: we may have partial information about a label
• e.g., it is certainly a member of a given set (= superset learning)
• e.g., at least one instance among a given set of instances has the label,
but we do not know which one (= multi-instance learning)
• e.g., we know two instances have the same label, but we don’t know
which label it is (= constraint-based clustering)
• …
Relationship between different
supervision settings
[Diagram: within predictive learning, a spectrum from supervised learning to unsupervised learning, with weakly supervised learning (including semi-supervised learning and PU-learning) in between.]
Format of input data
• Input is often assumed to be a set of instances that are all described using the
same variables (features, attributes)
• The data are “i.i.d.”: “independent and identically distributed”
• The training set can be seen as a random sample from one distribution
• The training set can be shown as a table (instances x variables) : tabular data
• This is also called the standard setting
Format of input data: tabular
Training set:
Sepal length  Sepal width  Petal length  Petal width  Class
5.1           3.5          1.4           0.2          Setosa
4.9           3.0          1.4           0.2          Setosa
7.0           3.2          4.7           1.4          Versicolor
6.3           3.3          6.0           2.5          Virginica
Format of input data: sequences
• Learning from sequences:
• 1 prediction per sequence?
• 1 prediction per element?
• 1 element in sequence can be …
• A number (e.g., time series)
• A symbol (e.g., strings: abababab: +, aabbaabb: -)
• A tuple
• A more complex structure
Format of input data: trees
• 1 prediction per tree / per node in the tree
• Nodes can be …
• Unlabeled
• Labeled with symbols (e.g., HTML/XML structures)
• …
[Tree figure: a <ul> node with <li> children; one <li> contains <b>“Address:” followed by a text field labeled +.]
E.g.: this tree indicates as “positive” a text field preceded by “Address:” inside a list (<li>) context
Format of input data: graph
• Example: Social network
• Target value known for some
nodes, not for others
• Predict node label
• Predict edge
• Predict edge label
• …
• Use network structure for
these predictions
Format of input data: raw data
• “Raw” data are in a format that seems simple (e.g., a vector of
numbers), but components ≠ meaningful features
• Example: photo (vector of pixels)
• Raw data often need to be processed in a non-trivial way to obtain
meaningful features; on the basis of these features, a function can be
learned
• This is what deep learning excels at
Format of input data: knowledge
• “Knowledge” can consist of facts, rules, definitions, ….
• We can represent knowledge about some domain in a
knowledge representation language (such languages are often
based on logic)
atm(m1,a1,o,2,3.43,-3.11,0.04).
atm(m1,a2,c,2,6.03,-1.77,0.67).
...
bond(m1,a2,a3,2).
bond(m1,a5,a6,1).
bond(m1,a6,a7,du).
...

hacc(M,A):- atm(M,A,o,2,_,_,_).
hacc(M,A):- atm(M,A,o,3,_,_,_).
hacc(M,A):- atm(M,A,s,2,_,_,_).
hacc(M,A):- atm(M,A,n,ar,_,_,_).
zincsite(M,A):- atm(M,A,du,_,_,_,_).
hdonor(M,A):- atm(M,A,h,_,_,_,_), not(carbon_bond(M,A)), !.
...
Data preprocessing
• Data may not be in a format that your learner can handle
• Data wrangling: bring it into the right format
• Even if it’s in a format your learner can handle (e.g., tabular), the
features it contains may not be very informative, or there may be
very few relevant features among many irrelevant ones.
• E.g.: individual pixels in an image are usually not very informative
• Feature selection: select among many input features the most
informative ones
• Feature construction: construct new features, derived from the
given ones
What learning method to use?
• Which learners are suitable for your problem, depends strongly
(but not solely!) on the structure of the input data
• Most learners use the standard format
• A set of instances, where each instance is described by a
fixed set of attributes (a.k.a. features, variables)
• also called attribute-value format or tabular format
• At the other extreme, inductive logic programming handles any
kind of knowledge that can be represented using clausal logic
• This includes sequences, graphs, …
What learning method to use?
• The data format and the learning task impose strong constraints
on which learning methods can be used
• Other aspects determine whether the method performs well:
• Inductive bias (see later)
• Ability to handle missing values
• Ability to handle noise
• Ability to handle high-dimensional data
• Ability to handle large datasets
• Ability to generalize from small datasets (avoid overfitting)
• We’ll cover many of these aspects at different points in the course
Missing data
Some training examples may have missing values… how to handle these?
Sepal length  Sepal width  Petal length  Petal width  Class
5.1           ?            1.4           0.2          Setosa
4.9           3.0          1.4           0.2          Setosa
7.0           3.2          ?             1.4          Versicolor
6.3           3.3          6.0           2.5          Virginica
Some options:
1. Leave out from training set
- Information loss…
2. Guess the missing value
- What if the guess is wrong?
3. Treat ‘?’ as a separate value
Note: missingness itself can be relevant! It may correlate with the class (e.g., exit polls), …
Statisticians distinguish MCAR, MAR, NMAR: Missing (Completely) At Random, or Not (At Random).
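A minimal sketch of option 2 (guessing the missing value) via mean imputation, assuming scikit-learn is available; the rows mirror the table above, with np.nan standing in for ‘?’.

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[5.1, np.nan, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [7.0, 3.2, np.nan, 1.4],
              [6.3, 3.3, 6.0, 2.5]])

# Replace each missing value by the mean of its column.
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))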
Output formats,
methods (overview)
Output formats
• The output of a learning system is a model
• Many different types of model exist
• The learning algorithm or method is strongly linked to the type of model
• High-level overviews of machine learning methods often categorize
them along this axis
Different views of the landscape
• Domingos: “five tribes”
• Flach: “three types of models”
• Bishop: “the world is Bayesian”
Parametrized functions
• Typically, a certain format for the functions is provided; e.g.:
linear functions of the inputs
• Within this set, we look for the parameter values that best fit the
data
• Standard example: linear regression
[Scatter plot: Petal.Width vs. Petal.Length for the iris data, with a fitted line y = ax + b; here width = 0.416*length - 0.363.]
Conjunctive concepts
• A conjunctive concept is expressed as a set of conditions, all of
which must be true
• “x has class C if and only if <condition1> and <condition2> and
… and <condition k>”
Rule sets
• A rule set is a set of rules of the form “if … then …” or “if … then
… else …”
• Example: definition of leap years
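The slide’s figure is not reproduced here; as a sketch, the leap-year definition written as an if-then-else rule set (standard Gregorian calendar rules):

def is_leap_year(year):
    if year % 400 == 0:    # if divisible by 400 then leap year
        return True
    elif year % 100 == 0:  # else if divisible by 100 then not a leap year
        return False
    elif year % 4 == 0:    # else if divisible by 4 then leap year
        return True
    else:                  # else not a leap year
        return False

print([y for y in (1900, 2000, 2023, 2024) if is_leap_year(y)])  # [2000, 2024]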
Decision trees
• A decision tree represents a stepwise procedure to arrive at
some decision
Neural networks
• A neural network is a complex structure of neurons, each of which aggregates multiple input signals into a single output signal
[Figure: a small feedforward network with inputs x, y, z, hidden units such as h11 = f(a11x + b11y + c11z), and an output Out.]
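A minimal sketch of the figure’s computation, with made-up weights: each neuron computes a weighted sum of its inputs and passes it through a nonlinearity f.

import numpy as np

def f(t):
    return 1.0 / (1.0 + np.exp(-t))  # sigmoid activation

x, y, z = 0.5, -1.0, 2.0
W_hidden = np.array([[0.2, -0.3, 0.7],   # weights (a11, b11, c11) of h11
                     [0.5, 0.1, -0.4]])  # weights of a second hidden neuron
h = f(W_hidden @ np.array([x, y, z]))    # h[0] = f(a11*x + b11*y + c11*z)
w_out = np.array([1.5, -2.0])
out = f(w_out @ h)                       # output neuron aggregates the hidden layer
print(h, out)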
Probabilistic graphical models
• A PGM represents a (high-dimensional) joint distribution over
multiple variables as a product of (low-dimensional) factors
• Different type of PGMs: Bayesian networks, Markov networks,
factor graphs, …
[Factor graph figure: variables A, B, C, D, E connected to factors such as f1(a), f4(c,d), f5(c,e); the joint distribution is the product of all factors.]
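A minimal sketch of the factorization idea over three binary variables, with made-up factor tables: the joint distribution is the normalized product of small factors.

import itertools

f1 = {0: 1.0, 1: 2.0}                                      # factor over A
f2 = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}  # factor over (A, B)
f3 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}  # factor over (B, C)

joint = {(a, b, c): f1[a] * f2[(a, b)] * f3[(b, c)]
         for a, b, c in itertools.product((0, 1), repeat=3)}
Z = sum(joint.values())                 # normalization constant
joint = {v: p / Z for v, p in joint.items()}
print(joint[(1, 1, 0)])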
Search methods
• How do we find the most suitable model?
• Sometimes, there is a closed form solution (e.g., linear regression)
• If not, we typically need to search some hypothesis space
• Two very different types of spaces, each with their own search
methods :
• Discrete spaces (methods: hill-climbing, best-first, …)
• Continuous spaces (methods: gradient descent, …)
• Typically:
• Model structure not fixed in advance => discrete
• Fixed model structure, tune numerical parameters => continuous
Example: gradient descent in a
continuous space
[Figure: left, the data points (-1,10), (1,3), (2,0) in (x,y) space; right, the (a,b) parameter space, where each point (a,b) represents a line y = ax + b, color encodes the loss, and gradient descent steps toward the minimum.]
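A minimal gradient-descent sketch for exactly this setting: fit y = ax + b to the three points by repeatedly stepping against the gradient of the quadratic loss.

points = [(-1.0, 10.0), (1.0, 3.0), (2.0, 0.0)]

a, b = 0.0, 0.0  # starting point in (a, b) parameter space
lr = 0.05        # step size
for _ in range(1000):
    # Gradient of sum (a*x + b - y)^2 with respect to a and b:
    grad_a = sum(2 * x * (a * x + b - y) for x, y in points)
    grad_b = sum(2 * (a * x + b - y) for x, y in points)
    a, b = a - lr * grad_a, b - lr * grad_b
print(a, b)  # converges to roughly a = -3.36, b = 6.57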
Candidate Elimination: illustration
• A company produces intelligent robots. Some robots
misbehave. We suspect that one particular combination of
features is the cause for this misbehavior.
• For ease of discussion, we here assume robots have four
relevant characteristics:
• Color : B R M
• Body shape : S T
• Legs/wheels: L W
• #“eyes” : 1 2
• Find the combination that misbehaves
Candidate Elimination: illustration
• We will represent a hypothesis as a tuple <color, body, legs,
eyes> where color = B, R, M or ? (? means “any color”) etc.
• Hypothesis space: {B,R,M,?} x {S,T,?} x {L,W,?} x {1,2,?}
• Let S(h) be the set of robots characterized by a hypothesis h
• Hypothesis h1 is more general than h2 if and only if S(h2)⊆S(h1)
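A minimal sketch of this generality test for such tuple hypotheses: h1 is more general than h2 exactly when, position by position, h1 has ‘?’ or agrees with h2.

def covers(h, robot):
    # h covers a robot if every position is '?' or matches exactly.
    return all(hv == "?" or hv == rv for hv, rv in zip(h, robot))

def more_general(h1, h2):
    # S(h2) is a subset of S(h1) iff h1 generalizes h2 position by position.
    return all(a == "?" or a == b for a, b in zip(h1, h2))

print(more_general(("B", "?", "?", "?"), ("B", "S", "L", "1")))  # True
print(more_general(("B", "S", "?", "?"), ("B", "?", "L", "1")))  # False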
Search space is a lattice
[Lattice figure: the most general hypothesis <?,?,?,?> at the top; below it, partially specified hypotheses such as BS??, B?L?, ??L1; below those, fully specified hypotheses such as BSL1; and ⊥ at the bottom.]
Candidate Elimination
Observation 1: a robot that misbehaves
[Lattice figure: hypotheses that do not cover this example are eliminated.]
Candidate Elimination
Observation 2: a robot that does not misbehave
[Lattice figure: hypotheses that cover this example are eliminated.]
Candidate Elimination
• The candidate elimination algorithm illustrates
• Search in a discrete hypothesis space (with lattice structure)
• Search for all solutions, rather than just one, in an efficient manner
• Importance of generality ordering
• We’ll see many other learning approaches, all with their own pros & cons
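As a complement to the algorithm above, here is a brute-force sketch for the robot example (with made-up observations): instead of maintaining the version-space boundaries efficiently, it simply enumerates the whole hypothesis space and keeps every hypothesis consistent with the data.

from itertools import product

def covers(h, robot):
    return all(hv == "?" or hv == rv for hv, rv in zip(h, robot))

# The hypothesis space {B,R,M,?} x {S,T,?} x {L,W,?} x {1,2,?}
space = list(product("BRM?", "ST?", "LW?", "12?"))

observations = [                   # hypothetical data
    (("B", "S", "L", "2"), True),  # this robot misbehaves
    (("B", "T", "L", "2"), False), # this robot behaves
]
# Keep hypotheses that cover all misbehaving robots and no well-behaved ones.
version_space = [h for h in space
                 if all(covers(h, r) == label for r, label in observations)]
print(len(version_space), version_space[:5])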