Lecture B1 - Overview and Intro

Beginselen van Machine Learning

MSc Computerwetenschappen, Artificiële Intelligentie

Principles of Machine Learning


MSc Computer Science

H. Blockeel, J. Davis, L. De Raedt


Overview and
practicalities
Actually 2 courses…
• H0E96A: Beginselen van machine learning
• H0E98A: Principles of machine learning
• Same content, different language

• Study materials for both courses are in English


• Live lectures mostly in Dutch. Lecture recordings in both Dutch and English
• Exercise sessions: explanations available in both Dutch and English, according
to each student’s preference
• Exam: bilingual Dutch / English

• Toledo: only one course is activated, please register for H0E96A (Beginselen…)

3
Goals & materials
• Goals of the course:
• 1. provide an overview of machine learning theory and methods, including insight
in how / why the different methods work
• 2. enable you to apply machine learning in non-trivial contexts

• Study material: slides + recordings + reader (pointers to a selection of texts)


• Machine learning is a broad and fast-evolving domain
• Textbooks tend to be either deep and narrow (focusing on one particular type of
approach), or broad but less deep
• This course aims at being broad and deep, and is meant for students with a
strong technical background in engineering, mathematics, and computer science
• No suitable single textbook

4
Background
• Course reserved for Master of CS / Master of AI
• We assume knowledge of
• Calculus
• Linear algebra
• Probability theory
• Statistics
• Programming & Algorithms
• Artificial Intelligence (H06U1A)
as seen in earlier courses in KU Leuven’s bachelor programs
Informatica / Burgerlijk Ingenieur - Computerwetenschappen

5
Teaching schedule
• Week 1 (23/9): Wed: Introduction - Blockeel | Thu: K-nearest neighbors, evaluation metrics - Davis
• Week 2 (30/9): Wed: Decision trees - Blockeel (flipped classroom, 14:00 Dutch / 15:00 English) | Thu: Ensembles - Davis
• Week 3 (7/10): Wed: Rule learning - Blockeel (14:00 / 15:00) | Thu: Experimental methodology - Davis
• Week 4 (14/10): Wed: Learning theory - Blockeel
• Week 5 (21/10): Wed: Q&A session | Thu: Statistical learning, introduction to optimization - Davis
• Week 6 (28/10): Wed: Support vector machines - Blockeel
• Week 7 (4/11): Wed: Artificial neural networks - Blockeel
• Week 8 (11/11): Wed: Artificial neural networks - Blockeel
• Week 9 (18/11): Wed: Q&A session
• Week 10 (25/11): —
• Week 11 (2/12): Wed & Thu: Reinforcement learning - De Raedt
• Week 12 (9/12): —

Legend: Dutch (recording in English available); English (recording with Dutch subtitles available); Bilingual. (The “Ex.” column of the original table marked with ✔︎ the weeks that have an exercise session.)

6
The Exam
• Written exam, closed book. Testing both knowledge and insight, both theory and practice.

• Q: “Up to what level of detail should I study this? Do I need to know everything that’s in the
reader?”
• A: You should
• know & understand everything mentioned in the lectures
• be able to solve problems of similar difficulty as those covered in the exercise sessions
• be able to answer questions that require you to reason about the concepts you’ve
seen
• be able to extrapolate: apply a concept in a different context
Reading materials are meant to help you digest the content of the lectures, not provide
additional content (unless mentioned otherwise)

7
You should be able to…
• Execute (parts of) algorithms we’ve seen on concrete data. E.g.: given some data,
• show the kernel matrix that an SVM learner computes
• perform a backpropagation step in a given neural network
• Reason about the behavior of an algorithm on a more abstract level (based on its
properties rather than on mimicking it)
• E.g.: an SVM has k support vectors out of n instances, and 0 training error: give a
lower bound on the accuracy estimate obtained using leave-one-out cross-validation
• What would happen if we change … in the algorithm we’ve seen?
• The answer is not given in the course materials, but you can infer it by reasoning
about the concepts that we have seen
• Explain the architecture of a particular type of model, and explain its advantages and
disadvantages
• …

8
This is a challenging course
• In some exam periods, <50% of students pass this exam. Mean score
often <10. (Max score still often 19 or 20)
• Perceived causes for low grades:
• Not reading the question carefully
• Tendency to reproduce rather than reason
• Tendency to study questions & answers, rather than the course
itself
• Superficial rather than deep understanding of course topics
• Standard investment in a 6-ECTS course: 180 hours (lectures,
exercises, self-study). Aim for 12 hours per week during the semester.

9
What is Machine
Learning?
Discussion time…

• How would you define machine learning?
• learn from data; find patterns in data and generalize
• In what contexts is it used?
• e.g., recognizing cancer; Amazon showing you advertisements
• How does it relate to the rest of AI?
• learning from examples is machine learning; AI is broader
11
The RoboSail project
• Pieter Adriaans (Univ. of Amsterdam), around 2003: first autopilot
for sailing boats (www.robosail.com)
• No suitable mathematical theory => had to learn how to sail
• Boat full of AI technology (agents, sensors, ...), including
machine learning components

12
Language learning
• Children learn language by simply hearing sentences being used
in a certain context
• Can a computer do the same: given examples (sentence +
description of context), learn the meaning of words/sentences?

“Mike is kicking the ball”

[Figure: from (sentence, scene) data, learn a mental model, then test it.
Data & pictures by Zitnick et al., 2013]
13
Autonomous cars
• Very first race with fully autonomous cars was in… 2004
• DARPA grand challenge: have autonomous cars race each other
on desert roads
• In 2004, no winner - the “best” car got only about 12 km
• In 2005, five cars made it to the finish (212 km)

14
The Robot Scientist
• King et al., Nature, 2004
• Scientific research, e.g., in drug discovery, is iterative:
• Determine what experiment to perform
• Perform the experiment
• Interpret the results (which feed back into the choice of the next experiment)

• The Robot Scientist removes the human from the loop, by reasoning about its
own learning process: which new experiments will be most informative?

• 2nd version, “Eve” (2015), discovered a lead against malaria on its first run

15
Automating manual tasks
• E.g.: nurse rostering in a hospital: need to accommodate non-obvious
constraints (e.g., leave enough time between shifts)
• Hard to automate, unless constraints can be learned from earlier examples

Illustration from L. De Raedt’s


SYNTH project,
picture by G. De Smet

16
Other applications…
• Recommender systems: e.g., Google, Facebook, Amazon, … try
to show you ads you might like
• Email spam filters: by observing which mails you flag as “spam”,
try to learn your preferences
• Natural language processing: e.g., Sentiment analysis: is this
movie review mostly positive or negative?
• Vision: learn to recognize pedestrians, …
• … and many, many more

• P. Domingos’ bestseller The Master Algorithm provides an excellent


account of how machine learning affects our daily life

17
Definitions of machine learning?
• Tom Mitchell, 1996: Machine learning is the study of how to make programs
improve their performance on certain tasks from (own) experience
• “performance” = speed, accuracy, …
• “experience” = earlier observations
• “Improve performance” in the most general meaning: this includes learning
from scratch.
• Useful (in principle) for anything that we don’t know how to program -
computer “programs itself”
• Vision: recognizing faces, traffic signs, …
• Game playing, e.g., AlphaGo
• Link to artificial intelligence : computer solves hard problems autonomously

18
Machine learning vs. other AI
• In machine learning, the key is data
• Examples of questions & their answer
• Observations of earlier attempts to solve some problem

• Machine learning makes use of inductive inference: reasoning from


specific to general
• In statistics: sample → population
• In philosophy of science: concrete observations → general theory
• In machine learning: observations → any situation

• This aspect of machine learning links it to data mining, data analysis,


statistics, …

19
Machine learning and AI
• Many misconceptions about machine learning these days
• Since the mid-2000s, “Deep Learning” has received a lot of
attention: it revolutionized computer vision, speech recognition,
natural language processing
• Avalanche of new researchers drawn to the field, without knowledge
of the broader field of AI, or history of ML (“AI = ML = deep learning”)
• See, e.g., A. Darwiche, https://www.youtube.com/watch?v=UTzCwCic-Do
(also published in Communications of the ACM, October 2018)

[Timeline, 1970-2010:]
• AI (“did not work”, “not precise”)
• Logic, expert systems (“precise”, “formal”)
• Machine Learning (“worked much better”)
• Deep Learning (“works best!”, “forget all the rest”)
20
Machine learning and AI
• There is still progress on all fronts, deep learning is just one of them
• This course reflects that viewpoint
• (schema below is incomplete, just serves to illustrate complexity of
scientific impact)

[Schema: AI branches into Constraint solving (SAT solvers, ASP, …), AI for Agents,
Machine Learning, and Logic. Logic “works well for some problems” but is “too rigid
(noise, uncertainty)” and “doesn’t work for other problems”. Machine Learning combined
with Logic gives Inductive logic programming and Probabilistic Logics; together with
Statistical Relational Learning these lead to Lifted Learning & Inference. “Subsymbolic”
methods (neural networks) lead to Deep learning (DL).]
21
The machine learning landscape

[Word cloud, organized along four axes: tasks, techniques, models, applications]
• Tasks: classification, regression, clustering, transfer learning, reinforcement learning, …
• Techniques: convex optimization, greedy search, matrix factorization, Bayesian learning, learning theory, …
• Models: automata, support vector machines, neural networks, nearest neighbors, decision trees, rule learners, probabilistic graphical models, statistical relational learning, deep learning, …
• Applications: recommender systems, natural language processing, vision, speech, …
22
Basic concepts and
terminology
Machine learning
• ML in its typical form:
• Input = dataset
• Output = some kind of model
• ML in its most general form:
• input = knowledge
• output = a more general form of knowledge
• Learning = inferring general knowledge from more specific knowledge (observations
➔ model) = inductive inference

• Learning methods are often categorized according to the format of the input and
output, and according to the goal of the learning process (but there are many more
dimensions along which they can be categorized)

24
A typical task
• Given examples of pictures + label (saying what’s on the picture),
infer a procedure that will allow you to correctly label new pictures
• E.g.: learn to classify fish as “salmon” or “sea bass”

25
A generic approach
• Find informative features (here: lightness, width)
• Find a line/curve/hyperplane/… in this feature space that separates
the classes

26
27
28
29
Predictive versus descriptive
• Predictive learning : learn a model that can predict a particular property / attribute / variable
from inputs
• Many tasks are special cases of predictive learning
• E.g., face recognition: given a picture of a face, say who it is
• E.g., spam filtering: given an email, say whether it’s spam or not
Name of task Learns a model that can …
Concept learning / Distinguish instances of class C from other instances
Binary classification
Classification Assign a class C (from a given set of classes) to an instance
Regression Assign a numerical value to an instance
Multi-label classification Assign a set of labels (from a given set) to an instance
Multivariate regression Assign a vector of numbers to an instance
Multi-target prediction Assign a vector of values (numerical, categorical) to an instance
Ranking Assign ≤ or > to a pair of instances

30

Predictive versus descriptive


• Descriptive learning : given a data set, describe certain
patterns in the dataset (or in the population it is drawn from)
• E.g., analyzing large databases:
• “Bank X always refuses loans to people who earn less than
1200 euros per month”
• “99.7% of all pregnant patients in this hospital are female”
• “At supermarket X, people who buy cheese are twice as likely
to also buy wine”

31
Function learning
• Task : learn a function X →Y that fits the given data (with X and Y sets
of variables that occur in the data)
• Such a function will obviously be useful for predicting Y from X
• May also be descriptive, if we can understand the function
• Often, some family of functions F is given, and we need to estimate
the parameters of the function f in F that best fits the data
• e.g., linear regression : determine a and b such that y = ax + b fits
the data as well as possible
• What does “fit the data” mean? Measured by a so-called loss function
• e.g., quadratic loss: Σ_{(x,y)∈D} (f(x) − y)^2, with f the learned function and
D the dataset
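The quadratic loss can be sketched in a few lines of Python; the function f and the tiny dataset below are made-up illustrations:

```python
# Quadratic (squared-error) loss of a learned function f on a dataset D:
# sum over (x, y) in D of (f(x) - y)^2.
def quadratic_loss(f, data):
    return sum((f(x) - y) ** 2 for x, y in data)

# Example: suppose we learned f(x) = 2x + 1; evaluate it on a toy dataset.
f = lambda x: 2 * x + 1
D = [(0, 1), (1, 3), (2, 6)]
print(quadratic_loss(f, D))  # (1-1)^2 + (3-3)^2 + (5-6)^2 = 1
```

A perfect fit gives loss 0; the worse the fit, the larger the loss.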

32
Distribution learning
• Task: given a data set drawn from a distribution, estimate this distribution
• Often made distinction: parametric vs. non-parametric
• Parametric: a family of distributions is given (e.g., “Gaussian”), we only need to
estimate the parameters of the target distribution
• Non-parametric: no specific family is assumed
• Often made distinction: generative vs. discriminative
• Generative: learn the joint probability distribution (JPD) over all variables (once
you have that, you can generate new instances by random sampling from it)
• Discriminative: learn a conditional probability distribution of Y given X, for some
given set of variables X (called input variables) and Y (called target variables)

[Figure: the same scattered data fitted two ways - a parametric fit (one curve from a
fixed family) vs. a non-parametric fit]
33
These categorizations are somewhat
fuzzy…
• A descriptive pattern may be useful for prediction
• “Bank X always refuses loans to people who earn less than
1200 euros per month” (description)
• Bob earns 1100 euros per month => Bank X will not give him
a loan
• While functions are directly useful for prediction, a probability
distribution can be used just as well
• Given known information X, predict as value for Y, the value
with the highest conditional probability given X

34
Parametric vs. non-parametric
• Parametric: a family of functions (or distributions, or …) is given,
and each function is uniquely defined by the values of a fixed set
of parameters
• e.g. (function learning): linear regression
• e.g. (distribution learning): fitting a gaussian
• Non-parametric: no specific family of functions is assumed
• Typically, we are searching a space that contains models with
varying structure, rather than just different parameter values
• This often requires searching a discrete space
• E.g.: decision trees, rules, …. (see later)

35
Link with “explainable AI”
• Explainable AI (XAI) refers to the study of AI systems that can explain their
decisions / whose decisions we can understand
• Two different levels here:
• We understand the (learned) model used for decision making
• We understand the individual decision
• E.g. “I could not get a loan because I earn too little”: we can understand this
decision even if we don’t know the whole decision process the bank uses
• A learned model that is not straightforward to interpret is called a black-box model
• Machine learning poses additional challenges for XAI, as it often learns
black-box models

36
Responsible AI : challenges
• Privacy-preserving data analysis
• We need lots of data to learn from; this may include personal data
• How can we guarantee that the analysis of these data will not violate
the privacy of the people whose data this is?
• Generally, when data is collected, consent is needed for a specific
purpose, and data must be used solely for that purpose — how can we
guarantee it won’t be abused?

• Learning “safe” models : models that will not violate certain constraints
that are imposed (including constraints on bias, discrimination, privacy, …)

37
Predictive learning
• A very large part of machine learning focuses on predictive
learning
• In the following, we zoom in on that part

38
Prediction: task definition
The prediction task, in general:
o Given: a description of some instance
o Predict: some property of interest (the “target”)

Examples:
o classify emails as spam / non-spam
o classify fish as salmon / bass
o forecast tomorrow’s weather based on today’s measurements

How? By analogy to cases seen before

39
Training & prediction sets
• Training set: a set of examples, instance descriptions that
include the target property (a.k.a. labeled instances)
• Prediction set: a set of instance descriptions that do not include
the target property (“unlabeled” instances)

• Prediction task : predict the labels of the unlabeled instances

[Figure: labeled images (“Dog”, “Cat”) form the training set; unlabeled images (“???”)
form the prediction set]
40
Inductive vs. transductive learning
We can consider as outcome of the learning process, either
• the predictions themselves: transductive learning
• or: a function that can predict the label of any unlabeled
instance: inductive learning

[Figure, shown in three steps: labeled points (x1,y1), (x2,y2), (x3,y3) plus unlabeled
points x4, x5. Transduction: outcome = predictions y4, y5 for x4 and x5.
Induction: outcome = a function f: X→Y for making predictions, yielding f(x4), f(x5).]

43
Interpretable vs. black-box
The predictive function or model learned from the data may be
represented in a format that we can easily interpret, or not

Non-interpretable models are also called black-box models

In some cases, it is crucial that predictions can be explained (e.g.: bank


deciding whether to give you a loan)

Note difference between explaining a model and explaining a prediction

44
Overfitting and underfitting
• “Occam’s razor”: among equally accurate models, choose
the simpler one
• Trade-off: explain data vs. simplicity
• Both overfitting and underfitting are harmful

45
Levels of supervision
• Supervised learning: learning a (predictive) model from labeled
instances (as in cats & dogs example)

• Unsupervised learning: learning a model from unlabeled instances


• such models are usually not directly predictive (without any
information on what to predict, how could you learn from that?)
• still useful indirectly, or for non-predictive tasks: see later

• Semi-supervised learning: learn a predictive model from a few


labeled and many unlabeled examples

46
Semi-supervised learning
• How can unlabeled examples help learn a better model?

[Figure, shown in three steps: two classes (+ and −), instances represented in a
2-dimensional space, and a query point “?”. With only a few labeled points the best
decision boundary is unclear; adding many unlabeled points reveals cluster structure
that suggests where the boundary should lie.]

49
Unsupervised learning
• Can you see three classes here?
• Even though we don’t know the names of the classes, we still see some structure
(clusters) that we could use to predict which class a new instance belongs to
• Identifying this structure is called clustering
• From a predictive point of view, this is unsupervised learning

[Figure: unlabeled instances forming three visible clusters in a 2-dimensional space]
50
PU-learning
• PU-learning is a special case of semi-supervised learning
• PU stands for “positive / unlabeled”
• All the labeled examples belong to one class (called the
“positive” class)

[Figure: a few labeled positives (+) among many unlabeled instances]

“Mike is kicking the ball” - learning the meaning of “kicking the ball” requires
PU-learning because: when Mike kicks the ball, the sentence may mention this, or not;
when Mike does not kick the ball, it is never mentioned that he does not.
51
Weakly supervised learning
• Weakly supervised learning is a generalized form of semi-supervised learning
• Semi-supervised: for a single instance, we either know its label or we do not
• Weakly supervised: we may have partial information about a label
• e.g., it is certainly a member of a given set (= superset learning)
• e.g., at least one instance among a given set of instances has the label,
but we do not know which one (= multi-instance learning)
• e.g., we know two instances have the same label, but we don’t know
which one it is (= constraint-based clustering)
• …

[Examples: “There’s a Lamborghini in this picture” (multi-instance learning);
“This is either a Ferrari or a Lamborghini” (superset learning)]
52
Relationship between different
supervision settings

[Diagram: Predictive learning spans Supervised Learning and Unsupervised Learning.
Between them lies Weakly supervised learning, comprising semi-supervised learning
(with PU-learning as a special case), multi-instance learning, superset learning,
and constraint-based clustering.]

53
Format of input data
Format of input data
• Input is often assumed to be a set of instances that are all described using the
same variables (features, attributes)
• The data are “i.i.d.”: “independent and identically distributed”
• The training set can be seen as a random sample from one distribution
• The training set can be shown as a table (instances x variables) : tabular data
• This is also called the standard setting

• There are other formats: instances can be


• nodes in a graph
• whole graphs
• elements of a sequence
• …

55
Format of input data: tabular

Training set:
Sepal length | Sepal width | Petal length | Petal width | Class
5.1 | 3.5 | 1.4 | 0.2 | Setosa
4.9 | 3.0 | 1.4 | 0.2 | Setosa
7.0 | 3.2 | 4.7 | 1.4 | Versicolor
6.3 | 3.3 | 6.0 | 2.5 | Virginica

Prediction set:
Sepal length | Sepal width | Petal length | Petal width | Class
4.8 | 3.2 | 1.3 | 0.3 | ?
7.1 | 3.3 | 5.2 | 1.7 | ?

56
Format of input data: sequences
• Learning from sequences:
• 1 prediction per sequence?
• 1 prediction per element?
• 1 element in sequence can be …
• A number (e.g., time series)
• A symbol (e.g., strings)
abababab: +
• A tuple aabbaabb: -
• A more complex structure

57
Format of input data: trees
• 1 prediction per tree / per node in the tree
• Nodes can be …
• Unlabeled
• Labeled with symbols (e.g., HTML/XML structures)
• …
[Figure: an HTML tree with a <ul> root and <li> children; one <li> contains a bold
label “Address:” followed by a text field]
E.g.: this tree indicates as “positive” a text field preceded by “Address:”
inside a list (<li>) context
58
Format of input data: graph
• Example: Social network
• Target value known for some
nodes, not for others
• Predict node label
• Predict edge
• Predict edge label
• …
• Use network structure for
these predictions

59
Format of input data: raw data
• “Raw” data are in a format that seems simple (e.g., a vector of
numbers), but components ≠ meaningful features
• Example: photo (vector of pixels)
• Raw data often need to be processed in a non-trivial way to obtain
meaningful features; on the basis of these features, a function can be
learned
• This is what deep learning excels at

(Image: Nielsen, 2017, Neural networks and deep learning)

60
Format of input data: knowledge
• “Knowledge” can consist of facts, rules, definitions, ….
• We can represent knowledge about some domain in a
knowledge representation language (such languages are often
based on logic)

atm(m1,a1,o,2,3.43,-3.11,0.04).
atm(m1,a2,c,2,6.03,-1.77,0.67).
...
bond(m1,a2,a3,2).
bond(m1,a5,a6,1).
bond(m1,a6,a7,du).
...

hacc(M,A) :- atm(M,A,o,2,_,_,_).
hacc(M,A) :- atm(M,A,o,3,_,_,_).
hacc(M,A) :- atm(M,A,s,2,_,_,_).
hacc(M,A) :- atm(M,A,n,ar,_,_,_).
zincsite(M,A) :- atm(M,A,du,_,_,_,_).
hdonor(M,A) :- atm(M,A,h,_,_,_,_), not(carbon_bond(M,A)), !.
...

61
Data preprocessing
• Data may not be in a format that your learner can handle
• Data wrangling: bring it into the right format
• Even if it’s in a format your learner can handle (e.g., tabular), the
features it contains may not be very informative, or there may be
very few relevant features among many irrelevant ones.
• E.g.: individual pixels in an image are usually not very informative
• Feature selection: select among many input features the most
informative ones
• Feature construction: construct new features, derived from the
given ones

62
What learning method to use?
• Which learners are suitable for your problem, depends strongly
(but not solely!) on the structure of the input data
• Most learners use the standard format
• A set of instances, where each instance is described by a
fixed set of attributes (a.k.a. features, variables)
• also called attribute-value format or tabular format
• At the other extreme, inductive logic programming handles any
kind of knowledge that can be represented using clausal logic
• This includes sequences, graphs, …

63
What learning method to use?
• The data format and the learning task impose strong constraints
on which learning methods can be used
• Other aspects determine whether the method performs well:
• Inductive bias (see later)
• Ability to handle missing values
• Ability to handle noise
• Ability to handle high-dimensional data
• Ability to handle large datasets
• Ability to generalize from small datasets (avoid overfitting)
• We’ll cover many of these aspects at different points in the course

64
Missing data

Sepal length | Sepal width | Petal length | Petal width | Class
5.1 | ? | 1.4 | 0.2 | Setosa
4.9 | 3.0 | 1.4 | 0.2 | Setosa
7.0 | 3.2 | ? | 1.4 | Versicolor
6.3 | 3.3 | 6.0 | 2.5 | Virginica

Some training examples may have missing values… how to handle these?
Some options:
1. Leave out from training set
- Information loss… Missingness itself can be relevant! It may correlate
with the class (e.g., exit polls), …
2. Guess the missing value
- What if the guess is wrong? Statisticians distinguish MCAR, MAR, NMAR:
Missing (Completely) At Random, or Not
3. Treat ‘?’ as a separate value

Handling missing data can be tricky…
Some learning methods “can handle them” (no user
intervention needed) - but not always optimally
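Option 2 (guessing the missing value) can be sketched as simple per-column mean imputation; this minimal version uses `None` to mark missing values and made-up iris-like numbers:

```python
# Mean imputation: replace each missing value (None) by the mean of the
# observed values in its column. A very simple instance of "guess the value".
def impute_mean(rows):
    cols = list(zip(*rows))
    means = [sum(v for v in col if v is not None) /
             sum(1 for v in col if v is not None) for col in cols]
    return [[m if v is None else v for v, m in zip(row, means)] for row in rows]

data = [[5.1, None, 1.4],
        [4.9, 3.0, 1.4],
        [7.0, 3.2, None]]
print(impute_mean(data))  # missing width -> 3.1, missing length -> 1.4
```

Note that this ignores the MCAR/MAR/NMAR distinction entirely: if missingness correlates with the class, mean imputation can bias the learner.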

65
Output formats,
methods (overview)
Output formats
• The output of a learning system is a model
• Many different types of model exist
• The learning algorithm or method is strongly linked to the type of model
• High-level overviews of machine learning methods often categorize
them along this axis

67
Different views of the landscape
• Domingos, “five tribes”: Symbolists, Connectionists, Evolutionaries, Bayesians, Analogizers
• Flach, “three types of models”: Probabilistic, Geometric, Logical
• Bishop, “the world is Bayesian”: Bayes

68
Parametrized functions
• Typically, a certain format for the functions is provided; e.g.:
linear functions of the inputs
• Within this set, we look for the parameter values that best fit the
data
• Standard example: linear regression
[Scatter plot: Petal.Width vs. Petal.Length for the iris data, with the fitted
regression line]
y = ax + b; fitted: width = 0.416*length − 0.363
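For linear regression the best-fitting parameters have a closed form; a minimal sketch, using toy data rather than the iris measurements from the plot:

```python
# Ordinary least squares for y = a*x + b (closed-form estimates).
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Toy data lying exactly on y = 0.5*x + 0.9:
a, b = fit_line([1, 2, 3, 4], [1.4, 1.9, 2.4, 2.9])
print(a, b)  # a = 0.5, b = 0.9 (up to floating-point rounding)
```

Run on the iris data, the same procedure would yield the slide's fitted line width = 0.416*length − 0.363.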

69
Conjunctive concepts
• A conjunctive concept is expressed as a set of conditions, all of
which must be true
• “x has class C if and only if <condition1> and <condition2> and
… and <condition k>”

• E.g.: accept application for mortgage if and only if :


salary ≥ 3 * monthly payback and no other mortgage running
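The mortgage rule is a conjunction of two conditions, both of which must hold; a minimal sketch (thresholds as on the slide, the function name is ours):

```python
# Conjunctive concept: accept the mortgage application if and only if
# (salary >= 3 * monthly payback) AND (no other mortgage running).
def accept_mortgage(salary, monthly_payback, has_other_mortgage):
    return salary >= 3 * monthly_payback and not has_other_mortgage

print(accept_mortgage(3000, 900, False))   # True: 3000 >= 2700, no other mortgage
print(accept_mortgage(3000, 1200, False))  # False: 3000 < 3600
```

If any single condition fails, the instance is not in the concept - that is what makes the concept conjunctive.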

70
Rule sets
• A rule set is a set of rules of the form “if … then …” or “if … then
… else …”
• Example: definition of leap years

Input:
Examples of leap years: 1992, 2000, 2004, …
Examples of non-leap years: 1900, 1993, 2011, 2018, …

Output:

If year is a multiple of 400 then leap


else if year is a multiple of 100 then not leap
else if year is a multiple of 4 then leap
else not leap
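The learned rule set maps directly onto an ordered if/then/else chain:

```python
# The leap-year rule set as an ordered rule list: rules are tried top to
# bottom, and the first one that fires determines the answer.
def is_leap(year):
    if year % 400 == 0:
        return True
    elif year % 100 == 0:
        return False
    elif year % 4 == 0:
        return True
    else:
        return False

print([y for y in (1900, 1992, 2000, 2004, 2011) if is_leap(y)])  # [1992, 2000, 2004]
```

The rule order matters: testing "multiple of 4" before "multiple of 100" would wrongly classify 1900 as a leap year.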

71
Decision trees
• A decision tree represents a stepwise procedure to arrive at
some decision

Is today a good day to play tennis?

72
Neural networks
• A neural network is a complex structure of neurons, each of
which aggregate multiple input signals into a single output signal

[Figure: feedforward network with inputs x, y, … and an output unit “Out”;
each hidden unit computes e.g. h11 = f(a11·x + b11·y + c11·z)]
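A single neuron's computation, h11 = f(a11·x + b11·y + c11·z), can be sketched as follows; the weights below are illustrative, and we assume a sigmoid activation:

```python
import math

# One neuron: a weighted sum of the inputs, passed through an activation f.
def neuron(weights, bias, inputs, f=lambda s: 1 / (1 + math.exp(-s))):
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return f(s)

# h11 = f(0.5*x - 0.2*y + 0.1*z) with inputs (1.0, 2.0, 3.0):
print(neuron([0.5, -0.2, 0.1], 0.0, [1.0, 2.0, 3.0]))  # sigmoid(0.4) ≈ 0.5987
```

A network is just many such neurons, with the outputs of one layer feeding the inputs of the next.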

73
Probabilistic graphical models
• A PGM represents a (high-dimensional) joint distribution over
multiple variables as a product of (low-dimensional) factors
• Different type of PGMs: Bayesian networks, Markov networks,
factor graphs, …

[Figure: Bayesian network with edges A→C, B→C, C→D, C→E, annotated with factors
f1(a), f2(b), f3(a, b, c), f4(c, d), f5(c, e)]
f(a, b, c, d, e) = f1(a) ⋅ f2(b) ⋅ f3(a, b, c) ⋅ f4(c, d) ⋅ f5(c, e)


P(A, B, C, D, E) = P(A) ⋅ P(B) ⋅ P(C | A, B) ⋅ P(D | C) ⋅ P(E | C)
74
Instance-based learning
(a.k.a. “nearest neighbor methods”)
• The “model” is simply the data set itself
• Predictions for a new case are made by comparing it to earlier observed cases
• If it’s similar for observed features, it’s probably also similar for unobserved (to
be predicted) features

[Figure: a query point “?” among observed instances in feature space; its nearest
neighbors determine the prediction]
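Nearest-neighbor prediction can be sketched in a few lines; the training points below are made up, and we assume Euclidean distance with majority vote over the k closest cases:

```python
import math
from collections import Counter

# k-nearest-neighbor classification: predict the majority label among the
# k training instances closest to the query point.
def knn_predict(train, query, k=3):
    # train: list of (feature_tuple, label) pairs
    neighbors = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((1, 1), "+"), ((1, 2), "+"), ((2, 1), "+"),
         ((5, 5), "-"), ((6, 5), "-")]
print(knn_predict(train, (1.5, 1.5)))  # "+"
```

Note that "learning" here is trivial (store the data); all the work happens at prediction time.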
75
Search methods
• How do we find the most suitable model?
• Sometimes, there is a closed form solution (e.g., linear regression)
• If not, we typically need to search some hypothesis space
• Two very different types of spaces, each with their own search
methods :
• Discrete spaces (methods: hill-climbing, best-first, …)
• Continuous spaces (methods: gradient descent, …)
• Typically:
• Model structure not fixed in advance => discrete
• Fixed model structure, tune numerical parameters => continuous

76
Example: gradient descent in a
continuous space
[Figure, left: input/output space with data points (−1,10), (1,3), (2,0) and candidate
lines y=2x, y=−x+10, y=x+3. Right: parameter space with axes A and B, where a point
(a,b) represents the line y = ax + b; color encodes the loss, and gradient descent
moves step by step toward the minimum.]

Input/output space vs. parameter space
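Gradient descent in the parameter space (a, b) can be sketched as follows, assuming the three data points from the figure and the quadratic loss; the learning rate and iteration count are illustrative choices:

```python
# Gradient descent on the quadratic loss of y = a*x + b,
# for the three points shown in the input/output panel.
data = [(-1, 10), (1, 3), (2, 0)]

def grad(a, b):
    # Partial derivatives of sum over (x,y) of (a*x + b - y)^2.
    da = sum(2 * (a * x + b - y) * x for x, y in data)
    db = sum(2 * (a * x + b - y) for x, y in data)
    return da, db

a, b = 0.0, 0.0                  # start somewhere in parameter space
for _ in range(2000):
    da, db = grad(a, b)
    a, b = a - 0.01 * da, b - 0.01 * db   # step against the gradient

print(round(a, 2), round(b, 2))  # -3.36 6.57, the least-squares minimum
```

Each update moves the point (a, b) downhill in the loss landscape; for this convex loss the procedure converges to the unique minimum.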


77
Example: Version Spaces
• We try to identify a conjunctive concept from data
• More specifically: given a hypothesis space H, return all concepts in
H that are consistent with the data. This set is called the Version
Space.
• The algorithm called Candidate Elimination does this by
exploiting a generality ordering over H, and returning only the
most general and most specific hypotheses in H that are
consistent with the data (the “borders” of H)
• This involves an exhaustive search in a discrete space

78
Candidate Elimination: illustration
• A company produces intelligent robots. Some robots
misbehave. We suspect that one particular combination of
features is the cause for this misbehavior.
• For ease of discussion, we here assume robots have four
relevant characteristics:
• Color: B, R, M
• Body shape: S, T
• Legs/wheels: L, W
• # “eyes”: 1, 2
• Find the combination that misbehaves

79
Candidate Elimination: illustration
• We will represent a hypothesis as a tuple <color, body, legs,
eyes> where color = B, R, M or ? (? means “any color”) etc.
• Hypothesis space: {B,R,M,?} x {S,T,?} x {L,W,?} x {1,2,?}
• Let S(h) be the set of robots characterized by a hypothesis h
• Hypothesis h1 is more general than h2 if and only if S(h2)⊆S(h1)

• Most general hypothesis is <?,?,?,?>

• Most specific hypothesis: there are many! <B,T,L,1> is one

• Can extend hypothesis space with ⊥: S(⊥)=∅
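For this wildcard language, coverage and the generality ordering S(h2)⊆S(h1) reduce to positionwise checks; a minimal sketch:

```python
# Hypotheses are tuples over {specific value, "?"}; "?" matches anything.
def covers(h, x):
    # Hypothesis h covers instance x iff every position is "?" or matches.
    return all(hv == "?" or hv == xv for hv, xv in zip(h, x))

def more_general(h1, h2):
    # h1 is more general than (or equal to) h2 iff S(h2) is a subset of S(h1);
    # for this language that holds iff each position of h1 is "?" or equals
    # the corresponding position of h2.
    return all(a == "?" or a == b for a, b in zip(h1, h2))

print(more_general(("?", "T", "?", "?"), ("B", "T", "L", "1")))  # True
print(covers(("B", "?", "?", "2"), ("B", "S", "W", "2")))        # True
```

Candidate elimination uses exactly these tests to prune the lattice: positive examples eliminate hypotheses that fail `covers`, negative examples eliminate those that satisfy it.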

80
Search space is a lattice
[Lattice diagram over the hypothesis space: <?,?,?,?> at the top; below it hypotheses
with one specified position (B???, R???, M???, ?S??, ?T??, ??L?, ??W?, ???1, ???2);
then two (BS??, BT??, …, ??W2); then three (BSL?, …, ?TW2); and at the bottom the fully
specific hypotheses BSL1, …, MTW2. Each edge connects a hypothesis to a slightly more
specific one.]
81
Candidate Elimination
Observation 1: this robot misbehaves!

[Same lattice; every hypothesis that does not cover the observed misbehaving robot
is eliminated, e.g. the branch containing B?W2, BS?2, ?SW2 remains under consideration]


82
Candidate Elimination
Observation 2: this robot does not misbehave

[Same lattice; every hypothesis that covers the observed well-behaved robot
is eliminated]


83
Candidate Elimination

[Figure: positive (+) and negative (−) examples in instance space;
the most/least general solutions define the whole version space]

85
Candidate Elimination
• The candidate elimination algorithm illustrates
• Search in a discrete hypothesis space (with lattice structure)
• Search for all solutions, rather than just one, in an efficient manner
• Importance of generality ordering

• Some obvious disadvantages:


• Not robust to noise: result = set of hypotheses consistent with all
data; 1 erroneous data point → set may be empty!
• Only conjunctive concepts : strong limitation

• We’ll see many other learning approaches, all with their own pros & cons

86
