Classification Methods I
CLASSIFICATION
Introducción a la Ciencia de Datos
Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with Applications in
R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
Some slides are based on Abbass Al Sharif's slides for his course DSO 530: Applied Modern Statistical Learning
Techniques.
Máster Universitario Oficial en Ciencia de Datos e Ingeniería de Computadores
Overview of Classification
• Examples:
① An online banking service must be able to determine
whether or not a transaction being performed on the site
is fraudulent, on the basis of the user’s IP address, past
transaction history, and so on.
② A person arrives at the emergency room with a set of
symptoms that could possibly be attributed to one of
three medical conditions. Which of the three conditions
does the individual have?
③ On the basis of DNA sequence data for a number of
patients with and without a given disease, a biologist
would like to figure out which DNA mutations are
deleterious (disease-causing) and which are not.
Overview of Classification
• We will always assume that we have observed a set of n
different data points. These observations are called the
training data because we will use these observations to
train, or teach, our method how to estimate f.
• Let $x_{ij}$ represent the value of the j-th predictor, or input, for
observation i, where i = 1, 2, …, n and j = 1, 2, …, p.
Overview of Classification
• Our goal is to apply a classification method to the
training data in order to estimate the unknown
function f.
• In other words, we want to find an estimate $\hat{f}$ such
that $Y \approx \hat{f}(X)$ for any observation (X, Y).
• In general, we do not really care how well the
classification method works on the training
data. Rather, we are interested in the accuracy of
the predictions that we obtain when we apply our
method to previously unseen test data.
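The gap between training and test accuracy can be made concrete with a small sketch. It is written in Python purely for illustration and uses a classifier that copies the label of the closest training point (the 1-nearest-neighbour rule). The data, the function names `nearest_label` and `accuracy`, and the mislabelled point at x = 4 are all hypothetical assumptions, not taken from the slides.

```python
# HYPOTHETICAL toy data: (feature value, class label).
# The point at x = 4 is deliberately labelled "b" to act as noise.
train = [(1, "a"), (2, "a"), (3, "a"), (4, "b"), (5, "a"), (6, "a"),
         (8, "b"), (9, "b"), (10, "b")]

# HYPOTHETICAL unseen test data drawn from the same two regions.
test = [(1.5, "a"), (3.8, "a"), (4.2, "a"), (8.5, "b"), (9.2, "b")]


def nearest_label(train_data, x):
    """Predict the label of the training point closest to x."""
    return min(train_data, key=lambda point: abs(point[0] - x))[1]


def accuracy(train_data, data):
    """Fraction of points in `data` whose label is predicted correctly."""
    hits = sum(nearest_label(train_data, x) == y for x, y in data)
    return hits / len(data)


train_acc = accuracy(train, train)  # every point is its own nearest neighbour
test_acc = accuracy(train, test)    # the noisy point misleads nearby queries

print(train_acc)  # 1.0
print(test_acc)   # 0.6
```

The rule scores perfectly on the data it was fitted to, yet the two test points near the noisy observation are misclassified, which is why performance should be judged on unseen data.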
K-NN
Classification Methods
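• The distance between two examples is measured with the
Euclidean distance:
$\mathrm{dist}(X_1, X_2) = \sqrt{(x_{11} - x_{21})^2 + (x_{12} - x_{22})^2 + \dots + (x_{1n} - x_{2n})^2}$
• where $X_1$ and $X_2$ are the examples to be compared, each
having n features. The term $x_{11}$ refers to the value of the
first feature of example $X_1$, while $x_{21}$ refers to the value of
the first feature of example $X_2$.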
ingredient   sweetness   crunchiness   food type   distance to the tomato
grape        8           5             fruit       2.2
green bean   3           7             vegetable   4.2
nuts         3           6             protein     3.6
orange       7           3             fruit       1.4
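The distances in the table can be checked with a short sketch, written here in Python for illustration. One value is an assumption: the table does not list the tomato's own features, so sweetness 6 and crunchiness 4 are used because they reproduce the tabulated distances under the Euclidean distance. The function name `euclidean_distance` is likewise hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("both examples must have the same number of features")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# ASSUMPTION: the tomato's features are not given in the table; (6, 4) is
# chosen because it reproduces the distances listed there.
tomato = (6, 4)

# (sweetness, crunchiness) as listed in the table.
ingredients = {
    "grape": (8, 5),
    "green bean": (3, 7),
    "nuts": (3, 6),
    "orange": (7, 3),
}

# Distance from each ingredient to the tomato, rounded to one decimal place.
distances = {
    name: round(euclidean_distance(features, tomato), 1)
    for name, features in ingredients.items()
}

print(distances)
# {'grape': 2.2, 'green bean': 4.2, 'nuts': 3.6, 'orange': 1.4}
```

Under this assumption the orange is the single nearest neighbour, so a 1-NN classifier would label the tomato as a fruit.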
Distance measures
Abu Alfeilat HA, Hassanat ABA, Lasassmeh O, Tarawneh AS, Alhasanat MB, Eyal Salman HS, Prasath VBS. Effects of Distance Measure Choice
on K-Nearest Neighbor Classifier Performance: A Review. Big Data. 2019 Dec;7(4):221-248. doi: 10.1089/big.2018.0175. Epub 2019 Aug 14.
PMID: 31411491.
Choosing an appropriate k
• Choosing k means balancing overfitting (a small k follows
noise in the training data) against underfitting (a large k
smooths over real structure). This balance is a problem
known as the bias-variance tradeoff.
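A rough illustration of how k affects the prediction, sketched in Python on hypothetical one-dimensional data (the points, labels, and the function name `knn_predict` are assumptions made for this example). The training point at x = 4 carries a deliberately noisy label.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query` (1-D)."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# HYPOTHETICAL toy data: (feature value, class label).
# Class "a" lives on the left, class "b" on the right; x = 4 is noise.
train = [(1, "a"), (2, "a"), (3, "a"), (4, "b"), (5, "a"), (6, "a"),
         (8, "b"), (9, "b"), (10, "b")]

# Small k: the prediction near x = 4 copies the noisy label.
print(knn_predict(train, 4.1, k=1))  # b

# Moderate k: the noisy point is outvoted by its neighbours.
print(knn_predict(train, 4.1, k=3))  # a

# k equal to the training set size: every query gets the overall
# majority class, even one deep inside the "b" region.
print(knn_predict(train, 9.5, k=3))  # b
print(knn_predict(train, 9.5, k=9))  # a
```

With k = 1 the classifier reproduces the noise; with k = 9 it ignores the query entirely and always answers with the majority class. An intermediate k avoids both failures on this toy set.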
K-NN
R session
Exercise 1
• Create a function my_knn that accepts any measure from
the philentropy package and performs basic knn.
• A possible function interface could be:
• The function will output the predictions over the test set.
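As a language-neutral sketch of the idea behind the exercise, the following Python outline shows a k-NN function whose distance measure is passed in as an argument. It is not the interface from the slide and does not use the philentropy package: the signature, the parameter names, and the two distance helpers are assumptions made for illustration. The training rows reuse the sweetness and crunchiness values from the ingredient table, and the test rows are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan (city-block) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def my_knn(train_x, train_y, test_x, k=1, distance=euclidean):
    """Return one predicted label per row of test_x.

    HYPOTHETICAL signature: `distance` is any callable taking two rows.
    """
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training rows")
    predictions = []
    for query in test_x:
        # Rank training rows by distance to the query; keep the k closest.
        ranked = sorted(range(len(train_x)),
                        key=lambda i: distance(train_x[i], query))
        votes = Counter(train_y[i] for i in ranked[:k])
        # Tie-breaking is left to Counter.most_common.
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# (sweetness, crunchiness) and food type from the ingredient table.
train_x = [(8, 5), (3, 7), (3, 6), (7, 3)]
train_y = ["fruit", "vegetable", "protein", "fruit"]

# HYPOTHETICAL test rows.
test_x = [(6, 4), (3, 6.6)]

print(my_knn(train_x, train_y, test_x, k=1))
# ['fruit', 'vegetable']
print(my_knn(train_x, train_y, [(6, 4)], k=3, distance=manhattan))
# ['fruit']
```

Passing the distance as a callable keeps the voting logic independent of the measure, which is the property the exercise asks for when swapping between measures.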