Classification Methods I

The document discusses the k-nearest neighbors (k-NN) classification algorithm. It explains how k-NN works by finding the k closest training examples in feature space and assigning a label based on the majority vote of those neighbors. The document discusses choosing k to balance bias and variance, preparing data for k-NN including normalizing features, and the strengths and weaknesses of the approach.

Máster Universitario Oficial en Ciencia de Datos e Ingeniería de Computadores

CLASSIFICATION
Introducción a la Ciencia de Datos

Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with Applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

Some slides are based on Abbass Al Sharif's slides for his course DSO 530: Applied Modern Statistical Learning Techniques.

Overview of Classification
• Examples:
① An online banking service must be able to determine
whether or not a transaction being performed on the site
is fraudulent, on the basis of the user’s IP address, past
transaction history, and so on.
② A person arrives at the emergency room with a set of
symptoms that could possibly be attributed to one of
three medical conditions. Which of the three conditions
does the individual have?
③ On the basis of DNA sequence data for a number of
patients with and without a given disease, a biologist
would like to figure out which DNA mutations are
deleterious (disease-causing) and which are not.

Overview of Classification
• We will always assume that we have observed a set of n
different data points. These observations are called the
training data because we will use these observations to
train, or teach, our method how to estimate f.

• Let xij represent the value of the jth predictor, or input, for
observation i, where i = 1, 2, …, n and j = 1, 2, …, p.

• Correspondingly, let yi represent the response variable for the ith observation. Then our training data consist of {(x1, y1), (x2, y2), …, (xn, yn)} where xi = (xi1, xi2, …, xip)T.

Overview of Classification
• Our goal is to apply a classification method to the
training data in order to estimate the unknown
function f.
• In other words, we want to find a function f such
that Y ≈ f(X) for any observation (X,Y).
• In general, we do not really care how well the
classification method works training on the training
data. Rather, we are interested in the accuracy of
the predictions that we obtain when we apply our
method to previously unseen test data.

K-NN
Classification Methods

Understanding classification using NN


• Nearest Neighbor (NN) Classifiers are defined by their
characteristic of classifying unlabeled examples by
assigning them the class of the most similar labeled
examples.
• NN classifiers are well-suited for classification tasks where
relationships among the features and the target classes
are difficult to understand, yet the items of similar class
type tend to be fairly homogeneous.
• If there is not a clear distinction among the groups, the
algorithm is by and large not well-suited for identifying the
boundary.

The k-NN algorithm


• The k-NN algorithm begins with a training dataset
containing examples that are classified into several
categories, as labeled by a nominal variable.
• Assume that we have a test dataset containing unlabeled
examples that otherwise have the same features as the
training data.
• For each record in the test dataset, k-NN identifies k
records in the training data that are the "nearest" in
distance/similarity, where k is an integer specified in
advance.
• The unlabeled test instance is assigned the class of the
majority of the k nearest neighbors.
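
The steps above can be sketched directly in R. This is a minimal illustration written for these slides, not library code: the function name `knn_predict` and the toy data are invented here for the example.

```r
# Minimal k-NN sketch (illustrative, not a library function):
# classify one unlabeled instance by majority vote among the
# k nearest training examples under Euclidean distance.
knn_predict <- function(train, train_labels, test_point, k = 3) {
  # distance from the test point to every training example
  dists <- sqrt(rowSums(sweep(train, 2, test_point)^2))
  # indices of the k nearest training examples
  nearest <- order(dists)[1:k]
  # majority vote among their labels
  votes <- table(train_labels[nearest])
  names(votes)[which.max(votes)]
}

# Tiny made-up training set: two features, two classes
train_x <- rbind(c(1, 1), c(2, 1), c(8, 8), c(9, 7))
train_y <- c("A", "A", "B", "B")
knn_predict(train_x, train_y, c(1.5, 1.5), k = 3)  # two of the three nearest are "A"
```

Real applications would use an optimized implementation (e.g. class::knn), but the logic is exactly this: compute distances, take the k smallest, vote.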

The k-NN algorithm


• Dataset: Blind tasting experience:
• Only two features of each ingredient are recorded: (1) a measure from 1 to 10 of how crunchy the ingredient is, and (2) a measure from 1 to 10 of how sweet the ingredient tastes.
• We then labeled each ingredient as one of three types of food:
fruits, vegetables, or proteins.

ingredient   sweetness   crunchiness   food type
apple            10            9       fruit
bacon             1            4       protein
banana           10            1       fruit
carrot            7           10       vegetable
celery            3           10       vegetable
cheese            1            1       protein

The k-NN algorithm


• The k-NN algorithm treats the features as coordinates in a
multidimensional feature space:

The k-NN algorithm


• Similar types of food tend to be grouped closely together:

The k-NN algorithm


• Use k-NN to settle the age-old question: is a tomato a fruit
or a vegetable?

The k-NN algorithm


• Locating the tomato's nearest neighbors requires a
distance function, or a formula that measures the similarity
between two instances.
• Traditionally, the k-NN algorithm uses Euclidean distance:

dist(X1, X2) = √((x11 − x21)² + (x12 − x22)² + … + (x1n − x2n)²)

• where X1 and X2 are the examples to be compared, each having n features. The term x11 refers to the value of the first feature of example X1, while x21 refers to the value of the first feature of example X2.
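
This formula translates directly into R; the helper name `euclidean` is ours, not from any package:

```r
# Euclidean distance between two feature vectors of equal length
euclidean <- function(x1, x2) sqrt(sum((x1 - x2)^2))

euclidean(c(6, 4), c(3, 7))  # sqrt(3^2 + (-3)^2) = sqrt(18) ≈ 4.24
```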

The k-NN algorithm


• For example, to calculate the distance between the tomato
(sweetness = 6, crunchiness = 4), and the green bean
(sweetness = 3, crunchiness = 7), we can use the formula
as follows:

dist(tomato, green bean) = √((6 − 3)² + (4 − 7)²) = √18 ≈ 4.2

ingredient   sweetness   crunchiness   food type   distance to the tomato
grape             8            5       fruit               2.2
green bean        3            7       vegetable           4.2
nuts              3            6       protein             3.6
orange            7            3       fruit               1.4
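
The distance column above can be reproduced in a couple of lines of R (the vectors simply restate the table's values):

```r
# Recompute the "distance to the tomato" column from the table
tomato <- c(6, 4)  # (sweetness, crunchiness)
neighbors <- rbind(grape = c(8, 5), green_bean = c(3, 7),
                   nuts = c(3, 6), orange = c(7, 3))
dists <- sqrt(rowSums(sweep(neighbors, 2, tomato)^2))
round(dists, 1)  # grape 2.2, green_bean 4.2, nuts 3.6, orange 1.4
```

With k = 1 the tomato's nearest neighbor is the orange, so it would be labeled a fruit; with k = 3 the vote (orange, grape, nuts) is still two fruits to one protein.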

Distance measures

Abu Alfeilat HA, Hassanat ABA, Lasassmeh O, Tarawneh AS, Alhasanat MB, Eyal Salman HS, Prasath VBS. Effects of Distance Measure Choice
on K-Nearest Neighbor Classifier Performance: A Review. Big Data. 2019 Dec;7(4):221-248. doi: 10.1089/big.2018.0175. Epub 2019 Aug 14.
PMID: 31411491.

Choosing an appropriate k
• The balance between underfitting and overfitting the
training data is a problem known as the bias-variance
tradeoff.
• A small k fits the training data closely (low bias, high variance), while a large k averages over more neighbors and smooths the decision boundary (higher bias, lower variance). A common rule of thumb is to start with an odd k near the square root of the number of training examples.

Preparing data for use with k-NN


• Features are typically transformed to a standard range
(normalized) prior to applying the k-NN algorithm.
• The rationale for this step is that the distance formula is
dependent on how features are measured.
• In particular, if certain features have much larger values
than others, the distance measurements will be strongly
dominated by the larger values.
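
A common choice is min-max normalization, sketched below (the `normalize` helper is our own illustrative name):

```r
# Min-max normalization: rescale a feature to [0, 1] so that
# features measured on large scales do not dominate the distance
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

normalize(c(1, 2, 3, 6))  # 0.0  0.2  0.4  1.0
```

Note that the minimum and maximum must come from the training data and be reused to rescale the test data, so both sets live on the same scale.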

Preparing data for use with k-NN


• Qualitative features:
  • Nominal: dummy coding
  • Ordinal: transform to numeric values that preserve the ordering
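
Both transformations can be sketched in base R; the feature names below are made up for illustration:

```r
# Nominal feature -> one 0/1 dummy column per category
color <- c("red", "green", "red", "blue")
dummies <- sapply(unique(color), function(v) as.integer(color == v))
dummies  # 4 rows, one indicator column each for red, green, blue

# Ordinal feature -> integer codes that preserve the ordering
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"), ordered = TRUE)
as.integer(size)  # 1 3 2
```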

The k-NN algorithm


STRENGTHS
• Simple and effective
• Makes no assumptions about the underlying data distribution
• Fast training phase

WEAKNESSES
• Does not produce a model, which limits the ability to find novel insights in relationships among features
• Slow classification phase
• Requires a large amount of memory
• Qualitative features and missing data require additional processing

K-NN
R session

Exercise 1
• Create a function my_knn that accepts any measure from
the philentropy package and performs basic k-NN.
• A possible function interface could be:

my_knn <- function(train, train_labels, test, k = 1, metric = "euclidean")

• The function will output the predictions over the test set.

• Select two distance/similarity measures and apply the my_knn function to each of them with different k choices for the breast cancer data, and compare the results (try using a plot).
