Classification Methods I

The document discusses the k-nearest neighbors (k-NN) classification algorithm. It explains how k-NN works by finding the k closest training examples in feature space and assigning a label based on the majority vote of those neighbors. The document discusses choosing k to balance bias and variance, preparing data for k-NN including normalizing features, and the strengths and weaknesses of the approach.

Máster Universitario Oficial en Ciencia de Datos e Ingeniería de Computadores

CLASSIFICATION
Introducción a la Ciencia de Datos

Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with Applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

Some slides are based on Abbass Al Sharif's slides for his course DSO 530: Applied Modern Statistical Learning Techniques.

Overview of Classification
• Examples:
① An online banking service must be able to determine
whether or not a transaction being performed on the site
is fraudulent, on the basis of the user’s IP address, past
transaction history, and so on.
② A person arrives at the emergency room with a set of
symptoms that could possibly be attributed to one of
three medical conditions. Which of the three conditions
does the individual have?
③ On the basis of DNA sequence data for a number of
patients with and without a given disease, a biologist
would like to figure out which DNA mutations are
deleterious (disease-causing) and which are not.

Overview of Classification
• We will always assume that we have observed a set of n
different data points. These observations are called the
training data because we will use these observations to
train, or teach, our method how to estimate f.

• Let xij represent the value of the jth predictor, or input, for
observation i, where i = 1, 2, …, n and j = 1, 2, …, p.

• Correspondingly, let yi represent the response variable for the ith observation. Then our training data consist of {(x1, y1), (x2, y2), …, (xn, yn)} where xi = (xi1, xi2, …, xip)T.

Overview of Classification
• Our goal is to apply a classification method to the
training data in order to estimate the unknown
function f.
• In other words, we want to find a function f such
that Y ≈ f(X) for any observation (X,Y).
• In general, we do not really care how well the
classification method works training on the training
data. Rather, we are interested in the accuracy of
the predictions that we obtain when we apply our
method to previously unseen test data.

K-NN
Classification Methods

Understanding classification using NN


• Nearest Neighbor (NN) Classifiers are defined by their
characteristic of classifying unlabeled examples by
assigning them the class of the most similar labeled
examples.
• NN classifiers are well-suited for classification tasks where
relationships among the features and the target classes
are difficult to understand, yet the items of similar class
type tend to be fairly homogeneous.
• If there is not a clear distinction among the groups, the
algorithm is by and large not well-suited for identifying the
boundary.

The k-NN algorithm


• The k-NN algorithm begins with a training dataset
containing examples that are classified into several
categories, as labeled by a nominal variable.
• Assume that we have a test dataset containing unlabeled
examples that otherwise have the same features as the
training data.
• For each record in the test dataset, k-NN identifies k
records in the training data that are the "nearest" in
distance/similarity, where k is an integer specified in
advance.
• The unlabeled test instance is assigned the class of the
majority of the k nearest neighbors.
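
The steps above can be sketched directly in R. This is a minimal illustration written for these slides, not library code: the function name `knn_predict` and the toy data are invented here for the example.

```r
# Minimal k-NN sketch (illustrative, not a library function):
# classify one unlabeled instance by majority vote among the
# k nearest training examples under Euclidean distance.
knn_predict <- function(train, train_labels, test_point, k = 3) {
  # distance from the test point to every training example
  dists <- sqrt(rowSums(sweep(train, 2, test_point)^2))
  # indices of the k nearest training examples
  nearest <- order(dists)[1:k]
  # majority vote among their labels
  votes <- table(train_labels[nearest])
  names(votes)[which.max(votes)]
}

# Tiny made-up training set: two features, two classes
train_x <- rbind(c(1, 1), c(2, 1), c(8, 8), c(9, 7))
train_y <- c("A", "A", "B", "B")
knn_predict(train_x, train_y, c(1.5, 1.5), k = 3)  # two of the three nearest are "A"
```

Real applications would use an optimized implementation (e.g. class::knn), but the logic is exactly this: compute distances, take the k smallest, vote.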

The k-NN algorithm


• Dataset: Blind tasting experience:
• Only two features of each ingredient are recorded: (1) a measure from 1 to 10 of how crunchy the ingredient is, and (2) a measure from 1 to 10 of how sweet the ingredient tastes.
• We then labeled each ingredient as one of three types of food:
fruits, vegetables, or proteins.

ingredient   sweetness   crunchiness   food type
apple            10            9       fruit
bacon             1            4       protein
banana           10            1       fruit
carrot            7           10       vegetable
celery            3           10       vegetable
cheese            1            1       protein

The k-NN algorithm


• The k-NN algorithm treats the features as coordinates in a
multidimensional feature space:

The k-NN algorithm


• Similar types of food tend to be grouped closely together:

The k-NN algorithm


• Use k-NN to settle the age-old question: is a tomato a fruit
or a vegetable?

The k-NN algorithm


• Locating the tomato's nearest neighbors requires a
distance function, or a formula that measures the similarity
between two instances.
• Traditionally, the k-NN algorithm uses Euclidean distance:

dist(X1, X2) = √((x11 − x21)² + (x12 − x22)² + … + (x1n − x2n)²)

• where X1 and X2 are the examples to be compared, each having n features. The term x11 refers to the value of the first feature of example X1, while x21 refers to the value of the first feature of example X2.
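
This formula translates directly into R; the helper name `euclidean` is ours, not from any package:

```r
# Euclidean distance between two feature vectors of equal length
euclidean <- function(x1, x2) sqrt(sum((x1 - x2)^2))

euclidean(c(6, 4), c(3, 7))  # sqrt(3^2 + (-3)^2) = sqrt(18) ≈ 4.24
```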

The k-NN algorithm


• For example, to calculate the distance between the tomato
(sweetness = 6, crunchiness = 4), and the green bean
(sweetness = 3, crunchiness = 7), we can use the formula
as follows:

dist(tomato, green bean) = √((6 − 3)² + (4 − 7)²) = √18 ≈ 4.2

ingredient   sweetness   crunchiness   food type   distance to the tomato
grape             8            5       fruit               2.2
green bean        3            7       vegetable           4.2
nuts              3            6       protein             3.6
orange            7            3       fruit               1.4
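
The distance column above can be reproduced in a couple of lines of R (the vectors simply restate the table's values):

```r
# Recompute the "distance to the tomato" column from the table
tomato <- c(6, 4)  # (sweetness, crunchiness)
neighbors <- rbind(grape = c(8, 5), green_bean = c(3, 7),
                   nuts = c(3, 6), orange = c(7, 3))
dists <- sqrt(rowSums(sweep(neighbors, 2, tomato)^2))
round(dists, 1)  # grape 2.2, green_bean 4.2, nuts 3.6, orange 1.4
```

With k = 1 the tomato's nearest neighbor is the orange, so it would be labeled a fruit; with k = 3 the vote (orange, grape, nuts) is still two fruits to one protein.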

Distance measures

Abu Alfeilat HA, Hassanat ABA, Lasassmeh O, Tarawneh AS, Alhasanat MB, Eyal Salman HS, Prasath VBS. Effects of Distance Measure Choice
on K-Nearest Neighbor Classifier Performance: A Review. Big Data. 2019 Dec;7(4):221-248. doi: 10.1089/big.2018.0175. Epub 2019 Aug 14.
PMID: 31411491.

Choosing an appropriate k
• The balance between underfitting and overfitting the
training data is a problem known as the bias-variance
tradeoff.
• A small k fits the training data closely (low bias, high variance), while a large k averages over more neighbors and smooths the decision boundary (higher bias, lower variance). A common rule of thumb is to start with an odd k near the square root of the number of training examples.

Preparing data for use with k-NN


• Features are typically transformed to a standard range
(normalized) prior to applying the k-NN algorithm.
• The rationale for this step is that the distance formula is
dependent on how features are measured.
• In particular, if certain features have much larger values
than others, the distance measurements will be strongly
dominated by the larger values.
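
A common choice is min-max normalization, sketched below (the `normalize` helper is our own illustrative name):

```r
# Min-max normalization: rescale a feature to [0, 1] so that
# features measured on large scales do not dominate the distance
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

normalize(c(1, 2, 3, 6))  # 0.0  0.2  0.4  1.0
```

Note that the minimum and maximum must come from the training data and be reused to rescale the test data, so both sets live on the same scale.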

Preparing data for use with k-NN


• Qualitative features:
  • Nominal: dummy coding
  • Ordinal: transform to numeric values that preserve the ordering
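
Both transformations can be sketched in base R; the feature names below are made up for illustration:

```r
# Nominal feature -> one 0/1 dummy column per category
color <- c("red", "green", "red", "blue")
dummies <- sapply(unique(color), function(v) as.integer(color == v))
dummies  # 4 rows, one indicator column each for red, green, blue

# Ordinal feature -> integer codes that preserve the ordering
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"), ordered = TRUE)
as.integer(size)  # 1 3 2
```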

The k-NN algorithm


STRENGTHS
• Simple and effective
• Makes no assumptions about the underlying data distribution
• Fast training phase

WEAKNESSES
• Does not produce a model, which limits the ability to find novel insights in relationships among features
• Slow classification phase
• Requires a large amount of memory
• Qualitative features and missing data require additional processing

K-NN
R session

Exercise 1
• Create a function my_knn that accepts any measure from
the philentropy package and performs basic k-NN.
• A possible function interface could be:

my_knn <- function(train, train_labels, test, k = 1, metric = "euclidean")

• The function will output the predictions over the test set.

• Select two distance/similarity measures and apply the my_knn function to each of them with different k choices for the breast cancer data, and compare the results (try using a plot).
