KNN ALGORITHM IN MACHINE LEARNING

K-Nearest Neighbors is a lazy learning algorithm that classifies new data points based on the majority class of its k nearest neighbors. It requires storing all training data and calculating distances between new and stored points, making it computationally expensive for large datasets. Preprocessing techniques like dimensionality reduction and attribute weighting can help address the "curse of dimensionality" issue KNN faces with high-dimensional data.

K-Nearest Neighbors

Day 4
Introduction
At its core, the algorithm works as follows (see the sketch after this list):

• Pick the number of neighbors (k) you want to use for classification or regression
• Choose a method to measure distances
• Keep a data set with labeled records
• For every new point, identify the k nearest neighbors using the distance measure you chose
• Let them vote if it is a classification problem, or take a mean/median for regression
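
The following is a minimal from-scratch sketch of those steps, assuming Python with NumPy and Euclidean distance; the function name and toy data are illustrative, not part of the original slides.

    import numpy as np

    def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
        """Predict the label/value of x_new from its k nearest neighbors."""
        # Distance from the new point to every stored training point
        distances = np.linalg.norm(X_train - x_new, axis=1)
        # Indices of the k closest training points
        nearest = np.argsort(distances)[:k]
        neighbor_labels = y_train[nearest]
        if task == "classification":
            # Majority vote among the k neighbors
            values, counts = np.unique(neighbor_labels, return_counts=True)
            return values[np.argmax(counts)]
        # Regression: average the neighbors' target values
        return neighbor_labels.mean()

    # Toy example: two features, binary labels
    X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0

Note that nothing is "learned" here: the training data is simply stored and all the work happens when a prediction is requested.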
Diagrammatically
• Here, N = 1: the new green point is labeled black because its single nearest neighbor is black.
• Here, N = 3: the new green point is labeled white based on the vote of its three nearest neighbors.
From the algorithm, it is clear that KNN is

• Lazy: This is a technical term! All the techniques we have seen so far have a "training phase" in which they try to identify a function from the training set, and then apply that function to the test data. Such learning is called "eager learning". K-NN, on the other hand, does not generalize: it uses all the training data (or a subset of it) during the testing phase. This type of learning is called lazy learning or instance-based learning.
• K-NN requires more time at prediction, as all data points are needed to make a decision.
• It requires more memory, as all training data must be stored. Each query costs on the order of N·d distance computations, where N is the number of training examples and d is the dimension of each sample, so it is very expensive for large data sets and high dimensions.
• Hence, a lot of effort must be spent on reducing N and d. K-NN does suffer from the curse of dimensionality.
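
As a rough illustration of that N·d query cost (my own sketch, not from the slides; the sizes are arbitrary):

    import numpy as np

    N, d = 100_000, 50                  # training set size and dimensionality
    X_train = np.random.rand(N, d)      # all N*d values must stay in memory
    x_new = np.random.rand(d)

    # One query touches every stored example: N distances, each over d features
    distances = np.linalg.norm(X_train - x_new, axis=1)
    k_nearest = np.argsort(distances)[:5]

Doubling either N or d roughly doubles the work per prediction, which is why reducing both matters so much for K-NN.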
Attributes
Handling the curse of dimensionality
• K-NN is heavily impacted by a huge number of dimensions
• Reduce the dimensions using:
  – Correlation, Principal Component Analysis
  – Gain Ratio, Information Gain (filter approaches: we may lose some attributes that are important)
  – Wrapper methods (forward selection, backward elimination)
  – Weighting attributes
• Scale the attributes
  – Attributes with a larger range can dominate the distance
To understand this, consider the pair of data points (0.1, 20) and (0.9, 720).
• The distance is almost completely dominated by (720 − 20) = 700. To avoid this, we standardize the attributes to force them onto a common value range (see the sketch after this list). The common techniques include:
• Taking logarithms when a variable varies over several orders of magnitude
• Dividing by the highest value to bring a variable between 0 and 1
• Standardizing (z-scores), which brings most of the data between −3 and 3
• Categorical and ordinal variables need to be converted to numeric
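
A minimal sketch of those two scaling options on the toy pair above (NumPy is an assumed tooling choice, not something the slides prescribe):

    import numpy as np

    X = np.array([[0.1,  20.0],
                  [0.9, 720.0]])

    # Divide each column by its largest value -> values roughly in [0, 1]
    X_minmax = X / X.max(axis=0)

    # Z-score standardization: zero mean, unit variance per column
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # After scaling, the second attribute no longer dominates the distance
    print(np.linalg.norm(X_minmax[0] - X_minmax[1]))
    print(np.linalg.norm(X_std[0] - X_std[1]))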
• Handling missing values
  – K-NN is heavily impacted by missing values
  – Imputation is one option (see the sketch after this list)
• Handling overfitting
  – Remove outliers (Wilson Editing)
• Speeding up KNN
  – Condensation
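
One way to do that imputation, sketched here with scikit-learn's KNNImputer (an assumed library choice; the slides do not name a specific tool):

    import numpy as np
    from sklearn.impute import KNNImputer

    X = np.array([[1.0, 2.0],
                  [2.0, np.nan],   # missing value to be filled in
                  [3.0, 6.0],
                  [4.0, 8.0]])

    # Each missing entry is replaced by the mean of that feature
    # over the k nearest rows that have the value present
    imputer = KNNImputer(n_neighbors=2)
    X_filled = imputer.fit_transform(X)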
Feature Engineering
• Library (class)
• kNN produces complex decision surfaces.
• As the complexity of the decision surface increases, accuracy decreases and we need more data.
• Increase k to decrease the over-fit.
• kNN gives no explicability.
• kNN is a distance-based method, so it handles only numeric variables; convert categorical/ordinal values into numerical ones.
• kNN works well in batch mode, not in real time.
• kNN fails when there are missing values (use kNN imputation in data pre-processing to fill them in).
• In kNN, training is easy but predictions are the expensive part (see the library sketch after this list).
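
A minimal end-to-end sketch using scikit-learn's KNeighborsClassifier (an assumed library choice, with an illustrative dataset) showing the cheap "fit", the scaling step, and where k is set:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Scale features so no single attribute dominates the distance
    scaler = StandardScaler().fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

    # "Training" only stores the data; a larger n_neighbors smooths the decision surface
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)

    # Prediction is where the distance computations (the real work) happen
    print(model.score(X_test, y_test))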
