Unit 3 (Classification)
Algorithm Selection: This involves determining the structure of the learning function and the corresponding
learning algorithm. This is the most critical step in building a supervised learning model. On the basis of various
parameters, the best algorithm for the given problem is chosen.
Training: The learning algorithm identified in the previous step is run on the gathered training set for further
fine-tuning. Some supervised learning algorithms require the user to set specific control parameters, which are
given as inputs to the algorithm. These parameters may also be adjusted by optimizing performance on a subset
of the training set called the validation set.
Evaluation with the Test Data Set: The test data is run through the trained model, and the model's performance is
measured here. If a suitable result is not obtained, further tuning of the parameters may be required.
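The training/validation split described above can be sketched as follows. This is a minimal illustration with a made-up toy data set; the 80/20 split ratio and the variable names are assumptions, not part of the original text.

```python
import random

# Toy labelled data set: each record is (feature_1, feature_2, class_label).
data = [(i, i * 2, i % 2) for i in range(100)]

# Hold out 20% of the training records as a validation set.
# The validation set is used to tune control parameters
# (such as 'k' in kNN) without touching the final test data.
random.seed(42)
random.shuffle(data)
split = int(len(data) * 0.8)
train_set, validation_set = data[:split], data[split:]

print(len(train_set), len(validation_set))  # 80 20
```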
COMMON CLASSIFICATION ALGORITHMS
The kNN algorithm raises two questions:
1. What is the basis of this similarity, i.e. when can we say that two data elements are similar?
2. How many similar elements should be considered for deciding the class label of each test
data element?
For the first question, the most common approach adopted by kNN to measure similarity between two data
elements is Euclidean distance.
Considering a very simple data set having two features (say f1 and f2), the Euclidean distance between two data
elements d1 and d2 can be measured by

distance(d1, d2) = sqrt((f11 − f12)^2 + (f21 − f22)^2)

where f11 = value of feature f1 for data element d1
f12 = value of feature f1 for data element d2
f21 = value of feature f2 for data element d1
f22 = value of feature f2 for data element d2
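The Euclidean distance calculation can be sketched in a few lines. The function name and the example points are illustrative choices, not from the original text; the formula itself is the one defined above.

```python
from math import sqrt

def euclidean_distance(d1, d2):
    """Euclidean distance between two data elements given as feature tuples.

    For two-feature elements d1 = (f11, f21) and d2 = (f12, f22),
    this computes sqrt((f11 - f12)^2 + (f21 - f22)^2).
    """
    return sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

# Classic 3-4-5 triangle: the distance from (3, 4) to the origin is 5.
print(euclidean_distance((3, 4), (0, 0)))  # 5.0
```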
The answer to the second question, i.e. how many similar elements should be considered, lies in the value of
‘k’, which is a user-defined parameter given as an input to the algorithm.
In the kNN algorithm, the value of ‘k’ indicates the number of neighbours that need to be
considered.
For example, if the value of k is 3, only three nearest neighbours or three training data
elements closest to the test data element are considered.
Out of these three data elements, the predominant class is assigned as the class label of the test data. If the
value of k is 1, only the closest training data element is considered, and its class label is directly assigned to
the test data element.
But it is often a tricky decision to decide the value of k. The reasons are as follows:
If the value of k is very large (in the extreme case equal to the total number of records in the training data),
the class label of the majority class of the training data set will be assigned to the test data regardless of the
class labels of the neighbours nearest to the test data.
If the value of k is very small (in the extreme case equal to 1), the class value of a noisy data or outlier in the
training data set which is the nearest neighbour to the test data will be assigned to the test data.
The best k value is somewhere between these two extremes.
A few strategies, highlighted below, are adopted by machine learning practitioners to arrive at a value for k.
One common practice is to set k equal to the square root of the number of training records.
An alternative approach is to test several k values on a variety of test data sets and choose the one that
delivers the best performance.
Another interesting approach is to choose a larger value of k, but apply a weighted voting process in which the
vote of close neighbours is considered more influential than the vote of distant neighbours.
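The weighted voting strategy in the last bullet can be sketched as below. The inverse-distance weighting scheme (1/distance) and the small epsilon guard are common illustrative choices, not something the original text prescribes.

```python
from collections import defaultdict

def weighted_vote(neighbours):
    """Pick a class label from a list of (distance, class_label) pairs
    for the k nearest neighbours. Each vote is weighted by 1/distance,
    so closer neighbours are more influential than distant ones."""
    scores = defaultdict(float)
    for dist, label in neighbours:
        # A tiny epsilon avoids division by zero for an exact match.
        scores[label] += 1.0 / (dist + 1e-9)
    return max(scores, key=scores.get)

# One very close 'A' neighbour outvotes two distant 'B' neighbours,
# even though 'B' holds the simple majority among k = 3.
print(weighted_vote([(0.1, 'A'), (2.0, 'B'), (2.5, 'B')]))  # A
```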
kNN algorithm
Input: Training data set, test data set (or data points), value of ‘k’ (i.e. number of nearest neighbours to be
considered)
Steps:
1. Calculate the distance (usually Euclidean distance) of the test data point from the different training data points.
2. Find the closest ‘k’ training data points, i.e. the training data points whose distances from the test data point are the least.
3. If k = 1
      Then assign the class label of that training data point to the test data point
   Else
      Assign the class label predominantly present among the ‘k’ training data points to the test data point
   End if
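The steps above can be put together into a small working sketch. The function name, the toy training set, and the 'red'/'blue' labels are illustrative assumptions; the logic follows the algorithm as stated: compute distances, take the k nearest, and assign the predominant class.

```python
from collections import Counter
from math import sqrt

def knn_classify(train, test_point, k):
    """train: list of (features, class_label) pairs; test_point: feature tuple.

    Step 1: compute the Euclidean distance to every training point.
    Step 2: keep the k nearest training points.
    Step 3: return the predominant class label among them
            (with k = 1 this is simply the closest point's label).
    """
    distances = sorted(
        (sqrt(sum((a - b) ** 2 for a, b in zip(feats, test_point))), label)
        for feats, label in train
    )
    k_nearest = [label for _, label in distances[:k]]
    return Counter(k_nearest).most_common(1)[0][0]

# Two 'red' points near the origin, two 'blue' points further away.
train = [((1, 1), 'red'), ((1, 2), 'red'), ((5, 5), 'blue'), ((6, 5), 'blue')]
print(knn_classify(train, (1.5, 1.5), 3))  # red
```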
Why is the kNN algorithm called a lazy learner?
Eager learners follow the general steps of machine learning, i.e. they perform an abstraction of
the information obtained from the input data and then follow it with a generalization
step. However, as we have seen, the kNN algorithm skips these steps completely.
It simply stores the training data and directly applies the philosophy of nearest-neighbour
finding to arrive at the classification. So, for kNN, no learning happens
in the real sense. Therefore, kNN falls under the category of lazy learners.