
Lecture 1 Clustering PDF

1. The course aims to provide an understanding of machine learning techniques including data handling, visualization, supervised and unsupervised learning. 2. Students will learn skills like clustering, association rule learning, and reinforcement learning and how to apply these techniques to solve real-life problems. 3. The objectives are to understand basic learning algorithms, analyze large datasets, implement machine learning models, and build intelligent systems to make automated decisions.

Uploaded by

Pika Xavier

4/28/2023

APEX INSTITUTE OF TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

MACHINE LEARNING (21CSH-286)
Faculty: Prof. (Dr.) Vineet Mehan (E13038)

Lecture – 1
Clustering
DISCOVER . LEARN . EMPOWER

COURSE OBJECTIVES

The course aims to:
1. Understand and apply various data handling and visualization techniques.
2. Understand some basic learning algorithms and techniques and their applications, as well as general questions related to analysing and handling large data sets.
3. Develop skills in supervised and unsupervised learning techniques and implement them to solve real-life problems.
4. Develop basic knowledge of machine learning techniques to build an intelligent machine that makes decisions on behalf of humans.
5. Develop skills for selecting suitable model parameters and applying them to design optimized machine learning applications.

COURSE OUTCOMES

On completion of this course, the students shall be able to:
CO3: Identify and implement simple learning strategies using data science and statistics principles.
CO4: Evaluate machine learning model's performance and apply learning strategy to improve the performance of supervised and unsupervised learning model.

Unit-3 Syllabus

Unit-3: Unsupervised Learning
• Clustering: Types of Clustering: Centroid-based clustering, Density-based clustering, Distribution-based Clustering and Hierarchical clustering; K-Means Clustering, KNN (K-Nearest Neighbours), DBSCAN clustering algorithm; Performance metrics for clustering: Silhouette Score.
• Association Rule Learning: Apriori algorithm, F-P Growth Algorithm, Applications of Association Rule Learning, Market Basket Analysis.
• Reinforcement Learning: Types of Reinforcement learning, Key Features of Reinforcement Learning, Elements of Reinforcement Learning, Applications of Reinforcement Learning.

SUGGESTIVE READINGS

• TEXT BOOKS:
• There is no single textbook covering the material presented in this course. Here is a list of books recommended for further reading in connection with the material presented:
• T1: Tom M. Mitchell, "Machine Learning", McGraw Hill International Edition.
• T2: Ethem Alpaydin, "Introduction to Machine Learning", Eastern Economy Edition, Prentice Hall of India, 2005.
• T3: Andreas C. Müller, Sarah Guido, "Introduction to Machine Learning with Python", O'Reilly (2016).

• REFERENCE BOOKS:
• R1: Sebastian Raschka, Vahid Mirjalili, "Python Machine Learning" (2014).
• R2: Richard O. Duda, Peter E. Hart, David G. Stork, "Pattern Classification", Wiley, 2nd Edition.
• R3: Christopher Bishop, "Pattern Recognition and Machine Learning", illustrated Edition, Springer, 2006.

Index

• Clustering
• Types of Clustering
• Applications


Clustering

• Clustering → To cluster the data.
• How?
• Similar kinds of data are put together to form a cluster.

Real Life Example

[Figure: List of 15 Brightest Star Clusters. Data with nearly similar characteristics.]

Example

[Figure: unlabeled data grouped into Cluster 1, Cluster 2 and Cluster 3]
Grouping unlabeled data is called clustering.

Practical Example

• Search on Google
• Buy a product on Amazon
• The links or products that are relevant to the search are then shown by means of clustering.
• Idea: Groups of similar objects are made.

Clustering meaning

• Clustering does not need a response class, unlike classification, which needs a response class.
• In classification → the dataset has a response class.
• In clustering → there is no response class.
• After grouping → visually look at the clusters → and optionally associate a meaning with each cluster.

Example

[Figure: unlabeled data grouped into three clusters: a cluster of stars, a cluster of circles and a cluster of diamonds]


Clustering

• Prediction in clustering → is the set of clusters themselves.
• But data must be in numeric form.
• If it is in any other form, convert the data into numeric form (Label Encoding).

Types of Clustering

1. Centroid-based Clustering
2. Density-based Clustering
3. Distribution-based Clustering
4. Hierarchical Clustering
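
The label-encoding step mentioned for non-numeric data can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the helper name `label_encode` and the colour values are assumptions, and codes are simply assigned in sorted order of the distinct labels.

```python
def label_encode(values):
    """Map each distinct label to an integer code (hypothetical helper)."""
    classes = sorted(set(values))                      # distinct labels, sorted
    mapping = {label: code for code, label in enumerate(classes)}
    encoded = [mapping[v] for v in values]             # replace labels by codes
    return encoded, mapping


colours = ["red", "green", "blue", "green"]            # made-up sample column
encoded, mapping = label_encode(colours)
print(encoded)   # [2, 1, 0, 1]
print(mapping)   # {'blue': 0, 'green': 1, 'red': 2}
```

Note that the integer codes carry an artificial ordering (blue < green < red), which distance-based clustering will treat as meaningful, so the encoding should be chosen with care.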

1. Centroid-based Clustering

• Centroid → Center
• Clusters are formed according to the centroid.
• How?
• The distance of the data points to their centroid should be minimal.
• E.g. the K-means algorithm is one of the popular examples of this approach.
(K → number of clusters, to be defined by the user)

K – means algorithm

• K is chosen (i.e. the number of clusters to be made, e.g. K = 3).
• Centroids are placed randomly.
• Iteration: each iteration reduces the aggregate (mean) intra-cluster distance, and the clusters can change from one iteration to the next.
• After multiple iterations, the centroid positions that have the minimum distance to the data points are identified.
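
The K-means steps listed above (choose K, place centroids, iterate) can be sketched in plain Python. This is a minimal sketch under stated assumptions: the function name `k_means`, the `initial` parameter for fixing the starting centroids, and the sample points are all illustrative and not part of the lecture; when `initial` is omitted, K data points are picked at random as the starting centroids.

```python
import math
import random


def k_means(points, k, initial=None, max_iter=100, seed=0):
    """Cluster points into k groups (illustrative sketch).

    `initial` (assumed parameter) fixes the starting centroids;
    otherwise k data points are chosen at random.
    """
    rng = random.Random(seed)
    centroids = [tuple(c) for c in initial] if initial else rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, clusters


# Two well-separated groups of made-up points.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
centroids, clusters = k_means(points, k=2, initial=[(1, 1), (9, 9)])
print(centroids)  # [(1.5, 1.5), (8.5, 8.5)]
```

A different choice of starting centroids can converge to a different, possibly worse, grouping, which is why the result depends on the initial placement.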

K – means algorithm

[Figure: illustration of the K-means algorithm]

1. Centroid-based Clustering

[Figure: two clusters found by a centroid algorithm, with the centroid of Cluster 1 and the centroid of Cluster 2 marked]


1. Centroid-based Clustering

[Figure: four clusters found by a centroid algorithm]

1. Centroid-based Clustering

• Centroid-based algorithms are efficient but sensitive to initial conditions and outliers.
• Initial conditions:
• The choice of initial seeds affects both the speed and the quality of the clustering.
• Each iteration improves the centroid positions, starting from the previous centroids.

Outliers

• An outlier is an observation that lies far away and diverges from the overall pattern in a sample.
• Outliers in input data can skew and mislead the training process of machine learning algorithms.
• This results in longer training times, less accurate models and ultimately poorer results.

Reasons for Outliers

• Experimental errors (data extraction or experiment planning/executing errors)
• Measurement errors (instrument errors)
• Data entry errors (human errors)
• Intentional (dummy outliers made to test detection methods)
• Data processing errors (data manipulation errors)
• Sampling errors (extracting or mixing data from wrong or various sources)
• Natural (not an error, novelties in data)
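
To illustrate why centroid-based methods are sensitive to outliers, the short sketch below computes the centroid (mean point) of a small group with and without one far-away observation. The helper name `centroid` and the data values are made up for illustration.

```python
def centroid(points):
    """Mean of a list of points, computed axis by axis."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


group = [(1, 1), (2, 2), (3, 3)]          # made-up, tightly grouped points
with_outlier = group + [(100, 100)]       # one far-away observation added

clean_centre = centroid(group)
skewed_centre = centroid(with_outlier)
print(clean_centre)    # (2.0, 2.0)
print(skewed_centre)   # (26.5, 26.5)
```

A single outlier moves the centroid far from every regular point, so cluster assignments based on distance to that centroid become misleading.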

2. Density-based Clustering

• Density-based clustering connects areas of high density (concentrated density) into clusters.
• This allows for arbitrary-shaped distributions as long as dense areas can be connected.

2. Density-based Clustering

[Figure: density-based clustering connects areas of high density and allows arbitrary-shaped distributions]


2. Density-based Clustering

• These algorithms have difficulty with data of varying densities and high dimensions.
• Further, by design, these algorithms do not assign outliers to clusters.

2. Density-based Clustering

[Figure: outliers not assigned to any cluster]
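
The two ideas above (dense areas are connected into clusters, outliers are left unassigned) can be sketched with a simplified version of DBSCAN, the density-based algorithm named in the Unit-3 syllabus. This is a minimal sketch, not a reference implementation: the parameter names `eps` (neighbourhood radius) and `min_pts` (minimum neighbours for a dense point), the label -1 for outliers, and the sample points are assumptions made for illustration.

```python
import math

NOISE = -1  # label used here for points not assigned to any cluster


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)          # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:      # not dense enough: mark as outlier
            labels[i] = NOISE
            continue
        cluster += 1                       # start a new cluster from a dense point
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:         # border point: joins, but is not expanded
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:   # dense point: keep connecting
                queue.extend(j_neighbours)
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]                          # the last point is isolated
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point keeps the label -1, matching the statement that outliers are not assigned to a cluster.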

Pre-requisite

• Data can be "distributed" (spread out) in different ways.
• It can be spread out more on the left
• It can be spread out more on the right
• It can be jumbled

Pre-requisite

• But there are many cases where the data tends to be around a central value with no bias left or right, and it gets close to a "Normal Distribution" like this:

[Figure: The blue curve is a Normal Distribution. The yellow histogram shows some data that follows it closely, but not perfectly (which is usual).]

Pre-requisite

• We say the data is "normally distributed":
• The Normal Distribution has:
• mean = median = mode
• symmetry about the center
• 50% of values less than the mean
• and 50% of values greater than the mean

Standard Normal Distribution

[Figure: standard normal distribution curve]
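
The listed properties of the Normal Distribution can be checked approximately on simulated data. The sketch below draws samples from a standard normal distribution (mean 0, standard deviation 1) and compares the mean, the median and the share of values below the mean. The sample size, the seed and the helper name `summarize` are arbitrary choices for illustration; the mode is left out because it is not meaningful for a finite sample of continuous values.

```python
import random
import statistics


def summarize(samples):
    """Return the mean, the median, and the fraction of samples below the mean."""
    mean = statistics.fmean(samples)
    median = statistics.median(samples)
    below = sum(1 for s in samples if s < mean) / len(samples)
    return mean, median, below


rng = random.Random(0)                                  # fixed seed for repeatability
samples = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # standard normal draws
mean, median, below = summarize(samples)

# For normally distributed data, mean and median should be close,
# and roughly half of the values should fall below the mean.
print(round(mean, 2), round(median, 2), round(below, 2))
```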


Z Score

• A Z-score is a statistical measurement of a score's relationship to the mean in a group of scores.
• In general, a Z-score between -3.0 and 3.0 means that the value lies within three standard deviations of the mean (for example, a stock trading within three standard deviations of its mean).

Z Score

• The z-score of a value is calculated using the following formula:
• z = (x - μ) / σ
• Where:
• z = Z-score
• x = the value being evaluated
• μ = the mean
• σ = the standard deviation
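
The formula z = (x - μ) / σ can be applied directly in code. This is a small sketch with made-up scores; it assumes the population standard deviation (`statistics.pstdev`), since the slide does not say whether the population or the sample version is meant.

```python
import statistics


def z_score(x, data):
    """z = (x - mu) / sigma, using the population standard deviation (assumption)."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    return (x - mu) / sigma


scores = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up scores: mean 5, standard deviation 2
z = z_score(9, scores)
print(z)  # 2.0
```

Here the value 9 lies two standard deviations above the mean, so its z-score is 2.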

3. Distribution-based Clustering

• This clustering approach assumes data is composed of distributions, such as Gaussian distributions.
• In the figure, the distribution-based algorithm clusters data into three Gaussian distributions.
• As distance from the distribution's center increases, the probability that a point belongs to the distribution decreases.

3. Distribution-based Clustering

[Figure: data clustered into three Gaussian distributions]

• E.g. the Expectation Maximization algorithm, which uses the Normal Distribution for clustering the data points.
• It is similar to centroid-based clustering, except that probability, rather than distance to the mean, is used to compute the clusters.
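
The statement that membership probability falls as a point moves away from a distribution's center can be sketched for one-dimensional Gaussians. The code below computes, for a single value, the probability of belonging to each of two Gaussian components; this corresponds to the expectation step of Expectation Maximization only, not the full algorithm. The component weights, means and standard deviations are made-up values, and the function names are illustrative.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    coeff = 1.0 / (std * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-0.5 * ((x - mean) / std) ** 2)


def responsibilities(x, components):
    """components: list of (weight, mean, std). Returns P(component | x)."""
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [v / total for v in weighted]


# Two equally weighted components centred at 0 and 10 (assumed values).
components = [(0.5, 0.0, 1.0), (0.5, 10.0, 1.0)]
near_first = responsibilities(1.0, components)   # close to the first center
midpoint = responsibilities(5.0, components)     # equally far from both centers
print(near_first)
print(midpoint)  # [0.5, 0.5]
```

Unlike K-means, which assigns each point to exactly one centroid, every point here receives a probability for each cluster.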

4. Hierarchical Clustering

• Hierarchical clustering creates a tree of clusters.
• Hierarchical clustering, not surprisingly, is well suited to hierarchical data, such as taxonomies (categorization).
• Taxonomy is a system for naming and organizing things, especially plants and animals, into groups that share similar qualities.

4. Hierarchical Clustering

• In the animal kingdom, animals have been categorized into two main groups: vertebrates and invertebrates.
• This differentiation is mainly based on the presence or absence of the backbone (spinal column).


4. Hierarchical Clustering

[Figure: 5 data points (A, B, C, D, E) plotted on the XY axis. Data points that are close form a cluster. Hierarchy of clusters built: A, B, C, D, E → AB, CD → CDE → ABCDE.]

Hierarchical Clustering

• Approach: Bottom to Up
• Individual elements to clusters
• Combination of clusters according to similarity
• Large clusters
• Also called Agglomerative Clustering
• The hierarchy of clusters that is built is represented by a Dendrogram
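
The bottom-up approach can be sketched as repeated merging of the two closest clusters. This is a minimal sketch under stated assumptions: the coordinates for A to E are invented so that the merge order matches the hierarchy AB, CD, CDE, ABCDE from the figure, and cluster distance is taken as the distance between the closest pair of members (single linkage), which the slides do not specify.

```python
import math
from itertools import combinations


def linkage(points, a, b):
    """Single linkage: distance between the closest members of clusters a and b."""
    return min(math.dist(points[p], points[q]) for p in a for q in b)


def agglomerate(points):
    """points: dict label -> (x, y). Returns the merged clusters in merge order."""
    clusters = [(label,) for label in sorted(points)]   # start: one cluster per point
    merges = []
    while len(clusters) > 1:
        # Find and merge the two most similar (closest) clusters.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: linkage(points, pair[0], pair[1]))
        clusters.remove(a)
        clusters.remove(b)
        merged = tuple(sorted(a + b))
        clusters.append(merged)
        merges.append("".join(merged))
    return merges


# Invented coordinates for the five points of the figure.
points = {"A": (1, 1), "B": (1.5, 1), "C": (5, 4), "D": (5.6, 4.2), "E": (6, 6)}
merges = agglomerate(points)
print(merges)  # ['AB', 'CD', 'CDE', 'ABCDE']
```

Reading the merge list from first to last gives the levels of the dendrogram from the bottom up.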

Hierarchical Clustering

• Approach: Top to Bottom
• Large clusters
• Divide clusters
• Clusters to individual elements
• Also called Divisive Clustering
• The hierarchy of clusters that is built is represented by a Dendrogram

Applications of Clustering

• Marketing: It can be used to cluster different customer segments for marketing purposes.
• Insurance: It is used to cluster customers, policies and frauds.
• Libraries: It is used to cluster different books on the basis of topics and information.
• Biology: It can be used to cluster different species of plants and animals.

Summary

• Clustering
• Types of Clustering
• Applications

Task

• Identify any 5 application areas of clustering in detail and infer which type of clustering technique would be best suited to each one of them. (BT-Level 4)


REFERENCES

• https://www.geeksforgeeks.org/clustering-in-machine-learning/
• https://www.javatpoint.com/clustering-in-machine-learning
• https://developers.google.com/machine-learning/clustering/overview

THANK YOU

For queries
Email: vineet.e13038@cumail.in
