Unsupervised Learning
What is Unsupervised Learning?
 Unsupervised learning is a machine learning approach in which algorithms explore unlabeled data to discover patterns.
 Unlike supervised learning, it does not require labeled examples or explicit target variables.

Objectives of Unsupervised Learning:

 Discover hidden patterns or intrinsic structures within data.
 Group or cluster similar data points based on inherent features.
 Reduce dimensionality for easier interpretation of complex data.

Examples:

 Customer Segmentation: Grouping customers by purchasing behaviors to target marketing campaigns.
 Social Network Analysis: Finding groups of similar users based on interactions.
 Bioinformatics: Identifying genetic profiles associated with diseases.
https://machinelearningmastery.com/types-of-classification-in-machine-learning/
https://www.nvidia.com/en-au/glossary/k-means/
Types of Unsupervised Learning
1. Clustering: Identify groups or clusters in data based on similarity.
Examples:
o K-Means Clustering
o Hierarchical Clustering
2. Anomaly Detection: Detect unusual or rare events that deviate significantly from normal behavior.
Examples:
o Gaussian-based anomaly detection
o Autoencoder-based methods
3. Dimensionality Reduction: Reduce the number of features while retaining key information for easier interpretation and visualization (see the PCA sketch after this list).
Examples:
o Principal Component Analysis (PCA)
o t-distributed Stochastic Neighbor Embedding (t-SNE)
4. Association Rule Learning: Discover interesting relationships or frequent co-occurrences among variables.
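As a concrete illustration of dimensionality reduction, here is a minimal sketch using scikit-learn's PCA; the random dataset and the choice of 2 components are assumptions for illustration only.

    import numpy as np
    from sklearn.decomposition import PCA

    # Illustrative unlabeled dataset: 100 samples, 5 features
    rng = np.random.default_rng(seed=0)
    X = rng.normal(size=(100, 5))

    # Project onto the 2 directions that retain the most variance
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                 # (100, 2)
    print(pca.explained_variance_ratio_)   # variance retained per component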
Supervised vs. Unsupervised Learning
Aspect             | Supervised Learning                                               | Unsupervised Learning
-------------------|-------------------------------------------------------------------|----------------------------------------------------------------------
Data Input         | Labeled data (inputs x, labels y)                                 | Unlabeled data (inputs x only)
Learning Objective | Learn to predict target labels (y) accurately                     | Find intrinsic structures, patterns, or relationships in data
Common Algorithms  | Linear regression, Decision trees, Neural networks (with labels)  | K-Means clustering, Principal Component Analysis (PCA), Autoencoders
Use-Cases          | Classification, Regression                                        | Clustering, Dimensionality reduction, Anomaly detection
Outcome            | Accurate predictions                                              | Meaningful data structures and insights


What is Clustering?
Definition:

 Clustering is an unsupervised machine learning method used to identify groups (clusters) of similar data points in an
unlabeled dataset.
 The algorithm automatically organizes data points into meaningful groups based on similarity or distance criteria, without predefined labels.

Intuition:

 Data points within the same cluster share similarities, while data points in different clusters exhibit distinct
characteristics.

Motivation:

 Materials science data (structural, compositional, and performance-related) are often complex and unlabeled.
 Clustering can uncover meaningful groupings and hidden patterns that yield new insights and guide further investigation.
The K-Means Algorithm
What is K-Means?

K-Means clustering is a popular unsupervised algorithm that partitions data into K distinct clusters based on distance to
centroids.

Step-by-Step Example:

Step 1: Initialization: Randomly choose K initial centroids (cluster centers) from the dataset.

Step 2: Assigning Points to Centroids: Assign each data point to the nearest centroid based on Euclidean distance.

Step 3: Updating (Moving) Centroids: Calculate new centroid positions by taking the average of all points assigned to each centroid.

Repeat Steps 2 and 3 until convergence: centroids no longer change position significantly and cluster assignments remain stable.
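A minimal sketch of this loop using scikit-learn's KMeans; the toy three-blob dataset and K = 3 are assumptions for illustration only.

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy 2-D dataset: three loose blobs
    rng = np.random.default_rng(seed=42)
    X = np.vstack([
        rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
        rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
        rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
    ])

    # fit() runs the assign/update loop internally until convergence
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    print(kmeans.cluster_centers_)   # final centroid positions
    print(kmeans.labels_[:10])       # cluster assignments of the first 10 points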
Step 1: Initialization

· Initialization: The K-Means algorithm starts by selecting initial positions for the K cluster centers.
· Number of Clusters (K): The number K is chosen by the user based on the specific application or problem.
· Initial Locations: Users may either specify initial locations manually or allow the algorithm to choose them randomly from the dataset.
· Importance: Good initial positions help ensure accurate clustering and faster convergence.
Step 2: Assigning Points to Centroids: Assign each data point to the nearest centroid based on Euclidean distance.

Step 3: Updating (Moving) Centroids: Calculate new centroid positions by taking the average of all points assigned to each centroid.
https://www.nvidia.com/en-au/glossary/k-means/
Repeat Steps 2 and 3 until convergence: centroids no longer change position significantly and cluster assignments remain stable.
Formalizing the K-Means Algorithm
Notation:
 Dataset: $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, $x^{(i)} \in \mathbb{R}^n$; K: number of clusters.
 Centroids: $\mu_1, \mu_2, \dots, \mu_K \in \mathbb{R}^n$.
Algorithm Steps:
1. Initialization: Randomly select K initial centroids.
2. Cluster Assignment:
o Assign each data point $x^{(i)}$ to the nearest centroid: $c^{(i)} := \arg\min_{k} \lVert x^{(i)} - \mu_k \rVert^2$.
3. Centroid Update:
o Update each centroid by computing the mean of its assigned points: $\mu_k := \frac{1}{\lvert C_k \rvert} \sum_{i \in C_k} x^{(i)}$, where $C_k$ is the set of points assigned to cluster k.
4. Repeat:
o Repeat Steps 2 and 3 until centroids no longer move significantly or assignments stop changing.
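The two core steps translate directly into NumPy. The sketch below is a bare-bones implementation of the algorithm above; the function name, convergence test, and iteration cap are my own choices, and it assumes no cluster ever becomes empty.

    import numpy as np

    def kmeans(X, K, n_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # Initialization: pick K distinct data points as initial centroids
        centroids = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(n_iters):
            # Cluster assignment: index of the nearest centroid per point
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Centroid update: mean of the points assigned to each cluster
            # (assumes every cluster keeps at least one point)
            new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
            # Converged: centroids no longer move
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, labels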
Optimization Objective in K-Means
Cost (Distortion) Function:

 K-Means optimizes the distortion (cost) function, defined as:


o Measures average squared distance between points and their assigned centroids.

Goal: Minimize the distortion to achieve compact, well-defined clusters.

Why This Works:

 Cluster assignment step minimizes distortion by assigning points to the nearest centroid.
 Centroid update step minimizes distortion by moving centroids to positions that reduce squared distances within
clusters.

Convergence:

 Distortion function monotonically decreases (or remains constant) with each iteration.
 The algorithm is guaranteed to converge to a local optimum (a minimum of the distortion).
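A short helper makes the distortion concrete; it assumes the kmeans function and dataset X from the earlier sketches.

    def distortion(X, centroids, labels):
        # Average squared distance between each point and its assigned centroid
        diffs = X - centroids[labels]
        return np.mean(np.sum(diffs ** 2, axis=1))

    centroids, labels = kmeans(X, K=3)
    print(distortion(X, centroids, labels))   # J for the final clustering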
Practical Aspects – Initialization of K-Means
Importance of Initialization: Different initial centroid positions can lead to very different clustering outcomes.
Common Initialization Strategies:
 Random Selection: Choose K random data points from the dataset as initial centroids.
 Multiple Initializations:
o Run K-Means multiple times with different initial centroids.
o Select clustering with lowest distortion.
Example of Good vs. Poor Initialization:
 Good Initialization: Leads to clear, intuitive clusters with minimal distortion.
 Poor Initialization: Can result in poor local minima, suboptimal clusters, and higher distortion values.
Best Practices:
 Typically use multiple random initializations (e.g., 50-100 times).
 Evaluate and select the clustering result that gives the lowest distortion.
 Consider advanced initialization methods (e.g., K-Means++ algorithm) for improved results.
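In scikit-learn these best practices are single parameters; the sketch below reuses the dataset X from the earlier example, and n_clusters=3 is an assumption.

    from sklearn.cluster import KMeans

    kmeans = KMeans(
        n_clusters=3,
        init="k-means++",   # spread-out initial centroids (K-Means++ seeding)
        n_init=50,          # 50 random restarts; keep the lowest-distortion run
        random_state=0,
    ).fit(X)

    print(kmeans.inertia_)  # total squared distance (distortion) of the best run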
The Challenge of Selecting 'k'
Why is choosing 'k' difficult?

 In clustering, the "correct" number of clusters often isn’t clearly defined.
 No explicit labels or ground truth to validate optimal cluster count.

Ambiguity Illustrated:

 Different observers may suggest different numbers of clusters from the same dataset.
 Example: One scientist sees 2 distinct clusters, another might identify 4 distinct clusters in the same dataset.
 Ambiguity occurs because clustering outcomes depend on interpretation, research context, and data complexity.

Implications:

 No single universally "correct" value for 'k'.
 Choice of clusters often depends on application-specific criteria and practical considerations.
https://pub.towardsai.net/fully-explained-k-means-clustering-with-python-e7caa573176a
Methods for Choosing 'k': The Elbow Method
What is the Elbow Method? A method to estimate a suitable number of clusters (k) by observing how clustering
performance improves with increasing cluster count.
How it works (step-by-step):
1. Run K-Means clustering algorithm multiple times, varying the number of clusters (k = 1, 2, …, n).
2. Calculate and record the distortion (cost function) for each k.
3. Plot distortion values vs. number of clusters (k).
Identifying the "Elbow":
 Look for the point ("elbow") where the decrease in distortion significantly slows down.
 The "elbow" point is often chosen as a suitable number of clusters.
Pros: Simple, intuitive visual tool; easy to interpret and widely used in practice.
Cons: Sometimes ambiguous; no clear elbow point. Also subjective: Different interpretations of the same plot are
possible.
Illustrative Example: A clear elbow at k = 3 might suggest 3 clusters as optimal for a particular dataset.
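A minimal elbow-plot sketch with scikit-learn and matplotlib; the dataset X is reused from the earlier example and the range k = 1..10 is arbitrary.

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    ks = range(1, 11)
    distortions = []
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        distortions.append(km.inertia_)   # sum of squared distances to centroids

    plt.plot(ks, distortions, marker="o")
    plt.xlabel("Number of clusters k")
    plt.ylabel("Distortion")
    plt.title("Elbow Method")
    plt.show()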
Practical Considerations in Materials Applications
Why Context Matters in Materials Science: Optimal number of clusters (k) strongly depends on specific research
objectives or downstream applications.
Examples:
 Materials Phase Identification:
o Choosing k based on known physical or chemical phases observed in experimental or simulation data.
o Balancing detail (more clusters) with interpretability and simplicity (fewer clusters).
 Alloy Design and Composition Screening:
o Selecting k according to practical manufacturing constraints or categories of alloys to investigate.
o More clusters allow finer distinctions in compositions; fewer clusters reduce complexity and simplify
exploration.
 Product Design and Commercial Constraints:
o Example from manufacturing (e.g., clothing): Number of clusters influenced by production feasibility, cost
constraints, or market segmentation needs.
o Fewer clusters simplify production but might compromise the ideal performance or customer fit. More
clusters could increase cost and complexity.
What is Anomaly Detection?
Definition: Anomaly detection algorithms identify data points or events that deviate significantly from expected behavior or typical data patterns.

Intuition:

 Algorithms learn patterns from a dataset representing "normal" behavior.
 Points far from typical patterns are flagged as anomalies.

Illustrative Example:

 Aircraft engine manufacturing:
o Normal: Typical range of heat and vibration signatures.
o Anomalous: Unusually high or low heat/vibration indicating potential failure or defects.
Why Anomaly Detection in Materials Science?
Motivation:

 Detecting defects or failures.
 Preventing catastrophic failures in critical materials or components.
 Improving quality control during manufacturing.

Specific Applications:

 Defect Detection: Identifying anomalies in additive manufacturing processes, metal casting, or semiconductor fabrication.
 Material Failure Prediction: Early detection of unusual mechanical properties or degradation.
 Quality Assurance: Continuous monitoring of products (e.g., aerospace alloys, composites) to flag unusual behavior or defects early.
Anomaly Detection vs. Supervised Learning
Aspect                      | Anomaly Detection                                               | Supervised Learning
----------------------------|------------------------------------------------------------------|----------------------------------------------------------
Labels Needed?              | Primarily unlabeled; few labeled anomalies for evaluation only   | Requires sufficient labeled positive (anomaly) examples
Number of Positive Examples | Very few (0-20)                                                  | Moderate to large
Nature of Anomalies         | Typically unknown or novel anomalies                             | Typically known anomalies
Goal                        | Identify novel deviations                                        | Recognize previously seen anomalies
Source: Pattern Recognition, "Anomaly Detection Challenges"
The Gaussian (Normal) Distribution
Definition:

 The Gaussian distribution models the probability of a random variable x taking a given value:

$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$
Parameters:

 μ (mean): Center of the distribution.

 σ² (variance): Spread or dispersion of data around the mean.

Visual Intuition:

 Bell-shaped curve indicating high probability near the mean and low probability for points far from the mean.
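For reference, evaluating this density is a one-liner with scipy.stats; the values of μ and σ below are arbitrary.

    from scipy.stats import norm

    mu, sigma = 0.0, 1.0
    print(norm.pdf(0.0, loc=mu, scale=sigma))   # highest density at the mean
    print(norm.pdf(3.0, loc=mu, scale=sigma))   # far lower density 3 sigma away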
Anomaly Detection Algorithm: Mathematical Formulation
Step 1: Model Data with a Gaussian Distribution: Assume each feature $x_j$ follows a Gaussian:
o Estimate the mean ($\mu_j$) and variance ($\sigma_j^2$) of each feature $x_j$:
$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}, \qquad \sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x_j^{(i)} - \mu_j \right)^2$
Step 2: Compute the Probability of a Data Point
 Given a new point x, compute its probability:
$p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$
Step 3: Flag Anomalies
 Define a threshold (ϵ) and flag x as an anomaly if:
$p(x) < \epsilon$
Example:
 Anomalous engine data points have significantly lower probability than normal points.
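Putting the three steps together, here is a minimal NumPy/SciPy sketch of this per-feature Gaussian detector; the synthetic "normal" engine data, the test point, and the threshold eps are assumptions for illustration only.

    import numpy as np
    from scipy.stats import norm

    # Step 1: fit a Gaussian to each feature of "normal" training data
    # (columns: heat, vibration -- synthetic values for illustration)
    X_train = np.random.default_rng(0).normal(
        loc=(5.0, 10.0), scale=(1.0, 2.0), size=(500, 2))
    mu = X_train.mean(axis=0)      # per-feature mean mu_j
    sigma = X_train.std(axis=0)    # per-feature sigma_j (sqrt of variance)

    def p(x):
        # Step 2: product of per-feature Gaussian densities
        return np.prod(norm.pdf(x, loc=mu, scale=sigma))

    # Step 3: flag anomaly if probability falls below threshold epsilon
    eps = 1e-6
    x_new = np.array([9.5, 1.0])      # far from typical heat/vibration values
    print(p(x_new), p(x_new) < eps)   # very low probability -> flagged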
