Unsupervised Learning (1)
What is Unsupervised Learning?
Unsupervised learning is a machine learning approach in which algorithms explore unlabeled data to discover patterns.
Unlike supervised learning, it does not require labeled examples or explicit target variables.
Supervised vs. Unsupervised Learning:
Data Input: supervised learning uses labeled data (inputs x, labels y); unsupervised learning uses unlabeled data (inputs x only).
What is Clustering?
Clustering is an unsupervised machine learning method used to identify groups (clusters) of similar data points in an unlabeled dataset.
The algorithm automatically organizes data points into meaningful groups based on similarity or distance criteria
without predefined labels.
Intuition:
Data points within the same cluster share similarities, while data points in different clusters exhibit distinct
characteristics.
Motivation:
Materials science data (structural, compositional, and performance-related) are often complex and unlabeled.
Clustering can uncover meaningful groupings and hidden patterns that provide new insights and guide further investigation.
The K-Means Algorithm
What is K-Means?
K-Means clustering is a popular unsupervised algorithm that partitions data into K distinct clusters based on distance to
centroids.
Step-by-Step Example:
Step 1: Initialization: Randomly choose K initial centroids (cluster centers) from the dataset.
Step 2: Assigning Points to Centroids: Assign each data point to the nearest centroid based on Euclidean distance.
Step 3: Updating (move) Centroids: Calculate new centroid positions by taking the average of all points assigned to each
centroid.
Repeat Steps 2 and 3 until convergence: the centroids no longer change position significantly, and the cluster assignments remain stable.
(Figures illustrating the initialization, assignment, and update steps; source: https://www.nvidia.com/en-au/glossary/k-means/)
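The steps above can be sketched in a few lines of NumPy (an illustrative implementation, not from the slides; the two-blob data and the seed are made up for the example):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-Means following Steps 1-3 above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialization -- pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                  else centroids[j] for j in range(k)])
        # Convergence: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two made-up, well-separated blobs of 50 points each
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
labels, centroids = kmeans(X, k=2)
```

With blobs this far apart, the two clusters found coincide with the two blobs regardless of which points are drawn as initial centroids.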
Formalizing the K-Means Algorithm
Notation:
Dataset: $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, with $x^{(i)} \in \mathbb{R}^n$. $K$: number of clusters.
Centroids: $\mu_1, \mu_2, \dots, \mu_K \in \mathbb{R}^n$.
Algorithm Steps:
1. Initialization: Randomly select $K$ initial centroids $\mu_1, \dots, \mu_K$.
2. Cluster Assignment:
o Assign each data point $x^{(i)}$ to the nearest centroid: $c^{(i)} := \arg\min_k \| x^{(i)} - \mu_k \|^2$.
3. Centroid Update:
o Update centroid positions by computing the mean of assigned points:
$$\mu_k := \frac{1}{|C_k|} \sum_{x^{(i)} \in C_k} x^{(i)}$$
where $C_k$ is the set of points assigned to cluster $k$.
4. Repeat:
o Repeat Steps 2 and 3 until the centroids no longer move significantly or the assignments stop changing.
Optimization Objective in K-Means
Cost (Distortion) Function:
$$J\left(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K\right) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$
The cluster assignment step minimizes $J$ over the assignments $c^{(i)}$ by assigning each point to its nearest centroid.
The centroid update step minimizes $J$ over the centroids $\mu_k$ by moving each centroid to the position that reduces the squared distances within its cluster.
Convergence:
The distortion function monotonically decreases (or remains constant) with each iteration.
The algorithm is therefore guaranteed to converge to a local optimum (a local minimum of the distortion).
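The monotone decrease of the distortion can be checked numerically. This sketch (the data and helper name are my own, made up for illustration) records $J$ after every assignment step:

```python
import numpy as np

def distortion(X, labels, centroids):
    # J = (1/m) * sum_i || x^(i) - mu_{c^(i)} ||^2
    return float(np.mean(np.sum((X - centroids[labels]) ** 2, axis=1)))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # made-up data
centroids = X[rng.choice(200, size=3, replace=False)]

history = []
for _ in range(10):
    # assignment step, then record J, then update step
    labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
    history.append(distortion(X, labels, centroids))
    centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                          else centroids[j] for j in range(3)])

# J never increases from one iteration to the next
assert all(b <= a + 1e-9 for a, b in zip(history, history[1:]))
```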
Practical Aspects – Initialization of K-Means
Importance of Initialization: Different initial centroid positions can lead to very different clustering outcomes.
Common Initialization Strategies:
Random Selection: Choose K random data points from the dataset as initial centroids.
Multiple Initializations:
o Run K-Means multiple times with different initial centroids.
o Select clustering with lowest distortion.
Example of Good vs. Poor Initialization:
Good Initialization: Leads to clear, intuitive clusters with minimal distortion.
Poor Initialization: Can result in poor local minima, suboptimal clusters, and higher distortion values.
Best Practices:
Typically use multiple random initializations (e.g., 50-100 times).
Evaluate and select the clustering result that gives the lowest distortion.
Consider advanced initialization methods (e.g., K-Means++ algorithm) for improved results.
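The multiple-initialization practice can be sketched as follows (the helper, the synthetic three-blob data, and the run count are illustrative assumptions):

```python
import numpy as np

def one_kmeans_run(X, k, rng, n_iters=50):
    """One K-Means run from a random initialization; returns (distortion, labels)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                              else centroids[j] for j in range(k)])
    labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
    J = float(np.mean(np.sum((X - centroids[labels]) ** 2, axis=1)))
    return J, labels

# Synthetic data with 3 clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in (0.0, 3.0, 6.0)])

# Run K-Means 50 times and keep the clustering with the lowest distortion
runs = [one_kmeans_run(X, k=3, rng=rng) for _ in range(50)]
best_J, best_labels = min(runs, key=lambda r: r[0])
```

In practice, scikit-learn's `KMeans` performs this automatically via its `n_init` parameter, and defaults to k-means++ initialization.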
The Challenge of Selecting 'k'
Why is choosing 'k' difficult?
Ambiguity Illustrated:
Different observers may suggest different numbers of clusters from the same dataset.
Example: One scientist sees 2 distinct clusters, another might identify 4 distinct clusters in the same dataset.
Ambiguity occurs because clustering outcomes depend on interpretation, research context, and data complexity.
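One concrete way to see the difficulty: the distortion always shrinks as k grows, so it cannot select k by itself. A common heuristic (an assumption here, not stated in the slides) is to compute the distortion over a range of k values and look for the "elbow" where the curve flattens. A sketch with synthetic three-cluster data:

```python
import numpy as np

def kmeans_distortion(X, k, n_init=5, n_iters=50):
    """Best distortion over a few random initializations (illustrative helper)."""
    best = np.inf
    for seed in range(n_init):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
            centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                  else centroids[j] for j in range(k)])
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        best = min(best, float(np.mean(np.sum((X - centroids[labels]) ** 2, axis=1))))
    return best

# Synthetic data with 3 true clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in (0.0, 4.0, 8.0)])

# Distortion drops sharply until k reaches the true cluster count, then flattens
Js = {k: kmeans_distortion(X, k) for k in range(1, 7)}
```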
Anomaly Detection vs. Supervised Learning:
Number of positive examples: very few (0-20) for anomaly detection; moderate to large for supervised learning.
The Gaussian distribution models the probability of a random variable x being seen in the dataset:
$$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$
Parameters:
$\mu$: mean (center of the distribution); $\sigma^2$: variance (spread).
Visual Intuition:
A bell-shaped curve: probability density is highest near the mean and low for points far from the mean.
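A quick numerical check of this intuition, using the density formula above (a short sketch; the standard normal parameters are chosen purely for illustration):

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density p(x; mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Density is highest at the mean and falls off in the tails
p_mean = gaussian_pdf(0.0, mu=0.0, sigma2=1.0)  # ≈ 0.3989
p_far = gaussian_pdf(3.0, mu=0.0, sigma2=1.0)   # ≈ 0.0044
```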
Anomaly Detection Algorithm: Mathematical Formulation
Step 1: Model the Data with Gaussian Distributions: Assume each feature $x_j$ is modeled as a Gaussian: $x_j \sim \mathcal{N}(\mu_j, \sigma_j^2)$.
o Estimate the mean ($\mu_j$) and variance ($\sigma_j^2$) for each feature $j$:
$$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}, \qquad \sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x_j^{(i)} - \mu_j \right)^2$$
Step 2: Compute the Probability of a Data Point
Given a new point $x$, compute its probability:
$$p(x) = \prod_{j=1}^{n} p\!\left(x_j; \mu_j, \sigma_j^2\right) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left( -\frac{(x_j - \mu_j)^2}{2\sigma_j^2} \right)$$
Flag $x$ as anomalous if $p(x) < \varepsilon$ for a chosen threshold $\varepsilon$.
Example:
Anomalous engine data points have significantly lower probability than normal points
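Putting Steps 1 and 2 together (a minimal sketch; the engine-like feature values, the function names, and the threshold $\varepsilon = 10^{-4}$ are illustrative assumptions, not from the slides):

```python
import numpy as np

def fit_gaussian(X):
    """Step 1: estimate per-feature mean and variance."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # (1/m) * sum_i (x_j^(i) - mu_j)^2
    return mu, sigma2

def p(x, mu, sigma2):
    """Step 2: product of per-feature univariate Gaussian densities."""
    dens = np.exp(-((x - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return float(np.prod(dens))

# Made-up "normal engine" measurements: two features (e.g. heat, vibration)
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[10.0, 5.0], scale=[1.0, 0.5], size=(500, 2))
mu, sigma2 = fit_gaussian(X_train)

x_normal = np.array([10.2, 5.1])    # close to the training distribution
x_anomaly = np.array([14.0, 2.0])   # far from it
epsilon = 1e-4                      # illustrative threshold

is_anomaly = p(x_anomaly, mu, sigma2) < epsilon
```

The anomalous point lands several standard deviations from the mean in both features, so its probability falls far below that of the normal point.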