CS37300 Data Mining & Machine Learning: Anomaly Detection
Anomaly Detection
Module 1: Overview
Prof. Chris Clifton
7 April 2020
Some materials from Introduction to Data Mining by Tan, Steinbach and Kumar
Task
• Anomalies/outliers: data points that are considerably
“different” from the remainder of the data
• Variants:
– Find all points with anomaly scores > threshold
– Find point with largest anomaly score
– Given a database D with mostly normal points, compute the
anomaly score of a point x with respect to D
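The three variants above can be sketched with any scoring function. The example below is a minimal illustration, not the course's method: it assumes a k-nearest-neighbor distance as the anomaly score, and the database `D`, the value of `k`, and the threshold are all hypothetical.

```python
import math

def knn_score(x, D, k=2):
    """Anomaly score of x with respect to D: distance to its k-th nearest point in D.
    (Assumed scoring function; the slides do not prescribe one.)"""
    dists = sorted(math.dist(x, p) for p in D)
    return dists[k - 1]

# Hypothetical 2-D database: a tight cluster plus one far-away point.
D = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]

# Score every point against the rest of D (leave-one-out),
# so a point is never its own nearest neighbor.
scores = [knn_score(p, D[:i] + D[i + 1:]) for i, p in enumerate(D)]

# Variant 1: all points with anomaly score above a threshold.
threshold = 1.0  # hypothetical cutoff
flagged = [p for p, s in zip(D, scores) if s > threshold]

# Variant 2: the single point with the largest anomaly score.
top = max(zip(D, scores), key=lambda pair: pair[1])[0]

# Variant 3: anomaly score of a new point x with respect to D.
x = (0.05, 0.05)
x_score = knn_score(x, D)

print(flagged, top, x_score)
```

Variant 1 needs a threshold chosen in advance, whereas variant 2 only needs a ranking, which is often easier to justify when no labeled anomalies are available for tuning.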
Examples
• Fraud detection
• Intrusion detection
• Ecosystem disturbances
• System monitoring
• Biosurveillance/public health
• Data preprocessing
Types of anomalies
• Data from different classes
– “An outlier is an observation that differs so much from other
observations as to arouse suspicion that it was generated by a
different mechanism”
• Natural variation
– Extreme or unlikely variations are often interesting
• Data measurement and collection errors
– Remove these during preprocessing
Defining an outlier
• The notion of an outlier is highly subjective and
domain-dependent
• However, most definitions can be viewed as defining a
distribution for “normal” data and then looking for
deviations from that distribution
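As one concrete instance of "define a distribution for normal data, then look for deviations", the sketch below assumes the normal data is roughly Gaussian and measures deviation as a z-score. The Gaussian assumption, the sample values, and the cutoff of 3 standard deviations are illustrative choices, not part of the original material.

```python
import statistics

# Hypothetical reference sample assumed to contain only "normal" readings.
normal_sample = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9]

# Model of "normal": a Gaussian described by its mean and standard deviation.
mu = statistics.mean(normal_sample)
sigma = statistics.stdev(normal_sample)

def z_score(x):
    """How many standard deviations x lies from the mean of the normal model."""
    return abs(x - mu) / sigma

def is_deviation(x, cutoff=3.0):
    """Flag x when it deviates from the normal model by more than `cutoff` sigmas."""
    return z_score(x) > cutoff

for x in (10.05, 12.0):
    print(x, round(z_score(x), 2), is_deviation(x))
```

Other definitions of an outlier fit the same pattern by swapping the model of "normal" (for example a density estimate or a cluster structure) and the measure of deviation.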
[Figure: example data with normal regions N1 and N2 and anomalies o1, o2, and O3; a second panel labels Normal, Anomaly, and Anomalous Subsequence]
Source: Lazarevic et al., ECML/PKDD’08 Tutorial
Anomaly detection
• Challenges
– How many attributes are used to define an outlier?
– How many outliers are there in the data?
– Class labels are costly (evaluation can be challenging)
– Skewed class distribution (finding needles in a haystack)
• Working assumption:
– There are considerably more “normal” observations than
“abnormal” observations in the data
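The skewed class distribution is one reason evaluation is hard: plain accuracy rewards a detector that never raises an alarm. The arithmetic below uses made-up counts (990 normal points, 10 anomalies) purely for illustration.

```python
# Hypothetical ground truth: 0 = normal, 1 = anomaly.
y_true = [0] * 990 + [1] * 10

# A trivial "detector" that labels every point as normal.
y_pred = [0] * len(y_true)

# Accuracy: fraction of points labeled correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the anomaly class: fraction of true anomalies that were found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

The detector is 99% accurate yet finds none of the anomalies, so measures that focus on the rare class (recall, precision) say more here than accuracy does.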
Approaches
• Supervised
– Labels available for both normal data and anomalies
– Similar to classification with imbalanced classes
• Semi-supervised
– Labels available only for normal data
• Unsupervised
– No labels assumed
– Based on the assumption that anomalies are very rare compared
to normal data
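The difference between the semi-supervised and unsupervised settings is mainly which data the model of "normal" is fitted on. The sketch below assumes a median/MAD (median absolute deviation) score, chosen here because it is barely affected by a few rare anomalies; the data, the function names, and the cutoff of 5 are hypothetical. The supervised setting is omitted because it reduces to training a classifier on imbalanced labeled data.

```python
import statistics

def fit_robust(values):
    """Fit a simple model of 'normal': the median and the median absolute deviation."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return med, mad

def score(x, model):
    """Deviation of x from the fitted median, in units of MAD."""
    med, mad = model
    return abs(x - med) / mad

CUTOFF = 5.0  # hypothetical threshold

# Unsupervised: no labels. Fit on everything and rely on anomalies being rare,
# so that they hardly move the median or the MAD.
unlabeled = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9, 25.0]
unsup_model = fit_robust(unlabeled)
unsup_flagged = [x for x in unlabeled if score(x, unsup_model) > CUTOFF]

# Semi-supervised: labels exist only for normal data. Fit on the known-normal
# points, then score previously unseen observations against that model.
normal_only = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9]
semi_model = fit_robust(normal_only)
new_points = [10.05, 12.0]
semi_flagged = [x for x in new_points if score(x, semi_model) > CUTOFF]

print(unsup_flagged, semi_flagged)
```

If anomalies were not rare, the unsupervised fit would absorb them into its notion of "normal", which is why that setting depends on the rarity assumption.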