CS37300 Data Mining & Machine Learning: Anomaly Detection

This document provides an overview of anomaly detection from a machine learning course. It defines anomalies as data points that are considerably different from the majority of the data. It discusses types of anomalies, including point, contextual, and collective anomalies. It also outlines challenges in anomaly detection, such as defining outliers and evaluating models under skewed class distributions. The document concludes by outlining supervised, semi-supervised, and unsupervised approaches to anomaly detection.


CS37300

Data Mining & Machine Learning

Anomaly Detection
Module 1: Overview
Prof. Chris Clifton
7 April 2020

Some materials from Introduction to Data Mining by Tan, Steinbach and Kumar
Task
• Anomalies/outliers: data points that are considerably
“different” from the remainder of the data
• Variants:
– Find all points with anomaly scores > threshold
– Find point with largest anomaly score
– Given a database D with mostly normal points, compute the
anomaly score of a point x with respect to D
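The three variants above can be sketched with a single scoring function. This is a minimal illustration, not the method used in the course: the score (distance to the k-th nearest neighbor), the names `anomaly_score` and `score_all`, the toy data, and the threshold are all assumptions made for the example.

```python
import math

def anomaly_score(x, D, k=2):
    """Score x by its distance to its k-th nearest neighbor in D.
    (An assumed scoring rule; the slides do not prescribe one.)"""
    dists = sorted(math.dist(x, p) for p in D)
    return dists[k - 1]

def score_all(D, k=2):
    """Score each point of D against the remaining points of D."""
    return [anomaly_score(p, D[:i] + D[i + 1:], k) for i, p in enumerate(D)]

# Hypothetical database: four clustered points and one far away
D = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = score_all(D)

# Variant 1: all points with anomaly score > threshold
threshold = 1.0
flagged = [p for p, s in zip(D, scores) if s > threshold]

# Variant 2: the point with the largest anomaly score
top = max(zip(D, scores), key=lambda ps: ps[1])[0]

# Variant 3: anomaly score of a new point x with respect to D
x_score = anomaly_score((0.05, 0.05), D)
```

With this toy data, both `flagged` and `top` pick out `(5.0, 5.0)`, while the new point near the cluster scores well below the threshold.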
Examples
• Fraud detection
• Intrusion detection
• Ecosystem disturbances
• System monitoring
• Biosurveillance/public health
• Data preprocessing
Types of anomalies
• Data from different classes
– “An outlier is an observation that differs so much from other
observations as to arouse suspicion that it was generated by a
different mechanism” (Hawkins, 1980)
• Natural variation
– Extreme or unlikely variations are often interesting
• Data measurement and collection errors
– Preprocess to remove
Defining an outlier
• Notion of outlier is highly subjective and domain-dependent
• However, most definitions can be viewed as defining a
distribution for “normal” data and then looking for
deviations from that distribution

Source: Osmar Zaiane, UAlberta, PKDD
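One way to read "define a distribution for normal data, then look for deviations" is sketched below. It assumes a univariate Gaussian model and a 3-standard-deviation cutoff; both choices, along with the function names and data, are illustrative assumptions rather than anything stated in the slides.

```python
import statistics

def fit_normal_model(values):
    """Summarize 'normal' data by its sample mean and standard deviation."""
    return statistics.mean(values), statistics.stdev(values)

def deviation(x, model):
    """Number of standard deviations between x and the model mean."""
    mean, std = model
    return abs(x - mean) / std

def is_outlier(x, model, cutoff=3.0):
    """Flag x when it deviates from the model by more than `cutoff` (assumed value)."""
    return deviation(x, model) > cutoff

# Hypothetical measurements taken to represent normal behavior
normal_data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
model = fit_normal_model(normal_data)

typical_flag = is_outlier(10.1, model)   # close to the mean
extreme_flag = is_outlier(12.0, model)   # far from the mean
```

The subjectivity noted above shows up directly in the code: a different distribution family or cutoff would change which points count as outliers.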


Point anomalies
• An individual data instance is anomalous with respect to
the data

[Figure: plot with axis label Y and point/region labels N1, N2, o1, o2, O3]

Source: Lazarevic et al., ECML/PKDD’08 Tutorial


Contextual anomalies
• An individual data instance is anomalous within a
context
• Requires a notion of context
• Also referred to as conditional anomalies (Song et al.,
TKDE ’06)

[Figure: plot with points labeled Normal and Anomaly]

Source: Lazarevic et al., ECML/PKDD’08 Tutorial
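A contextual anomaly can be sketched by scoring each value only against other values that share its context. The season/temperature records below are hypothetical, and the within-context z-score is an assumed scoring rule chosen for brevity.

```python
import statistics
from collections import defaultdict

def contextual_scores(records):
    """records: list of (context, value) pairs.
    Returns each value's z-score computed within its own context."""
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)
    stats = {c: (statistics.mean(v), statistics.stdev(v))
             for c, v in by_context.items()}
    return [abs(value - stats[c][0]) / stats[c][1] for c, value in records]

# Hypothetical temperature readings; 25 appears in both seasons
records = ([("winter", v) for v in [2, 3, 1, 2, 3, 1, 25]]
           + [("summer", v) for v in [24, 26, 25, 27, 25, 26]])

scores = contextual_scores(records)
most_anomalous = max(range(len(records)), key=scores.__getitem__)
```

Here the value 25 is unremarkable in the summer context but receives the highest score in the winter context, which is why a notion of context is required.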


Collective anomalies
• A collection of related data instances is anomalous
• Requires a relationship among data instances, e.g.:
– Sequential, Spatial, Graph Data
• The individual instances within a collective anomaly are
not anomalous by themselves

[Figure: sequence with an anomalous subsequence marked]

Source: Lazarevic et al., ECML/PKDD’08 Tutorial
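The sketch below illustrates a collective anomaly in sequential data: every individual value is common, yet one run of values is unusual as a group. The signal, the window length, and the scoring rule (how far a window's count of value changes lies from the median window) are all assumptions made for the example.

```python
import statistics

def window_transitions(seq, w):
    """Count value changes inside each sliding window of length w."""
    return [sum(seq[i + j] != seq[i + j + 1] for j in range(w - 1))
            for i in range(len(seq) - w + 1)]

# Hypothetical signal: alternates 0/1, except for a flat run of 1s
seq = [0, 1] * 6 + [1] * 6 + [0, 1] * 6
w = 5

counts = window_transitions(seq, w)
typical = statistics.median(counts)          # behavior of a typical window
scores = [abs(c - typical) for c in counts]  # deviation of each window

# Start index of the most anomalous window (first one on ties)
start = max(range(len(scores)), key=scores.__getitem__)
anomalous_subsequence = seq[start:start + w]
```

Both 0 and 1 occur throughout the signal, so no single instance stands out; only the relationship among consecutive instances exposes the flat run.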
Anomaly detection
• Challenges
– How many attributes are used to define an outlier?
– How many outliers are there in the data?
– Class labels are costly (evaluation can be challenging)
– Skewed class distribution (finding needles in a haystack)
• Working assumption:
– There are considerably more “normal” observations than
“abnormal” observations in the data
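The effect of a skewed class distribution on evaluation can be shown with a small calculation. The label counts and the two hypothetical detectors below are invented for illustration; the point is only that accuracy can look excellent while no anomaly is found.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, precision, recall), treating label 1 as 'anomaly'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical data: 990 normal points (0) and 10 anomalies (1)
y_true = [0] * 990 + [1] * 10

# Detector A labels everything normal
always_normal = [0] * 1000
acc_a, prec_a, rec_a = evaluate(y_true, always_normal)

# Detector B flags 20 points, 8 of which are true anomalies
detector_b = [0] * 978 + [1] * 12 + [1] * 8 + [0] * 2
acc_b, prec_b, rec_b = evaluate(y_true, detector_b)
```

Detector A reaches 99% accuracy with zero recall, while detector B has slightly lower accuracy but finds 8 of the 10 anomalies, so precision and recall on the anomaly class are more informative here than accuracy.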
Approaches
• Supervised
– Labels available for both normal data and anomalies
– Similar to classification with imbalanced classes
• Semi-supervised
– Labels available only for normal data
• Unsupervised
– No labels assumed
– Based on the assumption that anomalies are very rare compared
to normal data
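The difference between the semi-supervised and unsupervised settings lies in what the model is fit on, which the sketch below contrasts using the same simple score. The Gaussian-style z-score, the names, and the data are assumptions for illustration; the supervised setting is omitted because it reduces to ordinary classification with imbalanced classes.

```python
import statistics

def fit(values):
    """Model the data by its sample mean and standard deviation."""
    return statistics.mean(values), statistics.stdev(values)

def z_score(x, model):
    mean, std = model
    return abs(x - mean) / std

# Semi-supervised: labels exist only for normal data, so fit on that alone
labeled_normal = [10.0, 10.2, 9.8, 10.1, 9.9]
semi_model = fit(labeled_normal)

# Unsupervised: no labels, so fit on everything and rely on the
# assumption that anomalies are rare enough not to distort the model
unlabeled = labeled_normal + [15.0]
unsup_model = fit(unlabeled)

candidate = 15.0
z_semi = z_score(candidate, semi_model)
z_unsup = z_score(candidate, unsup_model)

# In the unsupervised setting, points are ranked by score
ranked = sorted(unlabeled, key=lambda v: z_score(v, unsup_model), reverse=True)
```

The candidate still ranks first without labels, but its score is much smaller than under the model fit on clean normal data, because the anomaly itself inflates the estimated spread. This is one reason the unsupervised setting depends on anomalies being very rare.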
