Unit-5 Outlier Analysis
Concepts and Techniques
What Are Outliers?
Outlier: A data object that deviates significantly from the normal
objects as if it were generated by a different mechanism
Ex.: Unusual credit card purchases; in sports: Michael Jordan, Wayne
Gretzky, ...
Outliers are different from the noise data
Noise is random error or variance in a measured variable
Applications: customer segmentation, medical analysis, etc.
Types of Outliers (I)
Three kinds: global, contextual and collective outliers
Global outlier (or point anomaly)
An object is a global outlier if it significantly deviates from the rest of the data set
Ex. Intrusion detection in computer networks
Issue: Find an appropriate measurement of deviation
Challenge: the border between normal objects and outliers is often a gray area
Noise may blur the distinction between normal objects and outliers; it may
help hide outliers and reduce the effectiveness of outlier detection
Understandability
Understand why these are outliers: Justification of the detection
Categorizing outlier detection methods:
Supervised, semi-supervised vs. unsupervised methods
Based on assumptions about normal data and outliers:
Statistical, proximity-based, and clustering-based methods
Outlier Detection I: Supervised Methods
Modeling outlier detection as a classification problem
Samples examined by domain experts used for training & testing
Methods for learning a classifier for outlier detection effectively:
Model normal objects & report those not matching the model as
outliers, or
Model outliers and treat those not matching the model as normal
Challenges
Imbalanced classes, i.e., outliers are rare: Boost the outlier class
and make up some artificial outliers
Catch as many outliers as possible, i.e., recall is more important
than accuracy (i.e., not mislabeling normal objects as outliers)
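The two ideas above (boosting the rare outlier class, then classifying) can be sketched on toy data. Everything here is hypothetical: the 1-D samples, the oversampling factor, and the simple 1-nearest-neighbor-style classifier are all illustrative choices, not a prescribed method.

```python
# Sketch of supervised outlier detection with an imbalanced class:
# duplicate the rare outlier class, then classify with a k-NN vote.
def oversample(samples, labels, minority=1, factor=5):
    """Boost the rare outlier class by duplicating its samples."""
    out_s, out_l = list(samples), list(labels)
    for x, y in zip(samples, labels):
        if y == minority:
            out_s += [x] * (factor - 1)
            out_l += [y] * (factor - 1)
    return out_s, out_l

def knn_predict(train_x, train_y, query, k=3):
    """Classify by majority vote among the k nearest training samples."""
    dists = sorted((abs(x - query), y) for x, y in zip(train_x, train_y))
    votes = [y for _, y in dists[:k]]
    return max(set(votes), key=votes.count)

# Toy 1-D data: normal objects near 10, one labeled outlier at 100
X = [9.5, 10.1, 10.4, 9.8, 10.0, 100.0]
y = [0, 0, 0, 0, 0, 1]            # 1 = outlier (rare class)
Xb, yb = oversample(X, y)          # boost the outlier class
print(knn_predict(Xb, yb, 95.0))   # 1: a value of 95 is flagged as an outlier
```

Without the oversampling step, the single outlier example would be outvoted by normal neighbors for any k > 1, which is exactly the imbalance problem described above.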
Outlier Detection II: Unsupervised Methods
Assume the normal objects are somewhat "clustered" into multiple
groups, each having some distinct features
An outlier is expected to be far away from any groups of normal objects
Weakness: Cannot detect collective outliers effectively
Normal objects may not share any strong patterns, but the collective
outliers may share high similarity in a small area
Costly, since clustering is done first, yet there are far fewer outliers than
normal objects
Newer methods: tackle outliers directly
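The clustering-based idea above can be sketched in a few lines: once cluster centers are known, an object far from every center is a candidate outlier. The 1-D data, the pre-computed centers, and the distance cutoff are all assumptions for illustration.

```python
# Sketch of clustering-based unsupervised outlier detection:
# flag objects far from every cluster center.
def outlier_score(point, centers):
    """Distance to the nearest cluster center; large => likely outlier."""
    return min(abs(point - c) for c in centers)

data = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8, 12.0]   # 12.0 sits far from both groups
centers = [1.0, 5.0]                           # assume clustering already ran
threshold = 3.0                                # assumed cutoff
outliers = [x for x in data if outlier_score(x, centers) > threshold]
print(outliers)   # [12.0]
```

Note the weakness mentioned above: if several anomalous points landed close together, they could form their own "cluster" and escape this check, which is why collective outliers need separate treatment.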
Outlier Detection III: Semi-Supervised Methods
Situation: In many applications, the amount of labeled data is often
small: labels could be on outliers only, normal objects only, or both
Semi-supervised outlier detection: regarded as an application of
semi-supervised learning
If some labeled normal objects are available
Use the labeled examples and the proximate unlabeled objects to
train a model for normal objects
Those not fitting the model of normal objects are detected as outliers
If only some labeled outliers are available, a small number of labeled
outliers may not cover the possible outliers well
To improve the quality of outlier detection, one can get help from
models for normal objects learned from unsupervised methods
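The "labeled normals plus proximate unlabeled objects" recipe above can be sketched as follows. The toy values, the 2.0 proximity radius, and the z-score cutoff of 3 are assumptions, and the normal model here is simply a single Gaussian fit.

```python
# Sketch: semi-supervised detection with labeled *normal* objects only.
# Fit a Gaussian to the labeled normals plus nearby unlabeled objects,
# then flag objects that do not fit the model.
import statistics

labeled_normal = [10.0, 9.8, 10.3, 10.1]
unlabeled = [9.9, 10.2, 25.0]

# Absorb unlabeled objects close to the normal model into the training set
mu0 = statistics.mean(labeled_normal)
train = labeled_normal + [x for x in unlabeled if abs(x - mu0) < 2.0]

mu, sigma = statistics.mean(train), statistics.stdev(train)
def z(x):
    return abs(x - mu) / sigma

outliers = [x for x in unlabeled if z(x) > 3.0]
print(outliers)  # [25.0]
```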
Outlier Detection (1): Statistical Methods
Statistical methods (also known as model-based methods) assume that the normal
data follow some statistical model (a stochastic model)
The data not following the model are outliers.
Outlier Detection (2): Proximity-Based Methods
An object is an outlier if the nearest neighbors of the object are far away, i.e., the
proximity of the object significantly deviates from the proximity of most of the
other objects in the same data set
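A common way to make "nearest neighbors are far away" concrete is to score each object by the distance to its k-th nearest neighbor. This is a minimal 1-D sketch with toy data and an assumed k = 2; real implementations would use a multidimensional distance and an index structure.

```python
# Sketch of proximity-based detection: score each object by the distance
# to its k-th nearest neighbor; unusually large scores indicate outliers.
def knn_distance(data, i, k=2):
    dists = sorted(abs(data[i] - x) for j, x in enumerate(data) if j != i)
    return dists[k - 1]

data = [1.0, 1.1, 0.9, 1.2, 8.0]
scores = [knn_distance(data, i) for i in range(len(data))]
print(scores)
# 8.0 gets by far the largest score, so it is reported as an outlier
```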
Chapter 12. Outlier Analysis
Outlier and Outlier Analysis
Outlier Detection Methods
Statistical Approaches
Statistical Approaches
Statistical approaches assume that the objects in a data set are
generated by a stochastic process (a generative model)
Idea: learn a generative model fitting the given data set, and then
identify the objects in low probability regions of the model as outliers
Methods are divided into two categories: parametric vs. non-parametric
Parametric method
Assumes that the normal data is generated by a parametric distribution
with parameter θ
Non-parametric method
Does not assume an a-priori statistical model; determines the model
from the input data
Parametric Methods I: Grubbs' Test
Univariate outlier detection: Grubbs' test (maximum normed residual
test) ─ a statistical method assuming a normal distribution
For each object x in a data set, compute its z-score z = |x − x̄| / s, where x̄
and s are the sample mean and standard deviation; x is an outlier if
z ≥ ((N − 1)/√N) √( t²α/(2N),N−2 / (N − 2 + t²α/(2N),N−2) )
where N is the number of objects and tα/(2N),N−2 is the critical value of the
t-distribution at significance level α/(2N) with N − 2 degrees of freedom
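The test can be sketched on toy data. Computing the exact cutoff requires a t-distribution critical value; to keep the example self-contained, G_crit below is a hard-coded assumed value (roughly the two-sided 5% level for N = 8) rather than computed from the t-distribution.

```python
# Sketch of Grubbs' test (maximum normed residual test) on toy data.
import statistics

data = [10.0, 10.2, 9.9, 10.1, 9.8, 10.3, 10.0, 15.0]
N = len(data)
mean, sd = statistics.mean(data), statistics.stdev(data)

G = max(abs(x - mean) for x in data) / sd   # maximum normed residual
G_crit = 2.13                               # assumed critical value for N = 8
print(G > G_crit)  # True: 15.0 is declared an outlier
```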
Parametric Methods II: Detection of Multivariate Outliers
Multivariate data: A data set involving two or more attributes or
variables
Transform the multivariate outlier detection task into a univariate outlier
detection problem
Method 1. Compute the Mahalanobis distance
Let ō be the mean vector for a multivariate data set. The Mahalanobis
distance for an object o to ō is MDist(o, ō) = (o − ō)ᵀ S⁻¹ (o − ō),
where S is the covariance matrix
Use Grubbs' test on this measure to detect outliers
Method 2. Use the χ²-statistic:
χ² = Σⁿᵢ₌₁ (oᵢ − Eᵢ)² / Eᵢ
where oᵢ is the value of o in the i-th dimension, Eᵢ is the mean of the
i-th dimension among all objects, and n is the dimensionality
If the χ²-statistic is large, then object o is an outlier
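The χ²-statistic above is easy to compute directly. This sketch uses toy 2-D data; in practice the statistic would be compared against a chosen threshold rather than just ranked.

```python
# Sketch of the chi-square statistic for multivariate outlier detection:
# compare each object against the per-dimension means.
def chi_square_stat(o, means):
    return sum((oi - ei) ** 2 / ei for oi, ei in zip(o, means))

data = [(2.0, 4.0), (2.2, 3.9), (1.9, 4.1), (8.0, 9.0)]
n_dims = 2
means = [sum(o[i] for o in data) / len(data) for i in range(n_dims)]
scores = [chi_square_stat(o, means) for o in data]
print(scores)  # the last object gets a much larger statistic than the others
```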
Parametric Methods III: Using a Mixture of Parametric Distributions
Assuming the data are generated by a single normal distribution
can sometimes be an oversimplification
Example: objects lying between two clusters cannot be captured as
outliers, since they are close to the estimated mean
To overcome this problem, assume the normal data is generated by two
normal distributions θ₁ = (μ₁, σ₁) and θ₂ = (μ₂, σ₂). For any object o in the
data set, the probability that o is generated by the mixture of the two
distributions is
Pr(o | θ₁, θ₂) = fθ₁(o) + fθ₂(o)
where fθ₁ and fθ₂ are the probability density functions of θ₁ and θ₂
Then use EM algorithm to learn the parameters μ1, σ1, μ2, σ2 from data
An object o is an outlier if it does not belong to any cluster
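The EM step above can be sketched in 1-D. This is a deliberately minimal version: equal, fixed mixing weights, hand-picked initial guesses, a floor on the standard deviations, and toy data with one object (3.0) sitting between the two clusters.

```python
# Minimal 1-D sketch of fitting a two-Gaussian mixture with EM and
# flagging the lowest-probability object.
import math

def pdf(x, mu, sigma):
    """Normal probability density function."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9, 5.1, 3.0]
mu1, s1, mu2, s2 = 0.0, 1.0, 6.0, 1.0        # rough initial guesses

for _ in range(50):                           # EM iterations
    # E-step: responsibility of component 1 for each point
    r = [pdf(x, mu1, s1) / (pdf(x, mu1, s1) + pdf(x, mu2, s2)) for x in data]
    # M-step: re-estimate means and standard deviations
    w1, w2 = sum(r), sum(1 - ri for ri in r)
    mu1 = sum(ri * x for ri, x in zip(r, data)) / w1
    mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / w2
    s1 = max(0.1, math.sqrt(sum(ri * (x - mu1) ** 2 for ri, x in zip(r, data)) / w1))
    s2 = max(0.1, math.sqrt(sum((1 - ri) * (x - mu2) ** 2 for ri, x in zip(r, data)) / w2))

# An object is suspicious if its mixture density is low
density = [pdf(x, mu1, s1) + pdf(x, mu2, s2) for x in data]
print(data[density.index(min(density))])  # 3.0, the point between the clusters
```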
Non-Parametric Methods: Detection Using Histograms
The model of normal data is learned from the
input data without any a priori structure.
Often makes fewer assumptions about the data,
and thus can be applicable in more scenarios
Outlier detection using histogram:
Example: consider a histogram of purchase amounts in transactions
A transaction in the amount of $7,500 is an outlier, since only 0.2% of
transactions have an amount higher than $5,000
Problem: Hard to choose an appropriate bin size for histogram
Too small a bin size → normal objects fall in empty/rare bins: false positives
Too big a bin size → outliers fall in some frequent bins: false negatives
Solution: Adopt kernel density estimation to estimate the probability density
distribution of the data. If the estimated density function is high, the object is
likely normal. Otherwise, it is likely an outlier.
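The histogram approach can be sketched directly: bucket the values and report objects that land in rare bins. The purchase amounts, the bin width, and the frequency cutoff below are all assumed toy values, and the bin-size sensitivity discussed above applies to the choice of bin_width here too.

```python
# Sketch of histogram-based detection: an object falling in a bin with
# very low relative frequency is reported as an outlier.
from collections import Counter

amounts = [60, 65, 70, 75, 80, 85, 90, 95, 100, 7500]
bin_width = 1000
bins = Counter(a // bin_width for a in amounts)

def is_outlier(a, min_fraction=0.15):
    """Flag values whose bin holds too small a fraction of the data."""
    return bins[a // bin_width] / len(amounts) < min_fraction

print([a for a in amounts if is_outlier(a)])  # [7500]
```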
Major Statistical Data Mining Methods
Regression
Generalized Linear Models
Analysis of Variance
Mixed-Effect Models
Factor Analysis
Discriminant Analysis
Survival Analysis
Statistical Data Mining (1)
Scientific and Statistical Data Mining (2)
Regression
Mixed-effect models
For analyzing grouped data, i.e. data that can be classified
according to one or more grouping variables
Typically describe relationships between a response
variable and some covariates in data grouped according to
one or more factors
Scientific and Statistical Data Mining (3)
Regression trees
Binary trees used for classification
and prediction
Similar to decision trees: tests are
performed at the internal nodes
In a regression tree the mean of the
objective attribute is computed and
used as the predicted value
Analysis of variance
Analyze experimental data for two or
more populations described by a
numeric response variable and one
or more categorical variables
(factors)
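The regression-tree idea above, predicting with the mean of the objective attribute at each leaf, can be sketched with a single split (a "stump"). The data and the hand-picked split point are toy assumptions; a real tree learner would search for the best split and recurse.

```python
# Sketch of a one-split regression tree: each leaf predicts the mean
# of the objective attribute of the training objects that reach it.
def fit_stump(xs, ys, split):
    left = [y for x, y in zip(xs, ys) if x <= split]
    right = [y for x, y in zip(xs, ys) if x > split]
    return sum(left) / len(left), sum(right) / len(right)

xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
left_mean, right_mean = fit_stump(xs, ys, split=5)

def predict(x, split=5):
    return left_mean if x <= split else right_mean

print(predict(2), predict(11))  # 6.0 21.0
```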
Statistical Data Mining (4)
Factor analysis
Determines which variables are combined to generate a given factor
Survival analysis
Predicts the probability that a patient undergoing a medical treatment
would survive at least to time t (life span prediction)
Data Mining Applications
Industries
Data Mining in Science and Engineering
Data Mining for Financial Data Analysis (I)
Financial data collected in banks and financial institutions
are often relatively complete, reliable, and of high quality
Design and construction of data warehouses for
multidimensional data analysis and data mining
View the debt and revenue changes by month, by
region, by sector, and by other factors
Access statistical information such as max, min, total,
average, trend, etc.
Loan payment prediction/consumer credit policy analysis
feature selection and attribute relevance ranking
Data Mining and Recommender Systems
Recommender systems: Personalization, making product
recommendations that are likely to be of interest to a user
Approaches: Content-based, collaborative, or their hybrid
Content-based: Recommends items that are similar to items the
user preferred or queried in the past
Collaborative filtering: Considers a user's social environment, i.e., the
opinions of other customers who have similar tastes or preferences
Data mining and recommender systems
Users C × items S: extrapolate from known ratings to predict unknown
ratings for user-item combinations
Memory-based method often uses k-nearest neighbor approach
Model-based method uses a collection of ratings to learn a model
(e.g., probabilistic models, clustering, Bayesian networks, etc.)
Hybrid approaches integrate both to improve performance (e.g.,
using ensemble)
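The memory-based k-nearest-neighbor approach above can be sketched on a toy ratings matrix. The users, items, and the crude similarity measure (a count of agreeing ratings) are all illustrative assumptions; real systems use similarities such as cosine or Pearson correlation.

```python
# Sketch of memory-based collaborative filtering: predict a missing
# rating from the k users with the most similar known ratings.
ratings = {                       # user -> {item: rating}
    "alice": {"a": 5, "b": 3, "c": 4},
    "bob":   {"a": 5, "b": 3, "c": 4, "d": 5},
    "carol": {"a": 1, "b": 5, "c": 2, "d": 1},
}

def similarity(u, v):
    """Toy similarity: number of items the two users rate identically."""
    common = set(ratings[u]) & set(ratings[v])
    return sum(ratings[u][i] == ratings[v][i] for i in common)

def predict(user, item, k=1):
    # Average the ratings of the k most similar users who rated the item
    neighbors = sorted(
        (u for u in ratings if u != user and item in ratings[u]),
        key=lambda u: similarity(user, u), reverse=True)[:k]
    return sum(ratings[u][item] for u in neighbors) / len(neighbors)

print(predict("alice", "d"))  # 5.0, following alice's nearest neighbor bob
```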
Summary
We present a high-level overview of mining complex data types
Statistical data mining methods, such as regression, generalized linear
models, analysis of variance, etc., are popularly adopted
Researchers also try to build theoretical foundations for data mining
Visual/audio data mining has been popular and effective
Application-based mining integrates domain-specific knowledge with
data analysis techniques and provides mission-specific solutions
Ubiquitous data mining and invisible data mining are penetrating our
daily lives
Privacy and data security are important issues in data mining, and
privacy-preserving data mining has been developed recently
Our discussion on trends in data mining shows that data mining is a
promising, young field, with great, strategic importance