Unit-5 Outlier Analysis

Outliers are data objects that significantly deviate from normal patterns, and their detection is crucial in various applications like fraud detection and medical analysis. There are three types of outliers: global, contextual, and collective, each defined by different criteria of deviation. Outlier detection methods can be categorized into supervised, unsupervised, and semi-supervised approaches, utilizing statistical, proximity-based, and clustering techniques to identify anomalies in data.


Data Mining: Concepts and Techniques
Outlier Analysis

1
What Are Outliers?
 Outlier: A data object that deviates significantly from the normal
objects, as if it were generated by a different mechanism
 Ex.: an unusual credit card purchase; in sports: Michael Jordan, Wayne
Gretzky, ...
 Outliers are different from noise data
 Noise is random error or variance in a measured variable
 Noise should be removed before outlier detection
 Outliers are interesting: they violate the mechanism that generates the
normal data
 Outlier detection vs. novelty detection: at an early stage a novelty may
look like an outlier, but it is later merged into the model
 Applications:
 Credit card fraud detection
 Telecom fraud detection
 Customer segmentation
 Medical analysis
2
Types of Outliers (I)
 Three kinds: global, contextual, and collective outliers
 Global outlier (or point anomaly)
 An object is a global outlier if it significantly deviates from the
rest of the data set
 Ex. Intrusion detection in computer networks
 Issue: find an appropriate measure of deviation
 Contextual outlier (or conditional outlier)
 An object is a contextual outlier if it deviates significantly with
respect to a selected context
 Ex. 80 °F in Urbana: an outlier? (depends on whether it is summer or winter)
 Attributes of the data objects should be divided into two groups
 Contextual attributes: define the context, e.g., time & location
 Behavioral attributes: characteristics of the object, used in outlier
evaluation, e.g., temperature
 Can be viewed as a generalization of local outliers, i.e., objects
whose density significantly deviates from that of their local area
 Issue: how to define or formulate a meaningful context?
3
Types of Outliers (II)
 Collective outliers
 A subset of data objects that collectively deviate
significantly from the whole data set, even if the
individual data objects may not be outliers
 Applications: e.g., intrusion detection:
when a number of computers keep sending
denial-of-service packets to each other
 Detection of collective outliers
 Consider not only the behavior of individual objects, but also that of
groups of objects
 Needs background knowledge of the relationship
among data objects, such as a distance or similarity measure
on objects
 A data set may have multiple types of outliers
 One object may belong to more than one type of outlier
4
Challenges of Outlier Detection
 Modeling normal objects and outliers properly
 Hard to enumerate all possible normal behaviors in an application
 The border between normal and outlier objects is often a gray area
 Application-specific outlier detection
 The choice of distance measure among objects and the model of
relationship among objects are often application-dependent
 E.g., in clinical data a small deviation could be an outlier, while
marketing analysis tolerates larger fluctuations
 Handling noise in outlier detection
 Noise may distort the normal objects and blur the distinction
between normal objects and outliers; it may help hide outliers and
reduce the effectiveness of outlier detection
 Understandability
 Understand why these are outliers: justification of the detection
 Specify the degree of an outlier: the unlikelihood of the object being
generated by a normal mechanism
5
Chapter 12. Outlier Analysis
 Outlier and Outlier Analysis
 Outlier Detection Methods
 Statistical Approaches
 Proximity-Based Approaches
 Clustering-Based Approaches
 Classification Approaches
 Mining Contextual and Collective Outliers
 Outlier Detection in High Dimensional Data
 Summary
6
Outlier Detection I: Supervised Methods
 Two ways to categorize outlier detection methods:
 Based on whether user-labeled examples of outliers can be obtained:
supervised, semi-supervised vs. unsupervised methods
 Based on assumptions about normal data and outliers:
statistical, proximity-based, and clustering-based methods
 Outlier Detection I: Supervised Methods
 Modeling outlier detection as a classification problem
 Samples examined by domain experts are used for training & testing
 Methods for learning a classifier for outlier detection effectively:
 Model normal objects & report those not matching the model as
outliers, or
 Model outliers and treat those not matching the model as normal
 Challenges
 Imbalanced classes, i.e., outliers are rare: boost the outlier class
and make up some artificial outliers
 Catch as many outliers as possible, i.e., recall is more important
than accuracy (i.e., not mislabeling normal objects as outliers)
7
Outlier Detection II: Unsupervised
Methods
 Assume the normal objects are somewhat "clustered" into multiple
groups, each having some distinct features
 An outlier is expected to be far away from any group of normal objects
 Weakness: cannot detect collective outliers effectively
 Normal objects may not share any strong patterns, but the collective
outliers may share high similarity in a small area
 Ex. In some intrusion or virus detection, normal activities are diverse
 Unsupervised methods may have a high false positive rate but still
miss many real outliers
 Supervised methods can be more effective, e.g., identify attacks on
some key resources
 Many clustering methods can be adapted into unsupervised methods
 Find clusters first, then outliers: objects not belonging to any cluster
 Problem 1: hard to distinguish noise from outliers
 Problem 2: costly, since clustering comes first, yet there are far
fewer outliers than normal objects
 Newer methods: tackle outliers directly
8
Outlier Detection III: Semi-Supervised
Methods
 Situation: In many applications, the amount of labeled data is often
small: labels could be on outliers only, normal objects only, or both
 Semi-supervised outlier detection: regarded as an application of semi-
supervised learning
 If some labeled normal objects are available
 Use the labeled examples and the proximate unlabeled objects to
train a model for normal objects
 Those not fitting the model of normal objects are detected as outliers
 If only some labeled outliers are available, the small number of labeled
outliers may not cover the possible outliers well
 To improve the quality of outlier detection, one can get help from
models for normal objects learned from unsupervised methods

9
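The "labels on normal objects only" case above can be sketched in a few lines. This is a toy illustration, not the slides' method: the closeness radius and the univariate Gaussian model are assumptions made here for brevity. Unlabeled points near a labeled normal object join the training set; remaining points are then scored against the fitted model.

```python
import statistics

def semi_supervised_scores(labeled_normal, unlabeled, radius=2.0):
    """Pull unlabeled points close to a labeled normal point into the
    training set, fit a simple Gaussian to it, and score every unlabeled
    point by its z-score under that model (high score = outlier candidate).
    'radius' is an illustrative closeness threshold."""
    train = list(labeled_normal)
    for u in unlabeled:
        if any(abs(u - x) <= radius for x in labeled_normal):
            train.append(u)
    mu = statistics.mean(train)
    sigma = statistics.pstdev(train)
    return {u: abs(u - mu) / sigma for u in unlabeled}

scores = semi_supervised_scores(
    labeled_normal=[10.0, 11.0, 12.0],
    unlabeled=[10.5, 11.5, 30.0],   # 30.0 is far from every labeled normal
)
```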
Outlier Detection (1): Statistical Methods
 Statistical methods (also known as model-based methods) assume that the normal
data follow some statistical model (a stochastic model)
 The data not following the model are outliers
 Example (right figure): First use a Gaussian distribution
to model the normal data
 For each object y in region R, estimate gD(y), the
probability that y fits the Gaussian distribution
 If gD(y) is very low, y is unlikely to be generated by the
Gaussian model, and thus is an outlier
 Effectiveness of statistical methods: highly depends on whether the
assumed statistical model holds in the real data
 There are rich alternatives among various statistical models
 E.g., parametric vs. non-parametric
10
Outlier Detection (2): Proximity-Based
Methods
 An object is an outlier if the nearest neighbors of the object are far away, i.e., the
proximity of the object deviates significantly from the proximity of most of the
other objects in the same data set
 Example (right figure): Model the proximity of an
object using its 3 nearest neighbors
 Objects in region R are substantially different
from the other objects in the data set
 Thus the objects in R are outliers
 The effectiveness of proximity-based methods highly relies on the
proximity measure
 In some applications, proximity or distance measures cannot be
obtained easily
 Often have difficulty finding a group of outliers that stay close to
each other
 Two major types of proximity-based outlier detection:
 Distance-based vs. density-based
11
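A minimal distance-based sketch of the idea above: score each object by the distance to its k-th nearest neighbor (k = 3, matching the slide's example; the point set itself is invented for illustration).

```python
import math

def knn_outlier_scores(points, k=3):
    """Score each point by the distance to its k-th nearest neighbor:
    the larger the distance, the more the point deviates from its
    neighborhood."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q)
                       for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])   # distance to the k-th nearest neighbor
    return scores

# a tight cluster plus one far-away point
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = knn_outlier_scores(pts, k=3)
```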
Outlier Detection (3): Clustering-Based
Methods
 Normal data belong to large and dense clusters, whereas
outliers belong to small or sparse clusters, or do not belong
to any clusters
 Example (right figure): two clusters
 All points not in R form a large cluster
 The two points in R form a tiny cluster,
thus are outliers
 Since there are many clustering methods, there are many
clustering-based outlier detection methods as well
 Clustering is expensive: a straightforward adaptation of a
clustering method for outlier detection can be costly and
does not scale up well to large data sets

12
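The tiny-cluster idea above can be sketched directly. Here the centroids are chosen by hand, standing in for the output of a clustering step, and the minimum cluster size is an illustrative parameter:

```python
import math
from collections import Counter

def cluster_based_outliers(points, centroids, min_cluster_size=3):
    """Assign each point to its nearest centroid; points falling in
    clusters smaller than min_cluster_size are reported as outliers."""
    labels = [min(range(len(centroids)),
                  key=lambda c: math.dist(p, centroids[c]))
              for p in points]
    sizes = Counter(labels)
    return [p for p, lab in zip(points, labels)
            if sizes[lab] < min_cluster_size]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (9, 9), (9.5, 9.0)]
outliers = cluster_based_outliers(pts, centroids=[(0.5, 0.5), (9.25, 9.0)])
```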
Chapter 12. Outlier Analysis
 Outlier and Outlier Analysis
 Outlier Detection Methods
 Statistical Approaches

13
Statistical Approaches
 Statistical approaches assume that the objects in a data set are
generated by a stochastic process (a generative model)
 Idea: learn a generative model fitting the given data set, and then
identify the objects in low-probability regions of the model as outliers
 Methods are divided into two categories: parametric vs. non-
parametric
 Parametric methods
 Assume that the normal data are generated by a parametric
distribution with parameter θ
 The probability density function of the parametric distribution, f(x, θ),
gives the probability that object x is generated by the distribution
 The smaller this value, the more likely x is an outlier
 Non-parametric methods
 Do not assume an a-priori statistical model; instead, determine the
model from the input data
 Not completely parameter-free, but the number and nature of the
parameters are flexible and not fixed in advance
 Examples: histogram and kernel density estimation
14
Univariate Outliers Based on Normal
Distribution
 Univariate data: A data set involving only one attribute or variable
 Often assume that data are generated from a normal distribution, learn
the parameters from the input data, and identify the points with low
probability as outliers
 Ex: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
 Use the maximum likelihood method to estimate μ and σ
 Taking derivatives of the log-likelihood with respect to μ and σ2, we
derive the maximum likelihood estimates
μ̂ = (1/n) Σ xi and σ̂2 = (1/n) Σ (xi – μ̂)2
 For the above data with n = 10, we have μ̂ = 28.61 and σ̂ = 1.51
 Then (24 – 28.61) / 1.51 = –3.04 < –3, so 24 is an outlier, since under
the normal assumption the region μ ± 3σ contains 99.7% of the data
15
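The estimation above can be checked directly. A short sketch using the 1/n "maximum likelihood" variance; 24.0 turns out to be by far the most extreme point, with a z-score of roughly –3, while every other point lies well within one standard deviation of the mean.

```python
import math

def fit_gaussian_mle(data):
    """Maximum likelihood estimates for a univariate Gaussian:
    mu_hat = sample mean, sigma_hat^2 = (1/n) * sum((x - mu_hat)^2)."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return mu, math.sqrt(var)

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
mu, sigma = fit_gaussian_mle(temps)          # mu = 28.61
z = [(x - mu) / sigma for x in temps]
most_extreme = max(temps, key=lambda x: abs((x - mu) / sigma))
```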
Parametric Methods I: Grubbs' Test
 Univariate outlier detection: Grubbs' test (maximum normed
residual test) ─ another statistical method under the normal distribution
 For each object x in a data set, compute its z-score z = |x – x̄| / s,
where s is the sample standard deviation; x is an outlier if
z ≥ ((N – 1) / √N) · √( t² / (N – 2 + t²) )
 where t = t_(α/(2N), N–2) is the value taken by a t-distribution at a
significance level of α/(2N) with N – 2 degrees of freedom, and N is
the number of objects in the data set
16
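A sketch of Grubbs' test with only the standard library. The t-distribution critical value `t_crit` must be supplied from a t-table (or, e.g., `scipy.stats.t.ppf`), since the standard library has no t quantile function; the data set here is invented.

```python
import math
import statistics

def grubbs_statistic(data):
    """G = max |x - mean| / s, with s the sample standard deviation;
    also returns the most suspect value."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)           # uses 1/(N-1)
    suspect = max(data, key=lambda v: abs(v - mean))
    return abs(suspect - mean) / s, suspect

def grubbs_threshold(n, t_crit):
    """Critical value ((N-1)/sqrt(N)) * sqrt(t^2 / (N-2 + t^2)),
    where t_crit = t_(alpha/(2N), N-2) comes from a t-table."""
    return ((n - 1) / math.sqrt(n)) * math.sqrt(
        t_crit ** 2 / (n - 2 + t_crit ** 2))

G, suspect = grubbs_statistic([1.0, 2.0, 3.0, 100.0])
```

Note that G is bounded above by (N – 1)/√N, which is why the threshold formula never exceeds that value either.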
Parametric Methods II: Detection of
Multivariate Outliers
 Multivariate data: A data set involving two or more attributes or
variables
 Transform the multivariate outlier detection task into a univariate
outlier detection problem
 Method 1. Compute the Mahalanobis distance
 Let ō be the mean vector for a multivariate data set. The Mahalanobis
distance from an object o to ō is MDist(o, ō) = (o – ō)T S –1 (o – ō),
where S is the covariance matrix
 Use Grubbs' test on this measure to detect outliers
 Method 2. Use the χ2-statistic:
χ2 = Σ i=1..n (oi – Ei)2 / Ei
 where Ei is the mean of the i-th dimension among all objects, and n is
the dimensionality
 If the χ2-statistic is large, then object o is an outlier
17
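Method 1 can be sketched with NumPy (the data set is invented for illustration; in practice one would follow up with Grubbs' test on the resulting distances):

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X to the mean vector:
    MDist^2(o, o_bar) = (o - o_bar)^T S^{-1} (o - o_bar)."""
    o_bar = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diffs = X - o_bar
    # row-wise quadratic form diffs_i^T S^{-1} diffs_i
    return np.einsum('ij,jk,ik->i', diffs, S_inv, diffs)

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [1.0, 1.0], [0.5, 0.5], [5.0, 5.0]])
d2 = mahalanobis_sq(X)   # the last row has the largest distance
```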
Parametric Methods III: Using Mixture of
Parametric Distributions
 Assuming the data are generated by a single normal distribution
can sometimes be overly simplistic
 Example (right figure): The objects between the two
clusters cannot be captured as outliers since they
are close to the estimated mean
 To overcome this problem, assume the normal data are generated by two
normal distributions. For any object o in the data set, the probability that
o is generated by the mixture of the two distributions is given by
Pr(o | θ1, θ2) = fθ1(o) + fθ2(o)
where fθ1 and fθ2 are the probability density functions of θ1 and θ2
 Then use the EM algorithm to learn the parameters μ1, σ1, μ2, σ2 from the data
 An object o is an outlier if it does not belong to any cluster
18
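Once the parameters are known, the two-component density above can be scored directly. In this sketch the parameters are fixed by hand rather than learned by EM, to keep it short; an object midway between the two clusters receives a far lower density than cluster members, so it is flagged.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density f_theta(x)."""
    return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
            / (sigma * math.sqrt(2 * math.pi)))

def mixture_density(x, components):
    """Pr(x | theta1, theta2) = f_theta1(x) + f_theta2(x), as on the slide."""
    return sum(gaussian_pdf(x, mu, sigma) for mu, sigma in components)

components = [(0.0, 1.0), (10.0, 1.0)]   # stand-ins for EM-learned parameters
low = mixture_density(5.0, components)    # between the clusters: tiny density
high = mixture_density(0.0, components)   # at a cluster center: high density
```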
Non-Parametric Methods: Detection Using
Histogram
 The model of normal data is learned from the
input data without any a priori structure.
 Often makes fewer assumptions about the data,
and thus can be applicable in more scenarios
 Outlier detection using histogram:

Figure shows the histogram of purchase amounts in transactions

A transaction in the amount of $7,500 is an outlier, since only 0.2% of
transactions have an amount higher than $5,000
 Problem: Hard to choose an appropriate bin size for histogram

Too small bin size → normal objects in empty/rare bins, false positive

Too big bin size → outliers in some frequent bins, false negative
 Solution: Adopt kernel density estimation to estimate the probability density
distribution of the data. If the estimated density function is high, the object is
likely normal. Otherwise, it is likely an outlier.
19
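A histogram score can be as simple as the fraction of transactions falling in an object's bin; values in rare bins are outlier candidates. The toy amounts and bin width below are invented for illustration.

```python
from collections import Counter

def histogram_scores(values, bin_width):
    """Score each value by the fraction of the data in its bin;
    a low score marks an outlier candidate."""
    bins = Counter(int(v // bin_width) for v in values)
    n = len(values)
    return [bins[int(v // bin_width)] / n for v in values]

amounts = [20, 35, 40, 55, 60, 65, 70, 80, 95, 7500]
scores = histogram_scores(amounts, bin_width=100)
# the $7,500 transaction sits alone in its bin (score 0.1);
# all other amounts share the first bin (score 0.9)
```

Kernel density estimation, as the slide notes, plays the same role without requiring a fixed bin size.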
Major Statistical Data Mining Methods

 Regression
 Generalized Linear
Model
 Analysis of Variance
 Mixed-Effect Models
 Factor Analysis
 Discriminant Analysis
 Survival Analysis
20
Statistical Data Mining (1)

 There are many well-established statistical techniques for data
analysis, particularly for numeric data
 applied extensively to data from scientific experiments and data
from economics and the social sciences
 Regression
 predict the value of a response
(dependent) variable from one or
more predictor (independent)
variables, where the variables are
numeric
 forms of regression: linear,
multiple, weighted, polynomial,
nonparametric, and robust

21
Scientific and Statistical Data Mining (2)

 Generalized linear models
 allow a categorical response variable (or
some transformation of it) to be related
to a set of predictor variables
 similar to the modeling of a numeric
response variable using linear
regression
 include logistic regression and Poisson
regression
 Mixed-effect models
 For analyzing grouped data, i.e., data that can be classified
according to one or more grouping variables
 Typically describe relationships between a response
variable and some covariates in data grouped according to
one or more factors
22
Scientific and Statistical Data Mining (3)

 Regression trees
 Binary trees used for classification
and prediction
 Similar to decision trees: tests are
performed at the internal nodes
 In a regression tree the mean of the
objective attribute is computed and
used as the predicted value
 Analysis of variance
 Analyzes experimental data for two or
more populations described by a
numeric response variable and one
or more categorical variables
(factors)
23
Statistical Data Mining (4)
 Factor analysis
 determine which variables are
combined to generate a given factor
 e.g., for many psychiatric data, one
can indirectly measure other
quantities (such as test scores) that
reflect the factor of interest
 Discriminant analysis
 predict a categorical response
variable, commonly used in social
science
 Attempts to determine several
discriminant functions (linear
combinations of the independent
variables) that discriminate among
the groups defined by the response
variable
www.spss.com/datamine/factor.htm
24
Statistical Data Mining (5)

 Time series: many methods such as autoregression,


ARIMA (Autoregressive integrated moving-average
modeling), long memory time-series modeling
 Quality control: displays group summary charts

 Survival analysis
 Predicts the probability that a patient undergoing a
medical treatment would survive at least to time t
(life span prediction)
25
Data Mining Applications

 Data mining: A young discipline with broad and diverse
applications
 There still exists a nontrivial gap between generic data
mining methods and effective and scalable data mining
tools for domain-specific applications
 Some application domains (briefly discussed here)
 Data Mining for Financial Data Analysis
 Data Mining for Retail and Telecommunication
Industries
 Data Mining in Science and Engineering
 Data Mining for Intrusion Detection and Prevention
 Data Mining and Recommender Systems
26
Data Mining for Financial Data Analysis (I)
 Financial data collected in banks and financial institutions
are often relatively complete, reliable, and of high quality
 Design and construction of data warehouses for
multidimensional data analysis and data mining
 View the debt and revenue changes by month, by
region, by sector, and by other factors
 Access statistical information such as max, min, total,
average, trend, etc.
 Loan payment prediction/consumer credit policy analysis
 feature selection and attribute relevance ranking

 Loan payment performance

 Consumer credit rating

27
Data Mining for Financial Data Analysis (II)

 Classification and clustering of customers for targeted
marketing
 multidimensional segmentation by nearest-neighbor
methods, classification, decision trees, etc. to identify
customer groups or associate a new customer with an
appropriate customer group
 Detection of money laundering and other financial crimes
 integration of data from multiple DBs (e.g., bank
transactions, federal/state crime history DBs)
 Tools: data visualization, linkage analysis,
classification, clustering tools, outlier analysis, and
sequential pattern analysis tools (find unusual access
sequences)
28
Data Mining for Intrusion Detection and
Prevention
 Majority of intrusion detection and prevention systems use
 Signature-based detection: use signatures, attack patterns that are
preconfigured and predetermined by domain experts
 Anomaly-based detection: build profiles (models of normal
behavior) and detect those that deviate substantially from the
profiles
 What data mining can help
 New data mining algorithms for intrusion detection
 Association, correlation, and discriminative pattern analysis help
select and build discriminative classifiers
 Analysis of stream data: outlier detection, clustering, model shifting
 Distributed data mining
 Visualization and querying tools

29
Data Mining and Recommender Systems
 Recommender systems: Personalization, making product
recommendations that are likely to be of interest to a user
 Approaches: Content-based, collaborative, or their hybrid
 Content-based: Recommends items that are similar to items the
user preferred or queried in the past
 Collaborative filtering: Consider a user's social environment,
opinions of other customers who have similar tastes or preferences
 Data mining and recommender systems
 Users C × items S: extrapolate from known ratings to
predict unknown user-item combinations
 Memory-based method often uses k-nearest neighbor approach
 Model-based method uses a collection of ratings to learn a model
(e.g., probabilistic models, clustering, Bayesian networks, etc.)
 Hybrid approaches integrate both to improve performance (e.g.,
using ensemble)
30
Summary
 We present a high-level overview of mining complex data types
 Statistical data mining methods, such as regression, generalized linear
models, analysis of variance, etc., are popularly adopted
 Researchers also try to build theoretical foundations for data mining
 Visual/audio data mining has been popular and effective
 Application-based mining integrates domain-specific knowledge with
data analysis techniques and provides mission-specific solutions
 Ubiquitous data mining and invisible data mining are penetrating our
daily lives
 Privacy and data security are important issues in data mining, and
privacy-preserving data mining has been developed recently
 Our discussion on trends in data mining shows that data mining is a
promising, young field, with great, strategic importance
31
References and Further Reading
 The book lists many references for further reading; here we only list a few books

 E. Alpaydin. Introduction to Machine Learning, 2nd ed., MIT Press, 2011


 S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data.
Morgan Kaufmann, 2002
 R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed., Wiley-Interscience, 2000
 D. Easley and J. Kleinberg. Networks, Crowds, and Markets: Reasoning about a Highly
Connected World. Cambridge University Press, 2010.
 U. Fayyad, G. Grinstein, and A. Wierse (eds.), Information Visualization in Data Mining and
Knowledge Discovery, Morgan Kaufmann, 2001
 J. Han, M. Kamber, J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed.
2011
 T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, 2nd ed., Springer-Verlag, 2009
 D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press,
2009.
 B. Liu. Web Data Mining, Springer 2006.
 T. M. Mitchell. Machine Learning, McGraw Hill, 1997
 M. Newman. Networks: An Introduction. Oxford University Press, 2010.
 P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
 I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with
Java Implementations, Morgan Kaufmann, 2nd ed. 2005
32
