Anomaly-Fraud-Detection

Data Mining IOE - Chapter 6 Notes

Anomaly Detection

Anomaly/Outlier Detection
✔ What are anomalies/outliers?
– The set of data points that are considerably different from the remainder of the data

✔ The natural implication is that anomalies are relatively rare
– One-in-a-thousand events occur often if you have lots of data
– Context is important, e.g., freezing temperatures in Shrawan

✔ Can be important or a nuisance
– Unusually high blood pressure
– A 200-pound, 2-year-old
Importance of Anomaly Detection

Ozone Depletion History

✔ In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels

✔ Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?

✔ The ozone concentrations recorded by the satellite were so low that they were being treated as outliers by a computer program and discarded!

Source: http://www.epa.gov/ozone/science/hole/size.html
Causes of Anomalies

✔ Data from different classes
– Measuring the weights of oranges, but a few grapes are mixed in

✔ Natural variation
– Unusually tall people

✔ Data errors
– 200-pound, 2-year-old
Distinction Between Noise and Anomalies

Noise doesn't necessarily produce unusual values or objects

Noise is not interesting in itself, whereas anomalies often are

Noise and anomalies are related but distinct concepts
Model-based vs Model-free

Model-based Approaches

The model can be parametric or non-parametric

Anomalies are those points that don't fit the model well

Anomalies are those points that distort the model

Model-free Approaches

Anomalies are identified directly from the data without building a model

Often the underlying assumption is that most of the points in the data are normal
General Issues: Label vs Score

Some anomaly detection techniques provide only a binary categorization

Other approaches measure the degree to which an object is an anomaly

This allows objects to be ranked

Scores can also have associated meaning (e.g., statistical significance)
Anomaly Detection Techniques

✔ Statistical Approaches

✔ Proximity-based
– Anomalies are points far away from other points

✔ Clustering-based
– Points far away from cluster centers are outliers
– Small clusters are outliers

✔ Reconstruction Based
Statistical Approaches

✔ Probabilistic definition of an outlier: An outlier is an object that has a low probability with respect to a probability distribution model of the data.
✔ Usually assume a parametric model describing the distribution of the data (e.g., normal distribution)
✔ Apply a statistical test that depends on
– Data distribution
– Parameters of distribution (e.g., mean, variance)
– Number of expected outliers (confidence limit)
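The statistical approach above can be sketched with a simple Gaussian z-score test: fit a normal distribution to the data and flag points far from the mean. The data, threshold, and function name below are illustrative assumptions, not part of the notes.

```python
# A minimal sketch of a Gaussian outlier test: flag points whose
# |z-score| exceeds a chosen threshold under a fitted normal distribution.
import statistics

def gaussian_outliers(values, z_thresh=3.0):
    """Return the values whose |z-score| exceeds z_thresh."""
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)          # sample standard deviation
    return [v for v in values if abs(v - mu) / sigma > z_thresh]

data = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 9.7, 10.1, 35.0]  # one gross error
print(gaussian_outliers(data, z_thresh=2.0))
```

Note that the outlier itself inflates the fitted mean and standard deviation, illustrating the weakness discussed later: anomalies can distort the parameters of the distribution.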
Normal Distributions

[Figure: probability density of a one-dimensional Gaussian and a two-dimensional Gaussian, plotted over x and y]
Strengths/Weaknesses of Statistical Approaches

Firm mathematical foundation

Can be very efficient

Good results if distribution is known

In many cases, data distribution may not be known

For high dimensional data, it may be difficult to estimate the true distribution

Anomalies can distort the parameters of the distribution
Distance-Based Approaches

One of the simplest ways to define a proximity-based anomaly score of a data instance x is to use the distance to its kth nearest neighbor, dist(x, k).

If an instance x has many other instances located close to it (characteristic of the normal class), it will have a low value of dist(x, k).

On the other hand, an anomalous instance x will be quite distant from its k neighboring instances and would thus have a high value of dist(x, k).
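The dist(x, k) score can be sketched in a few lines; this is a brute-force O(n²) version, and the toy points are illustrative assumptions.

```python
# A minimal sketch of the distance-based anomaly score dist(x, k):
# the distance from each instance to its kth nearest neighbor.
import math

def knn_distance_scores(points, k):
    """Anomaly score of each point = distance to its kth nearest neighbor."""
    scores = []
    for x in points:
        dists = sorted(math.dist(x, y) for y in points if y is not x)
        scores.append(dists[k - 1])          # kth-nearest-neighbor distance
    return scores

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]  # (10, 10) is far away
scores = knn_distance_scores(points, k=2)
print(scores)
```

The isolated point (10, 10) receives by far the highest score, while the points in the small cluster all score around 1.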
One Nearest Neighbor - One Outlier

Figure shows a set of points in a two-dimensional space that have been shaded according to their distance to the kth nearest neighbor, dist(x, k) (where k = 5).

Note that point D has been correctly assigned a high anomaly score, as it is located far away from other instances.

[Figure: points shaded by outlier score; the color bar ranges from about 0.4 to 2]
One Nearest Neighbor - Two Outliers

Note that dist(x, k) can be quite sensitive to the value of k. If k is too small, e.g., 1, then a small number of outliers located close to each other can show a low anomaly score.

For example, Figure shows anomaly scores using k = 1 for a set of normal points and two outliers that are located close to each other (shading reflects anomaly scores). Note that both D and its neighbor have a low anomaly score.

[Figure: points shaded by outlier score; the color bar ranges from about 0.05 to 0.55]
Five Nearest Neighbors - Small Cluster

If k is too large, then it is possible for all objects in a cluster that has fewer than k objects to become anomalies.

For example, Figure below shows a data set that has a small cluster of size 5 and a larger cluster of size 30.

For k = 5, the anomaly score of all points in the smaller cluster is very high.

[Figure: points shaded by outlier score; the color bar ranges from about 0.4 to 2]
Five Nearest Neighbors - Differing Density

Anomaly score based on the distance to the fifth nearest neighbor, when there are clusters of varying densities.

[Figure: points C and D shaded by outlier score; the color bar ranges from about 0.2 to 1.8]
Strengths/Weaknesses of Distance-Based Approaches

Simple

Expensive – O(n²)

Sensitive to parameters

Sensitive to variations in density

Distance becomes less meaningful in high-dimensional space
Density-Based Approaches

✔ Density-based Outlier: The outlier score of an object is the inverse of the density around the object.
– Can be defined in terms of the k nearest neighbors
– One definition: Inverse of distance to kth neighbor
– Another definition: Inverse of the average distance to k neighbors
– DBSCAN definition

✔ If there are regions of different density, this approach can have problems
Density-Based Approaches

Consider the density of a point relative to that of its k nearest neighbors.

We define the following measures of density, based on the two distance measures:

density(x, k) = 1 / dist(x, k)

avg.density(x, k) = 1 / ((1/k) Σᵢ dist(x, yᵢ)), where y₁, …, yₖ are the k nearest neighbors of x
Relative Density Outlier Scores

In scenarios where the data contains regions of varying densities, such methods would not be able to correctly identify anomalies, as the notion of a normal locality would change across regions.

[Figure: a compact cluster containing point C and a loose cluster containing points A and D, shaded by outlier score; C scores 6.85, D scores 1.40, A scores 1.33]
Relative Density Outlier Scores

Assigning anomaly scores to points according to dist(x, k) with k = 5 correctly identifies point C to be an anomaly, but shows a low score for point D.

In fact, the score for D is much lower than many points that are part of the loose cluster.

[Figure: the same data set, shaded by outlier score; C scores 6.85, D scores 1.40, A scores 1.33]
Relative Density Outlier Scores

To correctly identify anomalies in such data sets, we need a notion of density that is relative to the densities of neighboring instances.

For example, point D in Figure has a higher absolute density than point A, but its density is lower relative to its nearest neighbors.

[Figure: the same data set, shaded by outlier score]
Relative Density Outlier Scores

There are many ways to define the relative density of an instance.

For a point x, one approach is to compute the ratio of the average density of its k nearest neighbors, y₁ to yₖ, to the density of x, as follows:

relative density(x, k) = ((1/k) Σᵢ density(yᵢ, k)) / density(x, k)

[Figure: the same data set, shaded by outlier score]
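The relative-density score can be sketched directly from these definitions, with density(x, k) taken as the inverse of the kth-nearest-neighbor distance. The toy data (a tight cluster, a loose cluster, and one point just outside the tight cluster) is an illustrative assumption.

```python
# A minimal sketch of the relative-density outlier score:
# relative density(x, k) = average density of x's k nearest neighbors
#                          divided by the density of x.
import math

def knn(points, x, k):
    """Return the k nearest neighbors of x (excluding x itself)."""
    return sorted((p for p in points if p is not x),
                  key=lambda p: math.dist(x, p))[:k]

def density(points, x, k):
    return 1.0 / math.dist(x, knn(points, x, k)[-1])   # inverse of dist(x, k)

def relative_density(points, x, k):
    neighbors = knn(points, x, k)
    avg_neighbor_density = sum(density(points, y, k) for y in neighbors) / k
    return avg_neighbor_density / density(points, x, k)

tight = [(0, 0), (0, 0.1), (0.1, 0), (0.1, 0.1)]
loose = [(5, 5), (5, 7), (7, 5), (7, 7)]
points = tight + loose + [(0.5, 0.5)]       # just outside the tight cluster
scores = {p: relative_density(points, p, k=3) for p in points}
print(scores[(0.5, 0.5)])
```

Points inside either cluster score near 1 (their density matches their neighbors'), while the point at (0.5, 0.5) scores well above 1 because its neighbors in the tight cluster are much denser than it is.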
Relative Density Outlier Scores

The relative density of a point is high when the average density of points in its neighborhood is significantly higher than the density of the point.

[Figure: the same data set, shaded by outlier score]
Relative Density-based: LOF approach

Note that by replacing density(x,k) with avg.density(x,k) in the
above equation, we can obtain a more robust measure of relative
density.

The above approach is similar to that used by the Local Outlier
Factor (LOF) score, which is a widely-used measure for detecting
anomalies using relative density.

Relative Density-based: LOF approach

✔ For each point, compute the density of its local neighborhood
✔ Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the densities of its nearest neighbors to the density of p
✔ Outliers are points with the largest LOF value
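A sketch of LOF in practice, using scikit-learn's `LocalOutlierFactor` (the library and the toy data are assumptions, not part of the notes). Scores near 1 mean a point's density matches its neighbors'; much larger values indicate outliers.

```python
# LOF via scikit-learn: one obvious outlier appended to a Gaussian blob.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
X = np.vstack([normal, [[6.0, 6.0]]])          # one obvious outlier

lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)                    # -1 = outlier, +1 = inlier
lof_scores = -lof.negative_outlier_factor_     # larger = more anomalous
print(labels[-1], lof_scores[-1])
```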
Strengths/Weaknesses of Density-Based Approaches

Simple

Expensive – O(n²)

Sensitive to parameters

Density becomes less meaningful in high-dimensional space
Clustering-Based Approaches

✔ An object is a cluster-based outlier if it does not strongly belong to any cluster
– For prototype-based clusters, an object is an outlier if it is not close enough to a cluster center
 Outliers can impact the clustering produced
– For density-based clusters, an object is an outlier if its density is too low
 Can't distinguish between noise and outliers
– For graph-based clusters, an object is an outlier if it is not well connected
Distance of Points from Closest Centroids

Here, using the K-means algorithm, the anomaly score of a point is computed as the point's distance from its closest centroid.

[Figure: points shaded by their distance to the closest centroid]
Relative Distance of Points from Closest Centroid

Here, using the K-means algorithm, the anomaly score of a point is computed as the point's relative distance from its closest centroid, where the relative distance is the ratio of the point's distance from the centroid to the median distance of all points in the cluster from the centroid.

This approach is used to adjust for the large difference in density between compact and loose clusters.

[Figure: points shaded by their relative distance to the closest centroid]
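Both clustering-based scores can be sketched with scikit-learn's `KMeans` (an assumption; any K-means implementation works). The toy data, a compact cluster and a loose cluster plus one borderline point, is also an assumption.

```python
# Absolute and relative distance-to-centroid anomaly scores with K-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
compact = rng.normal(0, 0.2, size=(50, 2))
loose = rng.normal(5, 1.5, size=(50, 2))
X = np.vstack([compact, loose, [[0, 2.5]]])   # a point at the compact cluster's edge

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# km.transform gives distances to all centroids; pick each point's own cluster.
dist_to_own = km.transform(X)[np.arange(len(X)), km.labels_]

# Relative distance: divide by the median distance within the assigned cluster.
medians = np.array([np.median(dist_to_own[km.labels_ == c]) for c in range(2)])
relative = dist_to_own / medians[km.labels_]
print(dist_to_own[-1], relative[-1])
```

The edge point's absolute distance is comparable to ordinary points of the loose cluster, but its relative distance is far larger, which is exactly the adjustment the relative score provides.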
Strengths/Weaknesses of Clustering-Based Approaches

Simple

Many clustering techniques can be used

Can be difficult to decide on a clustering technique

Can be difficult to decide on the number of clusters

Outliers can distort the clusters
Reconstruction-Based Approaches

✔ Based on the assumption that there are patterns in the distribution of the normal class that can be captured using lower-dimensional representations
✔ Reduce the data to a lower-dimensional representation
– E.g., use Principal Component Analysis (PCA) or autoencoders
✔ Measure the reconstruction error for each object
– The difference between the original and the reduced-dimensionality version
Reconstruction Error

Let x be the original data object

Find the representation of the object in a lower dimensional space

Project the object back to the original space; call this object x̂

Reconstruction error(x) = ‖x − x̂‖

Objects with large reconstruction errors are anomalies
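The reconstruction-error steps above can be sketched with PCA via NumPy's SVD: project each object onto the top principal component, project back, and score by ‖x − x̂‖. The near-linear toy data and the off-line outlier are illustrative assumptions.

```python
# PCA reconstruction error: points far from the principal subspace score high.
import numpy as np

rng = np.random.default_rng(2)
t = rng.normal(size=100)
X = np.column_stack([t, 2 * t + rng.normal(scale=0.1, size=100)])  # near a line
X = np.vstack([X, [[2.0, -4.0]]])             # off the line: a reconstruction outlier

mu = X.mean(axis=0)
Xc = X - mu
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V1 = Vt[:1]                                   # top principal component

X_hat = (Xc @ V1.T) @ V1 + mu                 # project down, then back
errors = np.linalg.norm(X - X_hat, axis=1)    # reconstruction error per object
print(errors[-1])
```

The points near the line reconstruct almost perfectly, while the off-line point has by far the largest error.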
Reconstruction of Two-dimensional Data

[Figure: two-dimensional data reduced to one dimension and projected back, illustrating the reconstruction]
Basic Architecture of an Autoencoder

✔ An autoencoder is a multi-layer neural network
✔ The number of input and output neurons is equal to the number of original attributes

[Figure: autoencoder architecture]
Strengths and Weaknesses

Does not require assumptions about the distribution of the normal class

Can use many dimensionality reduction approaches

The reconstruction error is computed in the original space

This can be a problem if dimensionality is high
One Class SVM

Uses an SVM approach to classify normal objects

Uses the given data to construct such a model

This data may contain outliers

But the data does not contain class labels

How to build a classifier given one class?
How Does One-Class SVM Work?

Uses the "origin" trick

Uses a Gaussian kernel

Every point is mapped to a unit hypersphere

Every point lies in the same orthant (quadrant)

Aims to maximize the distance of the separating plane from the origin
Two-dimensional One Class SVM

[Figure: one-class SVM decision boundary for two-dimensional data]
Equations for One-Class SVM

✔ Equation of the hyperplane: w · φ(x) − ρ = 0
✔ φ is the mapping to the high dimensional space
✔ w is the weight vector
✔ ν is the fraction of outliers
✔ Optimization condition is the following (for m training points with slack variables ξᵢ):
minimize (1/2)‖w‖² + (1/(νm)) Σᵢ ξᵢ − ρ, subject to w · φ(xᵢ) ≥ ρ − ξᵢ, ξᵢ ≥ 0
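A sketch of this in practice, using scikit-learn's `OneClassSVM` (an assumption that scikit-learn is available; the training data is also an assumption). The `nu` parameter is the ν above, an upper bound on the fraction of training points treated as outliers, and the RBF kernel is the Gaussian kernel mentioned earlier.

```python
# One-class SVM: fit on (mostly) normal data, then classify new points.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
X_train = rng.normal(0, 1, size=(200, 2))     # unlabeled, mostly normal data

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

X_test = np.array([[0.0, 0.0], [8.0, 8.0]])   # one typical point, one far away
preds = ocsvm.predict(X_test)                 # +1 = normal, -1 = outlier
print(preds)
```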
Finding Outliers with a One-Class SVM

✔ Decision boundaries for two different values of ν (figures omitted)
Strengths and Weaknesses

Strong theoretical foundation

Choice of ν is difficult

Computationally expensive
Information Theoretic Approaches

✔ Key idea is to measure how much the information content of the data decreases when you delete an observation
✔ Anomalies should show a higher gain
✔ Normal points should show less gain
Information Theoretic Example

Survey of height and weight for 100 participants (table omitted)

Eliminating the last group gives a gain of 2.08 − 1.89 = 0.19
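The gain computation can be sketched with entropy in bits: measure the entropy of the group distribution before and after deleting a group. The counts below are hypothetical, not the actual survey from the slide.

```python
# Entropy-based "gain": the drop in entropy after deleting a rare group.
import math

def entropy(counts):
    """Shannon entropy (bits) of a categorical distribution given raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Hypothetical height-weight groups: four common combinations and one rare one.
counts = [30, 30, 20, 15, 5]
gain = entropy(counts) - entropy(counts[:-1])   # entropy drop after deleting the rare group
print(round(gain, 2))
```

Deleting the rare, unusual group lowers the entropy noticeably; deleting one of the common groups would change it far less, which is why anomalies show higher gain.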
Strengths and Weaknesses

Solid theoretical foundation

Theoretically applicable to all kinds of data

Difficult and computationally expensive to implement in practice
Evaluation of Anomaly Detection

✔ If class labels are present, then use standard evaluation approaches for a rare class, such as precision, recall, or false positive rate
– FPR is also known as the false alarm rate

✔ For unsupervised anomaly detection, use measures provided by the anomaly detection method
– E.g., reconstruction error or gain

✔ Can also look at histograms of anomaly scores
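The label-based metrics above can be sketched from raw confusion counts; the labels and predictions below are illustrative assumptions.

```python
# Precision, recall, and false positive rate for a rare anomaly class.
def rare_class_metrics(y_true, y_pred):
    """y_true/y_pred use 1 for anomaly, 0 for normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {"precision": tp / (tp + fp),
            "recall": tp / (tp + fn),
            "fpr": fp / (fp + tn)}            # false alarm rate

y_true = [0] * 95 + [1] * 5                   # 5% anomalies
y_pred = [0] * 93 + [1, 1] + [1, 1, 1, 0, 0]  # 2 false alarms, 3 of 5 caught
metrics = rare_class_metrics(y_true, y_pred)
print(metrics)
```

Note that plain accuracy would be 95% here despite missing 2 of the 5 anomalies, which is why rare-class measures are preferred.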
Distribution of Anomaly Scores

✔ Anomaly scores should show a tail

[Figure: histogram of anomaly scores with a tail of high scores]
