0% found this document useful (0 votes)
1 views41 pages

Unit 5 - Lecture 2 - Statistical - Methods - Mining - Techniques

Chapter 12 discusses outlier analysis, including definitions, challenges, and various detection methods such as supervised, unsupervised, and semi-supervised approaches. It highlights techniques like clustering and statistical methods for identifying outliers in datasets, with applications in fraud detection and other fields. The chapter also covers specific methods like Grubb’s Test for univariate outliers and Mahalanobis distance for multivariate outliers.

Uploaded by

julybabies2804
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
1 views41 pages

Unit 5 - Lecture 2 - Statistical - Methods - Mining - Techniques

Chapter 12 discusses outlier analysis, including definitions, challenges, and various detection methods such as supervised, unsupervised, and semi-supervised approaches. It highlights techniques like clustering and statistical methods for identifying outliers in datasets, with applications in fraud detection and other fields. The chapter also covers specific methods like Grubb’s Test for univariate outliers and Mahalanobis distance for multivariate outliers.

Uploaded by

julybabies2804
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 41

1

Data Mining & Analytics


— Unit 5 —
— (Chapter 12) —

Outliers: Introduction, Challenges &


Detection Methods

2
Chapter 12. Outlier Analysis

 Outlier and Outlier Analysis


 Outlier Detection Methods
 Statistical Approaches
 Proximity-Base Approaches
 Clustering-Base Approaches
 Classification Approaches

3
Outlier Detection Methods

4
Outlier Detection Methods- Supervised
Methods

5
Outlier Detection Methods- Supervised
Methods
• Supervised methods model data normality and
• abnormality.
Domain experts examine and label a sample of the
underlying data.
• Outlier detection can then be modeled as a
classification problem.

• The sample is used for training and testing.


• In some applications, the experts may label just
the normal objects, and any other objects not
matching the model
of normal objects are reported as outliers.
• Other methods model the outliers and treat
objects not matching the model of outliers as
6
Outlier Detection Methods-
UnSupervised Methods
 In some application scenarios, objects labeled as “normal” or
“outlier” are not available. Thus, an unsupervised learning
method has to be used.

 Many clustering methods can be adapted to act as


unsupervised outlier detection methods.
 The central idea is to find clusters first, and then the data
objects not belonging to any cluster are detected as outliers.
 However, such methods suffer from two issues.
 First, a data object not belonging to any cluster may be noise
instead of an outlier.
 Second, it is often costly to find clusters first and then find
outliers.

7
Outlier Detection Methods-
UnSupervised Methods
Clustering (DBSCAN, K-Means):
• DBSCAN (Density-Based Clustering): Finds points

that do not belong to any dense cluster.


• K-Means: Points far from cluster centroids can be

outliers.

8
Outlier Detection Methods-
UnSupervised Methods
Use Case : Detecting Fraudulent Credit Card
Transactions That Don’t Fit Into Usual Spending Clusters

Approach:
Clustering-Based Outlier Detection (Unsupervised)

When labeled fraudulent transactions are not available, we can use


clustering to find usual spending patterns (clusters) and identify
transactions that do not fit well into any cluster as potential fraud.

9
Outlier Detection Methods-
UnSupervised Methods
Clustering-Based Outlier Detection (Unsupervised)
1. Collect and Prepare Data
Include features such as:
•Transaction amount

•Time of transaction

•Merchant category
•Location

•Frequency of spending

10
Outlier Detection Methods- UnSupervised Methods
2. Apply Clustering Algorithm
Use K-Means or DBSCAN to group similar transactions.

•K-Means:
•Find the distance of each transaction to its assigned

cluster center.
•Transactions with large distances from the cluster

centroid may be fraudulent.


•DBSCAN:
•Identifies dense clusters of normal behavior.
•Any transaction not assigned to a cluster (labeled

as noise) is a potential outlier (possible fraud).

11
Outlier Detection Methods- UnSupervised Methods

Example Scenario

12
Outlier Detection Methods- Semi-
Supervised Methods
• Only normal data is available during training, and the model identifies
deviations.
• Examples: Manufacturing defects, disease detection.
Techniques:
1. One-Class SVM: Trains only on normal data and flags anything that

deviates.
1. Example: Training on non-fraudulent credit card transactions and

flagging any unusual activity.


2. Autoencoders (Neural Networks): Trains a model to reconstruct

normal data; outliers have high reconstruction error.


1. Example: Identifying faults in industrial machines by training on

normal vibration patterns.

13
Outlier Detection Methods- Semi-
Supervised Methods
 In many applications, although obtaining some labeled
examples is feasible, the number of such labeled examples is
often small. We may encounter cases where only a small set
of the normal and/or outlier objects are labeled, but most of
the data are unlabeled.
 Semi-supervised outlier detection methods were developed
to tackle such scenarios.
 when some labeled normal objects are available, we can use
them, together with unlabeled objects that are close by, to
train a model for normal objects.
 The model of normal objects then can be used to detect
outliers—those objects not fitting the model of normal
objects are classified as outliers.

14
Outlier Detection Methods

 (1) Statistical Methods


 (2) Proximity-Based Methods
 (3) Clustering-Based Methods

15
Outlier Detection (2): Proximity-Based Methods

16
Outlier Detection (3): Clustering-Based Methods

17
Chapter 12. Outlier Analysis
 Outlier and Outlier Analysis
 Outlier Detection Methods
 Statistical Approaches
 Proximity-Base Approaches
 Clustering-Base Approaches
 Classification Approaches

18
Outlier Detection (1): Statistical Methods

19
Outlier Detection (1): Statistical Methods

 If the heights of adult males in a city follow a


normal distribution with a mean of 170 cm and
most people are between 160–180 cm, a man
who is 210 cm tall is very rare.
 The model says, "This doesn’t fit the expected
pattern," so it's flagged.

20
Statistical Approaches

21
Parametric Methods I: Detection Univariate
Outliers Based on Normal Distribution

22
Parametric Methods I: The Grubb’s Test

Detects one outlier at a time in a normally distributed dataset.

23
The Grubb’s Test (contd….)

where is the value taken by a t-distribution at a


significance level of α/(2N), and N is the #of objects in the data
set
24
Parametric Methods II: Detection of
Multivariate Outliers

25
Parametric Methods II: Detection of
Multivariate Outliers(contd)

 1.Calculate the mean vector from the multivariate data set.


 2.For each object o calculate MDist(o, ō), the Mahalanobis distance from o to ō.
 3. Detect outliers in the transformed univariate dataset , {MDist(o, ō) = (o – ō )| o € D}.
 4.If MDist(o, ō)is determined to be an outlier , then o is regarded as an outlier as well.

26
Parametric Methods II: Detection of
Multivariate Outliers(contd)

 is the value of o on the ith dimension.


 where Ei is the mean of the i-dimension among all objects, and n is the dimensionality
 If χ2 –statistic is large, then object oi is an outlier

27
Parametric Methods III: Using Mixture of
Parametric Distributions

28
Non-Parametric Methods: Detection Using Histogram

29
Other Methodologies of Data Mining

 Statistical Data Mining

30
Major Statistical Data Mining Methods

31
Statistical Data Mining (1)

32
33
Scientific and Statistical Data Mining (2)

34
Generalized linear models
Linear models

35
Mixed-effect models
• When there are multiple levels, such as patients seen by the
same doctor, the variability in the outcome can be thought of
as being either within group or between group.
• Patient level observations are not independent, as within a
given doctor patients are more similar.
• Units sampled at the highest level (in our example, doctors)
are independent. The figure below shows a sample where the
dots are patients within doctors, the larger circles.

36
Scientific and Statistical Data Mining (3)

37
Statistical Data Mining (4)

www.spss.com/datamine/factor.htm
38
Discriminant analysis

39
Statistical Data Mining (5)

40
Thank You!!!

41

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy