Unit 5 - Lecture 2 - Statistical - Methods - Mining - Techniques
Chapter 12. Outlier Analysis
Outlier Detection Methods
Outlier Detection Methods- Supervised Methods
• Supervised methods model data normality and abnormality.
• Domain experts examine and label a sample of the
underlying data.
• Outlier detection can then be modeled as a
classification problem.
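A minimal sketch of the classification view, assuming a tiny expert-labeled sample. The data, the feature choice (amount, hour of day), and the helper `knn_predict` are hypothetical; a k-nearest-neighbour vote stands in for whatever classifier would actually be used.

```python
import math

def knn_predict(labeled, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled points."""
    nearest = sorted(labeled, key=lambda item: math.dist(item[0], query))[:k]
    votes = sum(1 for _, label in nearest if label == "outlier")
    return "outlier" if votes > k / 2 else "normal"

# Hypothetical expert-labeled sample: (amount, hour of day) -> label.
labeled_sample = [
    ((20.0, 10.0), "normal"),
    ((35.0, 12.0), "normal"),
    ((25.0, 14.0), "normal"),
    ((40.0, 18.0), "normal"),
    ((900.0, 3.0), "outlier"),
    ((950.0, 2.0), "outlier"),
    ((870.0, 4.0), "outlier"),
]

print(knn_predict(labeled_sample, (30.0, 13.0)))
print(knn_predict(labeled_sample, (910.0, 3.0)))
```

In real data the outlier class is usually far smaller than the normal class, so the class imbalance would need to be handled before training.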
Outlier Detection Methods- Unsupervised Methods
Clustering (DBSCAN, K-Means):
• DBSCAN (Density-Based Clustering): Finds points in low-density
regions that belong to no cluster and treats them as outliers.
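A compact sketch of DBSCAN in plain Python, to show how points end up labeled as noise. The points and the parameter values `eps=0.5` and `min_pts=3` are assumptions chosen for illustration, and the quadratic neighbour search is only suitable for small samples.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels

# Hypothetical 2-D points: two dense groups and one isolated point.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1),
          (9.0, 0.5)]
labels = dbscan(points, eps=0.5, min_pts=3)
outliers = [p for p, label in zip(points, labels) if label == -1]
print(labels, outliers)
```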
Use Case: Detecting fraudulent credit card transactions that don’t fit into usual spending clusters
Approach: Clustering-Based Outlier Detection (Unsupervised)
1. Collect and Prepare Data
Include features such as:
• Transaction amount
• Time of transaction
• Merchant category
• Location
• Frequency of spending
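One possible way to turn such features into numeric vectors that a clustering algorithm can use. The field names (`km_from_home` as a stand-in for location, `tx_per_day` for frequency) and all values are hypothetical; the sketch one-hot encodes the merchant category and z-scores every column.

```python
import statistics

# Hypothetical transactions; field names and values are made up for illustration.
transactions = [
    {"amount": 25.0, "hour": 9, "merchant": "grocery", "km_from_home": 2.0, "tx_per_day": 3},
    {"amount": 40.0, "hour": 18, "merchant": "fuel", "km_from_home": 5.0, "tx_per_day": 2},
    {"amount": 30.0, "hour": 12, "merchant": "grocery", "km_from_home": 1.5, "tx_per_day": 4},
    {"amount": 950.0, "hour": 3, "merchant": "electronics", "km_from_home": 800.0, "tx_per_day": 9},
]

MERCHANTS = sorted({t["merchant"] for t in transactions})

def to_vector(t):
    """Numeric features plus a one-hot encoding of the merchant category."""
    one_hot = [1.0 if t["merchant"] == m else 0.0 for m in MERCHANTS]
    return [t["amount"], float(t["hour"]), t["km_from_home"], float(t["tx_per_day"])] + one_hot

def standardize(rows):
    """Z-score each column so that no single feature dominates distances."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]  # guard constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

features = standardize([to_vector(t) for t in transactions])
print(len(features), len(features[0]))
```

Standardizing matters here because distance-based clustering would otherwise be driven almost entirely by the feature with the largest numeric range.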
2. Apply Clustering Algorithm
Use K-Means or DBSCAN to group similar transactions.
• K-Means:
• Find the distance of each transaction to its assigned
cluster center.
• Transactions with large distances from the cluster
center are flagged as potential outliers.
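The K-Means step can be sketched as follows. This is an illustrative sketch, not a production implementation: the transactions (amount, hour) are made up, the first `k` points seed the centers, and "large distance" is assumed to mean more than two standard deviations above the mean distance.

```python
import math
import statistics

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centers (a simplification)."""
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        assign = [nearest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers

# Hypothetical transactions as (amount, hour of day).
points = [(20.0, 9.0), (210.0, 20.0), (25.0, 10.0), (22.0, 11.0), (200.0, 19.0),
          (205.0, 21.0), (30.0, 9.0), (195.0, 20.0), (120.0, 3.0)]

centers = kmeans(points, k=2)
# Distance of each transaction to its assigned (nearest) cluster center.
distances = [math.dist(p, centers[nearest(p, centers)]) for p in points]
# Assumed rule for "large": more than 2 standard deviations above the mean distance.
threshold = statistics.fmean(distances) + 2 * statistics.pstdev(distances)
outliers = [p for p, d in zip(points, distances) if d > threshold]
print(outliers)
```

Note that K-Means assigns every point to some cluster, so an outlier still pulls its cluster center toward itself; the raw features are also left unscaled here to keep the sketch short.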
Example Scenario
Outlier Detection Methods- Semi-Supervised Methods
• Only normal data is available during training, and the model identifies
deviations.
• Examples: Manufacturing defects, disease detection.
Techniques:
1. One-Class SVM: Trains only on normal data and flags anything that
deviates.
• Example: Training on non-fraudulent credit card transactions and
flagging transactions that deviate from them.
In many applications, although obtaining some labeled
examples is feasible, the number of such labeled examples is
often small. We may encounter cases where only a small set
of the normal and/or outlier objects are labeled, but most of
the data are unlabeled.
Semi-supervised outlier detection methods were developed
to tackle such scenarios.
When some labeled normal objects are available, we can use
them, together with unlabeled objects that are close by, to
train a model for normal objects.
The model of normal objects can then be used to detect
outliers: objects that do not fit the model of normal
objects are classified as outliers.
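A rough sketch of this idea, under simplifying assumptions: unlabeled objects within a chosen `radius` of a labeled normal object are treated as normal, and the "model of normal objects" is just a center plus the largest training distance from it. The data, `radius`, and `slack` values are hypothetical.

```python
import math

def fit_normal_model(labeled_normal, unlabeled, radius):
    """Build a simple model of normal objects from labeled and nearby unlabeled data."""
    # Treat unlabeled objects within `radius` of a labeled normal object as normal too.
    nearby = [u for u in unlabeled
              if any(math.dist(u, n) <= radius for n in labeled_normal)]
    training = labeled_normal + nearby
    center = tuple(sum(col) / len(training) for col in zip(*training))
    spread = max(math.dist(p, center) for p in training)
    return center, spread

def is_outlier(obj, model, slack=1.5):
    """An object does not fit the model if it lies well outside the training spread."""
    center, spread = model
    return math.dist(obj, center) > slack * spread

# Hypothetical data: two labeled normal objects and five unlabeled ones.
labeled_normal = [(1.0, 1.0), (1.2, 0.8)]
unlabeled = [(0.9, 1.3), (1.4, 1.1), (1.1, 0.6), (4.0, 4.2), (0.8, 0.9)]

model = fit_normal_model(labeled_normal, unlabeled, radius=0.5)
flagged = [u for u in unlabeled if is_outlier(u, model)]
print(flagged)
```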
Outlier Detection Methods
Outlier Detection (2): Proximity-Based Methods
Outlier Detection (3): Clustering-Based Methods
Chapter 12. Outlier Analysis
Outlier and Outlier Analysis
Outlier Detection Methods
Statistical Approaches
Proximity-Based Approaches
Clustering-Based Approaches
Classification Approaches
Outlier Detection (1): Statistical Methods
Statistical Approaches
Parametric Methods I: Detection of Univariate Outliers Based on Normal Distribution
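A small sketch of the usual normal-distribution rule of thumb: estimate the mean and standard deviation from the data and flag values more than 3 standard deviations from the mean. The readings and the cutoff of 3 are assumptions for illustration.

```python
import statistics

# Hypothetical univariate readings: twenty values near 50 and one at 80.
values = [50.0, 51.0, 49.0, 52.0, 48.0, 50.0, 51.0, 49.0, 50.0, 52.0,
          48.0, 51.0, 49.0, 50.0, 50.0, 51.0, 49.0, 50.0, 52.0, 48.0, 80.0]

mu = statistics.fmean(values)
sigma = statistics.pstdev(values)  # maximum-likelihood estimate under normality

def z_score(x):
    """Number of standard deviations that x lies from the estimated mean."""
    return (x - mu) / sigma

outliers = [x for x in values if abs(z_score(x)) > 3]
print(outliers)
```

With very small samples a single extreme value inflates the estimated standard deviation, so it may not exceed the 3-sigma cutoff.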
Parametric Methods I: Grubbs' Test
Parametric Methods II: Detection of Multivariate Outliers
Parametric Methods III: Using Mixture of Parametric Distributions
Non-Parametric Methods: Detection Using Histogram
Other Methodologies of Data Mining
Major Statistical Data Mining Methods
Statistical Data Mining (1)
Scientific and Statistical Data Mining (2)
Generalized linear models
Linear models
Mixed-effect models
• When there are multiple levels, such as patients seen by the
same doctor, the variability in the outcome can be thought of
as being either within group or between group.
• Patient level observations are not independent, as within a
given doctor patients are more similar.
• Units sampled at the highest level (in our example, doctors)
are independent. The figure below shows a sample where the
dots are patients within doctors, the larger circles.
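The within-group versus between-group split can be illustrated with a purely descriptive calculation. This is not a fitted mixed-effect model; the outcome scores and doctor names are hypothetical, and groups are assumed to be of equal size so that the two parts add up to the total variance.

```python
import statistics

# Hypothetical outcome scores for patients, grouped by doctor.
patients_by_doctor = {
    "doctor_a": [5.0, 6.0, 7.0],
    "doctor_b": [10.0, 11.0, 12.0],
    "doctor_c": [1.0, 2.0, 3.0],
}

def variance_components(groups):
    """Split variability into within-group and between-group parts (descriptive only)."""
    within = statistics.fmean([statistics.pvariance(v) for v in groups.values()])
    between = statistics.pvariance([statistics.fmean(v) for v in groups.values()])
    return within, between

within, between = variance_components(patients_by_doctor)
# Share of total variability attributable to differences between doctors.
between_share = between / (between + within)
print(within, between, between_share)
```

A large between-group share indicates that patients of the same doctor resemble each other, which is the non-independence that mixed-effect models are meant to account for.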
Scientific and Statistical Data Mining (3)
Statistical Data Mining (4)
www.spss.com/datamine/factor.htm
Discriminant analysis
Statistical Data Mining (5)