Unit-5 Outlier Analysis
Concepts and Techniques
What Are Outliers?
Outlier: A data object that deviates significantly from the normal
objects as if it were generated by a different mechanism
Ex.: Unusual credit card purchases; in sports: Michael Jordan, Wayne
Gretzky, ...
Outliers are different from the noise data
Noise is random error or variance in a measured variable
Applications: customer segmentation, medical analysis, etc.
Types of Outliers (I)
Three kinds: global, contextual and collective outliers
Global outlier (or point anomaly)
An object is a global outlier if it significantly deviates from the rest of the data set
Ex. Intrusion detection in computer networks
Issue: Find an appropriate measurement of deviation
Challenge: the border between normal objects and outliers is often a gray area
Noise may blur the distinction between normal objects and outliers; it may
help hide outliers and reduce the effectiveness of outlier detection
Understandability
Understand why these are outliers: Justification of the detection
Categorizing outlier detection methods:
Supervised, semi-supervised vs. unsupervised methods
Based on assumptions about normal data and outliers:
Statistical, proximity-based, and clustering-based methods
Outlier Detection I: Supervised Methods
Modeling outlier detection as a classification problem
Samples examined by domain experts used for training & testing
Methods for learning a classifier for outlier detection effectively:
Model normal objects & report those not matching the model as
outliers, or
Model outliers and treat those not matching the model as normal
Challenges
Imbalanced classes, i.e., outliers are rare: Boost the outlier class
and make up some artificial outliers
Catch as many outliers as possible, i.e., recall is more important
than accuracy (i.e., not mislabeling normal objects as outliers)
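The two ideas above (boosting the rare outlier class, then classifying) can be sketched on toy data. Everything here is hypothetical: the 1-D samples, the oversampling factor, and the simple 1-nearest-neighbor-style classifier are all illustrative choices, not a prescribed method.

```python
# Sketch of supervised outlier detection with an imbalanced class:
# duplicate the rare outlier class, then classify with a k-NN vote.
def oversample(samples, labels, minority=1, factor=5):
    """Boost the rare outlier class by duplicating its samples."""
    out_s, out_l = list(samples), list(labels)
    for x, y in zip(samples, labels):
        if y == minority:
            out_s += [x] * (factor - 1)
            out_l += [y] * (factor - 1)
    return out_s, out_l

def knn_predict(train_x, train_y, query, k=3):
    """Classify by majority vote among the k nearest training samples."""
    dists = sorted((abs(x - query), y) for x, y in zip(train_x, train_y))
    votes = [y for _, y in dists[:k]]
    return max(set(votes), key=votes.count)

# Toy 1-D data: normal objects near 10, one labeled outlier at 100
X = [9.5, 10.1, 10.4, 9.8, 10.0, 100.0]
y = [0, 0, 0, 0, 0, 1]            # 1 = outlier (rare class)
Xb, yb = oversample(X, y)          # boost the outlier class
print(knn_predict(Xb, yb, 95.0))   # 1: a value of 95 is flagged as an outlier
```

Without the oversampling step, the single outlier example would be outvoted by normal neighbors for any k > 1, which is exactly the imbalance problem described above.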
Outlier Detection II: Unsupervised Methods
Assume the normal objects are somewhat "clustered" into multiple
groups, each having some distinct features
An outlier is expected to be far away from any groups of normal objects
Weakness: Cannot detect collective outliers effectively
Normal objects may not share any strong patterns, but the collective
outliers may share high similarity in a small area
Costly, since clustering is done first, yet there are far fewer outliers than
normal objects
Newer methods: tackle outliers directly
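The clustering-based idea above can be sketched in a few lines: once cluster centers are known, an object far from every center is a candidate outlier. The 1-D data, the pre-computed centers, and the distance cutoff are all assumptions for illustration.

```python
# Sketch of clustering-based unsupervised outlier detection:
# flag objects far from every cluster center.
def outlier_score(point, centers):
    """Distance to the nearest cluster center; large => likely outlier."""
    return min(abs(point - c) for c in centers)

data = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8, 12.0]   # 12.0 sits far from both groups
centers = [1.0, 5.0]                           # assume clustering already ran
threshold = 3.0                                # assumed cutoff
outliers = [x for x in data if outlier_score(x, centers) > threshold]
print(outliers)   # [12.0]
```

Note the weakness mentioned above: if several anomalous points landed close together, they could form their own "cluster" and escape this check, which is why collective outliers need separate treatment.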
Outlier Detection III: Semi-Supervised Methods
Situation: In many applications, the amount of labeled data is often
small: labels could be on outliers only, normal objects only, or both
Semi-supervised outlier detection: regarded as an application of
semi-supervised learning
If some labeled normal objects are available
Use the labeled examples and the proximate unlabeled objects to
train a model for normal objects
Those not fitting the model of normal objects are detected as outliers
If only some labeled outliers are available, a small number of labeled
outliers may not cover the possible outliers well
To improve the quality of outlier detection, one can get help from
models for normal objects learned from unsupervised methods
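The "labeled normals plus proximate unlabeled objects" recipe above can be sketched as follows. The toy values, the 2.0 proximity radius, and the z-score cutoff of 3 are assumptions, and the normal model here is simply a single Gaussian fit.

```python
# Sketch: semi-supervised detection with labeled *normal* objects only.
# Fit a Gaussian to the labeled normals plus nearby unlabeled objects,
# then flag objects that do not fit the model.
import statistics

labeled_normal = [10.0, 9.8, 10.3, 10.1]
unlabeled = [9.9, 10.2, 25.0]

# Absorb unlabeled objects close to the normal model into the training set
mu0 = statistics.mean(labeled_normal)
train = labeled_normal + [x for x in unlabeled if abs(x - mu0) < 2.0]

mu, sigma = statistics.mean(train), statistics.stdev(train)
def z(x):
    return abs(x - mu) / sigma

outliers = [x for x in unlabeled if z(x) > 3.0]
print(outliers)  # [25.0]
```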
Outlier Detection (1): Statistical Methods
Statistical methods (also known as model-based methods) assume that the normal
data follow some statistical model (a stochastic model)
The data not following the model are outliers.
Outlier Detection (2): Proximity-Based Methods
An object is an outlier if the nearest neighbors of the object are far away, i.e., the
proximity of the object significantly deviates from the proximity of most of the
other objects in the same data set
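A common way to make "nearest neighbors are far away" concrete is to score each object by the distance to its k-th nearest neighbor. This is a minimal 1-D sketch with toy data and an assumed k = 2; real implementations would use a multidimensional distance and an index structure.

```python
# Sketch of proximity-based detection: score each object by the distance
# to its k-th nearest neighbor; unusually large scores indicate outliers.
def knn_distance(data, i, k=2):
    dists = sorted(abs(data[i] - x) for j, x in enumerate(data) if j != i)
    return dists[k - 1]

data = [1.0, 1.1, 0.9, 1.2, 8.0]
scores = [knn_distance(data, i) for i in range(len(data))]
print(scores)
# 8.0 gets by far the largest score, so it is reported as an outlier
```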
Chapter 12. Outlier Analysis
Outlier and Outlier Analysis
Outlier Detection Methods
Statistical Approaches
Statistical Approaches
Statistical approaches assume that the objects in a data set are
generated by a stochastic process (a generative model)
Idea: learn a generative model fitting the given data set, and then
identify the objects in low probability regions of the model as outliers
Methods are divided into two categories: parametric vs. non-parametric
Parametric method
Assumes that the normal data is generated by a parametric distribution
with parameter θ
Non-parametric method
Does not assume an a-priori statistical model; determines the model
from the input data
Parametric Methods I: Grubbs' Test
Univariate outlier detection: Grubbs' test (maximum normed residual
test) ─ a statistical method assuming a normal distribution
For each object x in a data set, compute its z-score z = |x − x̄| / s, where x̄
and s are the sample mean and standard deviation; x is an outlier if
z ≥ ((N − 1)/√N) √( t²α/(2N),N−2 / (N − 2 + t²α/(2N),N−2) )
where N is the number of objects and tα/(2N),N−2 is the critical value of the
t-distribution at significance level α/(2N) with N − 2 degrees of freedom
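The test can be sketched on toy data. Computing the exact cutoff requires a t-distribution critical value; to keep the example self-contained, G_crit below is a hard-coded assumed value (roughly the two-sided 5% level for N = 8) rather than computed from the t-distribution.

```python
# Sketch of Grubbs' test (maximum normed residual test) on toy data.
import statistics

data = [10.0, 10.2, 9.9, 10.1, 9.8, 10.3, 10.0, 15.0]
N = len(data)
mean, sd = statistics.mean(data), statistics.stdev(data)

G = max(abs(x - mean) for x in data) / sd   # maximum normed residual
G_crit = 2.13                               # assumed critical value for N = 8
print(G > G_crit)  # True: 15.0 is declared an outlier
```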
Parametric Methods II: Detection of Multivariate Outliers
Multivariate data: A data set involving two or more attributes or
variables
Transform the multivariate outlier detection task into a univariate outlier
detection problem
Method 1. Compute the Mahalanobis distance
Let ō be the mean vector for a multivariate data set. The Mahalanobis
distance for an object o to ō is MDist(o, ō) = (o − ō)ᵀ S⁻¹ (o − ō),
where S is the covariance matrix
Use Grubbs' test on this measure to detect outliers
Method 2. Use the χ²-statistic:
χ² = Σⁿᵢ₌₁ (oᵢ − Eᵢ)² / Eᵢ
where oᵢ is the value of o in the i-th dimension, Eᵢ is the mean of the
i-th dimension among all objects, and n is the dimensionality
If the χ²-statistic is large, then object o is an outlier
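The χ²-statistic above is easy to compute directly. This sketch uses toy 2-D data; in practice the statistic would be compared against a chosen threshold rather than just ranked.

```python
# Sketch of the chi-square statistic for multivariate outlier detection:
# compare each object against the per-dimension means.
def chi_square_stat(o, means):
    return sum((oi - ei) ** 2 / ei for oi, ei in zip(o, means))

data = [(2.0, 4.0), (2.2, 3.9), (1.9, 4.1), (8.0, 9.0)]
n_dims = 2
means = [sum(o[i] for o in data) / len(data) for i in range(n_dims)]
scores = [chi_square_stat(o, means) for o in data]
print(scores)  # the last object gets a much larger statistic than the others
```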
Parametric Methods III: Using a Mixture of Parametric Distributions
Assuming the data are generated by a single normal distribution
can sometimes be an oversimplification
Example: objects lying between two clusters cannot be captured as
outliers, since they are close to the estimated mean
To overcome this problem, assume the normal data is generated by two
normal distributions θ₁ = (μ₁, σ₁) and θ₂ = (μ₂, σ₂). For any object o in the
data set, the probability that o is generated by the mixture of the two
distributions is
Pr(o | θ₁, θ₂) = fθ₁(o) + fθ₂(o)
where fθ₁ and fθ₂ are the probability density functions of θ₁ and θ₂
Then use EM algorithm to learn the parameters μ1, σ1, μ2, σ2 from data
An object o is an outlier if it does not belong to any cluster
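The EM step above can be sketched in 1-D. This is a deliberately minimal version: equal, fixed mixing weights, hand-picked initial guesses, a floor on the standard deviations, and toy data with one object (3.0) sitting between the two clusters.

```python
# Minimal 1-D sketch of fitting a two-Gaussian mixture with EM and
# flagging the lowest-probability object.
import math

def pdf(x, mu, sigma):
    """Normal probability density function."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9, 5.1, 3.0]
mu1, s1, mu2, s2 = 0.0, 1.0, 6.0, 1.0        # rough initial guesses

for _ in range(50):                           # EM iterations
    # E-step: responsibility of component 1 for each point
    r = [pdf(x, mu1, s1) / (pdf(x, mu1, s1) + pdf(x, mu2, s2)) for x in data]
    # M-step: re-estimate means and standard deviations
    w1, w2 = sum(r), sum(1 - ri for ri in r)
    mu1 = sum(ri * x for ri, x in zip(r, data)) / w1
    mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / w2
    s1 = max(0.1, math.sqrt(sum(ri * (x - mu1) ** 2 for ri, x in zip(r, data)) / w1))
    s2 = max(0.1, math.sqrt(sum((1 - ri) * (x - mu2) ** 2 for ri, x in zip(r, data)) / w2))

# An object is suspicious if its mixture density is low
density = [pdf(x, mu1, s1) + pdf(x, mu2, s2) for x in data]
print(data[density.index(min(density))])  # 3.0, the point between the clusters
```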
Non-Parametric Methods: Detection Using Histograms
The model of normal data is learned from the
input data without any a priori structure.
Often makes fewer assumptions about the data,
and thus can be applicable in more scenarios
Outlier detection using histogram:
Example: consider a histogram of purchase amounts in transactions
A transaction in the amount of $7,500 is an outlier, since only 0.2% of
transactions have an amount higher than $5,000
Problem: Hard to choose an appropriate bin size for histogram
Too small a bin size → normal objects fall in empty/rare bins: false positives
Too big a bin size → outliers fall in some frequent bins: false negatives
Solution: Adopt kernel density estimation to estimate the probability density
distribution of the data. If the estimated density function is high, the object is
likely normal. Otherwise, it is likely an outlier.
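The histogram approach can be sketched directly: bucket the values and report objects that land in rare bins. The purchase amounts, the bin width, and the frequency cutoff below are all assumed toy values, and the bin-size sensitivity discussed above applies to the choice of bin_width here too.

```python
# Sketch of histogram-based detection: an object falling in a bin with
# very low relative frequency is reported as an outlier.
from collections import Counter

amounts = [60, 65, 70, 75, 80, 85, 90, 95, 100, 7500]
bin_width = 1000
bins = Counter(a // bin_width for a in amounts)

def is_outlier(a, min_fraction=0.15):
    """Flag values whose bin holds too small a fraction of the data."""
    return bins[a // bin_width] / len(amounts) < min_fraction

print([a for a in amounts if is_outlier(a)])  # [7500]
```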
Major Statistical Data Mining Methods
Regression
Generalized Linear Models
Analysis of Variance
Mixed-Effect Models
Factor Analysis
Discriminant Analysis
Survival Analysis
Statistical Data Mining (1)
Scientific and Statistical Data Mining (2)
Regression
Mixed-effect models
For analyzing grouped data, i.e. data that can be classified
according to one or more grouping variables
Typically describe relationships between a response
variable and some covariates in data grouped according to
one or more factors
Scientific and Statistical Data Mining (3)
Regression trees
Binary trees used for classification
and prediction
Similar to decision trees: tests are
performed at the internal nodes
In a regression tree the mean of the
objective attribute is computed and
used as the predicted value
Analysis of variance
Analyze experimental data for two or
more populations described by a
numeric response variable and one
or more categorical variables
(factors)
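The regression-tree idea above, predicting with the mean of the objective attribute at each leaf, can be sketched with a single split (a "stump"). The data and the hand-picked split point are toy assumptions; a real tree learner would search for the best split and recurse.

```python
# Sketch of a one-split regression tree: each leaf predicts the mean
# of the objective attribute of the training objects that reach it.
def fit_stump(xs, ys, split):
    left = [y for x, y in zip(xs, ys) if x <= split]
    right = [y for x, y in zip(xs, ys) if x > split]
    return sum(left) / len(left), sum(right) / len(right)

xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
left_mean, right_mean = fit_stump(xs, ys, split=5)

def predict(x, split=5):
    return left_mean if x <= split else right_mean

print(predict(2), predict(11))  # 6.0 21.0
```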
Statistical Data Mining (4)
Factor analysis
Determines which variables are combined to generate a given factor
Survival analysis
Predicts the probability that a patient undergoing a medical treatment
would survive at least to time t (life span prediction)
Data Mining Applications
Industries
Data Mining in Science and Engineering
Data Mining for Financial Data Analysis (I)
Financial data collected in banks and financial institutions
are often relatively complete, reliable, and of high quality
Design and construction of data warehouses for
multidimensional data analysis and data mining
View the debt and revenue changes by month, by
region, by sector, and by other factors
Access statistical information such as max, min, total,
average, trend, etc.
Loan payment prediction/consumer credit policy analysis
feature selection and attribute relevance ranking
Data Mining and Recommender Systems
Recommender systems: Personalization, making product
recommendations that are likely to be of interest to a user
Approaches: Content-based, collaborative, or their hybrid
Content-based: Recommends items that are similar to items the
user preferred or queried in the past
Collaborative filtering: Considers a user's social environment, i.e., the
opinions of other customers who have similar tastes or preferences
Data mining and recommender systems
Users C × items S: extrapolate from known ratings to predict unknown
ratings for user-item combinations
Memory-based method often uses k-nearest neighbor approach
Model-based method uses a collection of ratings to learn a model
(e.g., probabilistic models, clustering, Bayesian networks, etc.)
Hybrid approaches integrate both to improve performance (e.g.,
using ensemble)
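The memory-based k-nearest-neighbor approach above can be sketched on a toy ratings matrix. The users, items, and the crude similarity measure (a count of agreeing ratings) are all illustrative assumptions; real systems use similarities such as cosine or Pearson correlation.

```python
# Sketch of memory-based collaborative filtering: predict a missing
# rating from the k users with the most similar known ratings.
ratings = {                       # user -> {item: rating}
    "alice": {"a": 5, "b": 3, "c": 4},
    "bob":   {"a": 5, "b": 3, "c": 4, "d": 5},
    "carol": {"a": 1, "b": 5, "c": 2, "d": 1},
}

def similarity(u, v):
    """Toy similarity: number of items the two users rate identically."""
    common = set(ratings[u]) & set(ratings[v])
    return sum(ratings[u][i] == ratings[v][i] for i in common)

def predict(user, item, k=1):
    # Average the ratings of the k most similar users who rated the item
    neighbors = sorted(
        (u for u in ratings if u != user and item in ratings[u]),
        key=lambda u: similarity(user, u), reverse=True)[:k]
    return sum(ratings[u][item] for u in neighbors) / len(neighbors)

print(predict("alice", "d"))  # 5.0, following alice's nearest neighbor bob
```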
Summary
We present a high-level overview of mining complex data types
Statistical data mining methods, such as regression, generalized linear
models, analysis of variance, etc., are popularly adopted
Researchers also try to build theoretical foundations for data mining
Visual/audio data mining has been popular and effective
Application-based mining integrates domain-specific knowledge with
data analysis techniques and provides mission-specific solutions
Ubiquitous data mining and invisible data mining are penetrating our
daily lives
Privacy and data security are important issues in data mining, and
privacy-preserving data mining has been developed recently
Our discussion on trends in data mining shows that data mining is a
promising, young field, with great, strategic importance