
DATA MINING UNIT 4:

An outlier is a data object that deviates significantly from the rest of the data objects and
behaves in a different manner. Outliers can be caused by measurement or execution errors.
The analysis of outlier data is referred to as outlier analysis or outlier mining.

An outlier cannot simply be dismissed as noise or an error. Instead, outliers are suspected
of not being generated by the same mechanism as the rest of the data objects.

Outliers are of three types, namely:

1. Global (or Point) Outliers
2. Collective Outliers
3. Contextual (or Conditional) Outliers

1. Global Outliers

1. Definition: Global outliers are data points that deviate significantly from the overall
distribution of a dataset.
2. Causes: Errors in data collection, measurement errors, or truly unusual events can
result in global outliers.
3. Impact: Global outliers can distort data analysis results and affect machine learning
model performance.
4. Detection: Techniques include statistical methods (e.g., z-score, Mahalanobis
distance), machine learning algorithms (e.g., isolation forest, one-class SVM), and data
visualization techniques (see the z-score sketch below).
5. Handling: Options may include removing or correcting outliers, transforming data, or
using robust methods.
6. Considerations: Carefully considering the impact of global outliers is crucial for
accurate data analysis and machine learning model outcomes.

The red data point is a global outlier.
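
A minimal z-score sketch in Python (NumPy only) makes the detection idea in point 4 concrete; the synthetic data and the threshold of 3 are illustrative assumptions, not fixed choices:

```python
import numpy as np

# Synthetic sample: 20 readings near 10, plus one injected global
# outlier (95.0). The values and threshold are illustrative only.
rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=10.0, scale=0.5, size=20), 95.0)

# z-score: how many standard deviations each point lies from the mean.
z = (data - data.mean()) / data.std()

# A common (but arbitrary) rule flags |z| > 3 as a global outlier.
print(data[np.abs(z) > 3])  # -> [95.]
```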

2. Collective Outliers

1. Definition: Collective outliers are groups of data points that collectively deviate
significantly from the overall distribution of a dataset.
2. Characteristics: Collective outliers may not be outliers when considered individually,
but as a group they exhibit unusual behavior (see the sketch below).
3. Detection: Techniques for detecting collective outliers include clustering algorithms,
density-based methods, and subspace-based approaches.
4. Impact: Collective outliers can represent interesting patterns or anomalies in data that
may require special attention or further investigation.
5. Handling: Handling collective outliers depends on the specific use case and may
involve further analysis of the group behavior, identification of contributing factors, or
considering contextual information.
6. Considerations: Detecting and interpreting collective outliers can be more complex
than individual outliers, as the focus is on group behavior rather than individual data
points. Proper understanding of the data context and domain knowledge is crucial for
effective handling of collective outliers.

The red data points as a whole are collective outliers.
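
As a hedged sketch of the idea (not a standard named algorithm): in a heart-rate-like series, a window whose variance collapses can be collectively anomalous even though every individual reading looks normal. The data, window size, and variance cutoff below are all illustrative assumptions:

```python
import numpy as np

# Hypothetical series: normal variation around 72 bpm, with ten
# identical readings spliced in. Each value alone is unremarkable,
# but the flat run deviates collectively from expected behaviour.
rng = np.random.default_rng(1)
series = rng.normal(loc=72, scale=3, size=60)
series[30:40] = 72.0

window = 10
for start in range(len(series) - window + 1):
    chunk = series[start:start + window]
    # Flag a window whose variance is suspiciously close to zero.
    if chunk.std() < 0.5:
        print(f"collective outlier near window [{start}, {start + window})")
        break
```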


3. Contextual Outliers

1. Definition: Contextual outliers are data points that deviate significantly from the
expected behavior within a specific context or subgroup.
2. Characteristics: Contextual outliers may not be outliers when considered in the entire
dataset, but they exhibit unusual behavior within a specific context or subgroup.
3. Detection: Techniques for detecting contextual outliers include contextual clustering,
contextual anomaly detection, and context-aware machine learning approaches.
4. Contextual Information: Contextual information, such as time, location, or other
relevant factors, is crucial in identifying contextual outliers.
5. Impact: Contextual outliers can represent unusual or anomalous behavior within a
specific context, which may require further investigation or attention.
6. Handling: Handling contextual outliers may involve considering the contextual
information, contextual normalization or transformation of data, or using context-specific
models or algorithms.
7. Considerations: Proper understanding of the context and domain-specific knowledge
is crucial for accurate detection and interpretation of contextual outliers, as they may vary
based on the specific context or subgroup being considered.

A low temperature value in June is a contextual outlier, because the same value in December is not an outlier.
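
A minimal sketch of this temperature example, assuming the readings live in a pandas DataFrame with a month column; the values and the threshold of 1.5 are illustrative:

```python
import pandas as pd

# Hypothetical monthly temperatures (°C). A reading of 2.0 is normal
# for December but anomalous for June.
df = pd.DataFrame({
    "month": ["Jun"] * 5 + ["Dec"] * 5,
    "temp": [28.0, 30.0, 29.0, 31.0, 2.0,  # 2.0 in June: contextual outlier
             1.0, 3.0, 2.0, 0.0, 2.0],
})

# Score each reading against its own context (month), not the whole year.
grouped = df.groupby("month")["temp"]
df["z_in_context"] = (df["temp"] - grouped.transform("mean")) / grouped.transform("std")

print(df[df["z_in_context"].abs() > 1.5])  # flags only the June 2.0
```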

Outlier detection challenges:


Outlier detection poses several challenges due to the complexity and variability of real-world datasets.
Some of the key challenges include:

1. Scalability: Outlier detection algorithms must be able to handle large-scale datasets efficiently. As
dataset sizes increase, the computational complexity of outlier detection algorithms can become a
bottleneck, requiring scalable and parallelizable approaches to maintain acceptable performance.
2. High Dimensionality: Many real-world datasets have high dimensionality, meaning they contain
a large number of features or attributes. In high-dimensional spaces, the notion of distance and
similarity becomes less intuitive, making it challenging to define and detect outliers accurately.
3. Data Quality: Outlier detection algorithms are sensitive to noise and errors in the data. Noisy or
corrupted data can lead to false positives (normal data incorrectly classified as outliers) or false
negatives (outliers missed by the algorithm), reducing the effectiveness of outlier detection
methods.
4. Complex Data Patterns: Outliers may not always exhibit simple patterns or deviations from the
norm. In some cases, outliers may be part of complex data patterns or clusters, making them
difficult to detect using traditional statistical methods or distance-based approaches.
5. Imbalanced Data: In datasets where outliers are rare compared to normal data points,
imbalanced class distributions can pose a challenge for outlier detection algorithms. Traditional
statistical methods may struggle to distinguish outliers from the majority class, leading to biased
results.
6. Concept Drift: In dynamic or evolving environments, the underlying data distribution may change
over time, leading to concept drift. Outlier detection models trained on historical data may become
less effective when applied to new data, requiring continuous monitoring and adaptation to detect
emerging outliers.
7. Interpretability: While outlier detection algorithms can effectively identify anomalies in the data,
interpreting the reasons behind outliers and understanding their significance can be challenging.
Domain knowledge and contextual information are often necessary to interpret the implications of
detected outliers accurately.
8. Computational Cost: Some outlier detection algorithms, especially those based on complex
models or iterative optimization techniques, can be computationally expensive. Balancing the
trade-off between computational cost and detection accuracy is essential, especially for real-time
or resource-constrained applications.

Statistical approaches for outlier detection:


In data mining, outlier detection is a crucial step in understanding and cleaning datasets. There are
several statistical approaches for outlier detection, each with its own strengths and weaknesses. Here
are some commonly used statistical methods:

1. Standard Deviation Method:


• Outliers are defined as data points that lie beyond a certain number of standard deviations from
the mean.
• A common rule of thumb flags points more than 2 or 3 standard deviations away from the mean.
2. Interquartile Range (IQR) Method:
• The interquartile range is the range between the first quartile (25th percentile) and the third quartile
(75th percentile) of the data.
• Outliers are identified as points that fall below the first quartile minus a specified multiplier times the
IQR, or above the third quartile plus the same multiplier times the IQR (a runnable sketch follows this list).
3. Box Plot Method:
• Box plots visually represent the distribution of data based on quartiles.
• Outliers are identified as points that fall outside of the whiskers of the box plot, typically defined as 1.5
times the IQR.
4. Z-Score:
• Z-score measures how many standard deviations a data point is from the mean.
• Outliers are identified as data points with an absolute Z-score greater than a threshold (often 2 or 3).
5. Density-Based Methods:
• These methods identify outliers based on the density of data points in a neighborhood.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular density-based
clustering algorithm that can be used for outlier detection.
6. Robust Statistical Methods:
• These methods are less sensitive to outliers compared to traditional statistical methods.
• Examples include robust regression techniques like RANSAC (RANdom SAmple Consensus) and robust
estimation of covariance matrices.
7. Distance-Based Methods:
• These methods compute the distance between data points and identify outliers based on unusually
large distances.
• For example, the k-nearest neighbors algorithm can be used to detect outliers based on the distance to
the kth nearest neighbor.
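
As a runnable sketch of the IQR method from item 2, with the conventional multiplier of 1.5 (the data values are illustrative):

```python
import numpy as np

# Small illustrative sample with one suspiciously large value (40).
data = np.array([12, 14, 13, 15, 14, 13, 12, 40, 14, 13])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Tukey's fences: the conventional multiplier is 1.5.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Points outside [lower, upper] are flagged; 40 falls above the fence.
print(data[(data < lower) | (data > upper)])  # -> [40]
```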

Proximity-Based Methods in Data Mining


Proximity-based methods are an important technique in data mining. They are employed to
find patterns in large databases by scanning documents for certain keywords and phrases.
They are highly prevalent because they do not require expensive hardware or much storage
space, and they scale up efficiently as the size of databases increases.

Advantages of Proximity-Based Methods:

1. Proximity-based methods make use of machine learning techniques, in which algorithms
are trained to respond to certain patterns.
2. Using a random sample of documents, the machine learning algorithm analyzes the
keywords and phrases used in them and makes predictions about the probability that
these words appear together across all documents.
3. Proximity can be measured by computing a similarity score between two collections of
training data and then comparing these scores. The algorithm then tries to compute the
maximum similarity score for two distinct sets of training items.

Disadvantages of Proximity-Based Methods:

1. Important words may not be as close in proximity as we expected.
2. Over-segmentation of documents into phrases. To counter these problems, a lexical
chain-based algorithm has been proposed.
Proximity-based methods perform very well for finding sets of documents that contain certain
words based on background knowledge. But performance is limited when the background
knowledge has not been pre-classified into categories.
To find sets of documents containing certain categories, one must assign categorical values to
each document and then run proximity-based methods on these documents as training data,
hoping for accurate representations of the categories.
One way to identify outliers is by calculating their distance from the rest of the data set; a
related approach, which instead examines how densely a point's neighborhood is populated,
is known as density-based outlier detection.
Types of Proximity-Based Outlier Detection Methods:
• Distance-based outlier detection methods: A distance-based outlier detection method is
a statistical technique. Such methods typically measure distances between individual data
points and the rest of their respective groups. Many approaches also have a configurable
error threshold for determining when a point is an outlier. Many distance-based outlier
methods have been developed; they use distance measures such as Euclidean, Manhattan,
or Mahalanobis distance to compute distances between individual points and detect
outliers (a k-NN distance sketch follows this section). The following three outlier detection
methods have been selected based on their performance:
• WLSMV (Weighted Least Squares Minimization) method
• SVM (Support Vector Machines) method
• RMSProp method
• Density-based outlier detection methods: A density-based outlier detection method
checks the density of an object and its closest objects. This method is used in many
applications, including malware detection, awareness, behavior analysis, and network
intrusion detection. One limitation is that the points detected may turn out not to be
genuine outliers but simply part of a much larger distribution of the data. Another
limitation is that the density function must be defined and clearly understood, and its
parameters properly set, before implementation.
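
Here is a hedged sketch of a simple distance-based detector using scikit-learn's NearestNeighbors; the value of k, the synthetic data, and the threshold rule are illustrative assumptions rather than a standard recipe:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 2-D data: a tight cluster plus one far-away point.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, size=(30, 2)),
               [[8.0, 8.0]]])                     # injected outlier

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own 0th neighbor
dist, _ = nn.kneighbors(X)
kth_dist = dist[:, -1]                            # distance to the k-th neighbor

# Flag points whose k-NN distance is far above the typical value.
threshold = np.median(kth_dist) + 3 * kth_dist.std()
print(np.where(kth_dist > threshold)[0])          # -> [30], the injected point
```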

Clustering-based approaches for outlier detection:


Clustering-based approaches for outlier detection in data mining involve leveraging
clustering algorithms to group data points into clusters and then identifying outliers as data
points that do not belong to any cluster or form clusters of their own. These approaches
utilize the notion that outliers often lie in sparsely populated regions of the data or exhibit
dissimilarities with the majority of the data points. Here are some common clustering-based
methods for outlier detection:

1. **Density-Based Outlier Detection**:


- **DBSCAN (Density-Based Spatial Clustering of Applications with Noise)**:
DBSCAN groups together densely connected data points into clusters and identifies outliers
as noise points that do not belong to any cluster. Outliers are typically located in regions
with low density or insufficient neighbors to form a cluster (a runnable sketch follows this list).

2. **Distance-Based Outlier Detection**:


- **K-Means Clustering**: K-Means partitions the dataset into K clusters based on the
similarity of data points to the cluster centroids. Outliers are identified as data points that
are not assigned to any cluster or have large distances to the nearest cluster centroid.
- **Hierarchical Clustering**: Hierarchical clustering builds a tree-like hierarchy of
clusters by recursively merging or splitting clusters based on their similarity. Outliers can
be identified as data points that form singleton clusters or lie in separate branches of the
hierarchical tree.

3. **Graph-Based Outlier Detection**:


- **Minimum Spanning Tree (MST)**: MST constructs a tree that connects all data
points with minimum total edge weights. Outliers are identified as data points with high
edge weights or long distances to the rest of the tree.
- **Isolation Forest**: Isolation Forest builds an ensemble of decision trees to isolate
outliers by recursively partitioning the dataset into subsets. Outliers are identified as data
points that require fewer splits to isolate them from the rest of the dataset.

4. **Subspace Clustering-Based Outlier Detection**:


- **CLIQUE (CLustering In QUEst)**: CLIQUE identifies clusters in subspaces of the data
that exhibit high density. Outliers are detected as data points that do not belong to any
dense subspace or have low membership probabilities in any subspace cluster.

5. **Cluster-Based Outlier Factor**:


- **COF (Cluster-Based Outlier Factor)**: COF measures the degree to which a data
point deviates from its cluster members in terms of distance. Outliers are identified as data
points with high COF scores, indicating significant deviations from the cluster center.

6. **Probabilistic Model-Based Outlier Detection**:


- **EM Clustering (Expectation-Maximization)**: EM clustering models the dataset as
a mixture of multivariate Gaussian distributions. Outliers are identified as data points with
low posterior probabilities of belonging to any cluster or having low likelihoods under the
mixture model.
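
A minimal DBSCAN sketch using scikit-learn, illustrating item 1 above; the eps and min_samples values are illustrative and must be tuned per dataset:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D data: two dense blobs plus two isolated points.
rng = np.random.default_rng(3)
X = np.vstack([
    rng.normal(0, 0.3, size=(25, 2)),   # blob 1
    rng.normal(5, 0.3, size=(25, 2)),   # blob 2
    [[2.5, 9.0], [-4.0, -4.0]],         # isolated points
])

# eps and min_samples are illustrative; they must be tuned per dataset.
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

# DBSCAN marks noise points (members of no cluster) with label -1.
print(np.where(labels == -1)[0])        # -> [50 51]
```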

These clustering-based approaches provide effective methods for outlier detection in
various types of datasets and can be adapted to different domains and applications. By
leveraging clustering algorithms and analyzing the clustering structure of the data, these
methods can effectively identify anomalies and outliers that deviate from the majority of
the data points.

Classification-based approaches for outlier detection:


Classification-based approaches for outlier detection in data mining involve training a
supervised learning model to classify data points as either normal or outlier. These approaches
require labeled training data, where outliers are explicitly identified or labeled, and then use
this labeled data to train a classifier. Here's how classification-based outlier detection works:

1. **Labeling Data**: The first step in classification-based outlier detection is to label the data.
This typically involves identifying outliers in the dataset and assigning them a label (e.g., 1 for
outliers, 0 for normal data points). The labeled dataset is then used for training the classifier.

2. **Feature Extraction and Selection**: Next, features are extracted from the data and
selected based on their relevance to outlier detection. Feature engineering techniques may be
applied to transform or combine raw features to improve the performance of the classifier.

3. **Training a Classifier**: A supervised learning classifier, such as logistic regression,
decision trees, random forests, support vector machines (SVM), or neural networks, is trained
using the labeled dataset. The classifier learns to distinguish between normal data points and
outliers based on the features extracted from the data (a runnable sketch follows this list).

4. **Model Evaluation**: The trained classifier is evaluated using evaluation metrics such as
accuracy, precision, recall, F1-score, or area under the ROC curve (AUC). Cross-validation
techniques may be used to assess the generalization performance of the model and identify
potential overfitting.

5. **Predicting Outliers**: Once the classifier is trained and evaluated, it can be used to
predict outliers in new, unseen data. Data points that are classified as outliers by the classifier
are flagged as anomalous and may require further investigation or monitoring.

6. **Model Tuning and Optimization**: The performance of the classifier may be further
improved by tuning hyperparameters, optimizing feature selection, or using ensemble
techniques to combine multiple classifiers. Iterative refinement may be performed to enhance
the robustness and accuracy of the outlier detection model.
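
A minimal sketch of steps 1-5 using scikit-learn; the synthetic labeled data and the choice of a random forest are illustrative assumptions, not a prescribed setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Step 1: synthetic labeled data -- 1 marks outliers, 0 marks normal points.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(200, 3)),   # normal points
               rng.normal(6, 1, size=(20, 3))])   # labeled outliers
y = np.array([0] * 200 + [1] * 20)

# Steps 3-4: train a classifier and evaluate it on held-out data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))

# Step 5: predict outliers in new, unseen data.
print(clf.predict([[0.1, -0.2, 0.3], [6.2, 5.8, 6.1]]))  # -> [0 1]
```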

Classification-based outlier detection offers several advantages, including the ability to
leverage labeled data for training, the flexibility to handle different types of features and data
distributions, and the potential for high accuracy and interpretability. However, it also has
limitations, such as the reliance on labeled training data, the need for feature engineering, and
the risk of overfitting to the training data. Careful consideration of these factors is essential
when applying classification-based approaches to outlier detection in real-world applications.

Detecting outliers in multidimensional data:


Detecting outliers in multidimensional data (also known as multivariate outlier detection) requires
techniques that can handle multiple dimensions simultaneously. Here are some methods commonly used
for outlier detection in multidimensional data:

1. **Mahalanobis Distance**:
- Mahalanobis distance measures the distance of a data point from the centroid of the data distribution,
taking into account the covariance structure of the data.
- Points with a Mahalanobis distance exceeding a certain threshold are considered outliers (a runnable sketch follows this list).

2. **Principal Component Analysis (PCA)**:


- PCA is a dimensionality reduction technique that can be used for outlier detection by projecting the
data onto a lower-dimensional subspace.
- Outliers can be identified as data points with large reconstruction errors or points that lie far from the
principal components.

3. **Isolation Forest**:
- Isolation Forest is an ensemble method that constructs isolation trees to isolate outliers.
- Outliers are identified as data points that require fewer splits to isolate them from the rest of the data.

4. **Local Outlier Factor (LOF)**:


- LOF measures the density of data points relative to their neighbors, identifying points with
significantly lower density as outliers.
- It considers the local neighborhood of each data point to detect outliers.

5. **One-Class SVM**:
- One-Class Support Vector Machine (SVM) learns a decision boundary around the majority of the data
points and identifies outliers as points lying outside this boundary.
- It is particularly useful when only normal data is available for training.

6. **Clustering-Based Approaches**:
- Clustering algorithms such as k-means or DBSCAN can be used to cluster the data and identify outliers
as points that do not belong to any cluster or form singleton clusters.

7. **Robust Covariance Estimation**:


- Robust covariance estimation techniques, such as Minimum Covariance Determinant (MCD), estimate
the covariance matrix of the data while downweighting the influence of outliers.
- Outliers are detected based on their influence on the estimated covariance matrix.

8. **Probabilistic Models**:
- Probabilistic models such as Gaussian Mixture Models (GMMs) can be used to model the distribution
of the data and identify outliers as points with low probability under the model.
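
As a runnable sketch of method 1 (Mahalanobis distance) using NumPy and SciPy; the synthetic data and the 97.5% chi-squared cutoff are illustrative assumptions:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical correlated 2-D data plus one multivariate outlier whose
# *combination* of values (high x1 with low x2) breaks the correlation.
rng = np.random.default_rng(5)
cov = [[1.0, 0.8], [0.8, 1.0]]
X = np.vstack([rng.multivariate_normal([0, 0], cov, size=100),
               [[3.0, -3.0]]])

mean = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean

# Squared Mahalanobis distance of every point from the centroid.
d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# Under approximate normality, d2 follows a chi-squared distribution
# with p degrees of freedom; 0.975 is a common (adjustable) quantile.
cutoff = chi2.ppf(0.975, df=X.shape[1])
print(np.where(d2 > cutoff)[0])  # the injected point (index 100) should appear
```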

When applying these methods to multidimensional data, it's essential to consider the characteristics of
the data, such as its distribution, dimensionality, and the presence of correlation between variables.
Additionally, it's often beneficial to use multiple approaches in combination or to customize the method
to the specific properties of the dataset.
