Data Mining Unit 4-1
An outlier is a data object that deviates significantly from the rest of the data objects and
behaves in a different manner. Outliers can be caused by measurement or execution errors.
The analysis of outlier data is referred to as outlier analysis or outlier mining.
An outlier cannot simply be dismissed as noise or error; rather, outliers are suspected of not
having been generated by the same mechanism as the rest of the data objects.
1. Global Outliers
1. Definition: Global outliers are data points that deviate significantly from the overall
distribution of a dataset.
2. Causes: Errors in data collection, measurement errors, or truly unusual events can
result in global outliers.
3. Impact: Global outliers can distort data analysis results and affect machine learning
model performance.
4. Detection: Techniques include statistical methods (e.g., z-score, Mahalanobis
distance), machine learning algorithms (e.g., isolation forest, one-class SVM), and data
visualization techniques; a z-score sketch is given after this list.
5. Handling: Options may include removing or correcting outliers, transforming data, or
using robust methods.
6. Considerations: Carefully considering the impact of global outliers is crucial for
accurate data analysis and machine learning model outcomes.
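A minimal sketch of the z-score technique mentioned in point 4, assuming a single numeric attribute; the data values and the threshold of 3 standard deviations are illustrative assumptions:

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Flag global outliers: points more than `threshold` standard
    deviations away from the mean of the whole dataset."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# Illustrative data: twelve readings near 10 plus one extreme value.
data = [9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 9.6, 10.1, 9.9, 10.0, 50.0]
print(zscore_outliers(data))  # only the 50.0 reading is flagged
```

Mahalanobis distance extends the same idea to several correlated attributes; a multidimensional sketch appears at the end of this unit.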
2. Collective Outliers
1. Definition: Collective outliers are groups of data points that collectively deviate
significantly from the overall distribution of a dataset.
2. Characteristics: Collective outliers may not be outliers when considered individually,
but as a group, they exhibit unusual behavior.
3. Detection: Techniques for detecting collective outliers include clustering algorithms,
density-based methods, and subspace-based approaches; a density-based sketch follows this list.
4. Impact: Collective outliers can represent interesting patterns or anomalies in data that
may require special attention or further investigation.
5. Handling: Handling collective outliers depends on the specific use case and may
involve further analysis of the group behavior, identification of contributing factors, or
considering contextual information.
6. Considerations: Detecting and interpreting collective outliers can be more complex
than individual outliers, as the focus is on group behavior rather than individual data
points. Proper understanding of the data context and domain knowledge is crucial for
effective handling of collective outliers.
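A minimal density-based sketch of the idea in point 3, using scikit-learn's DBSCAN; the toy data, the eps/min_samples values, and the 5% cluster-size cutoff are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Illustrative 2-D data: a large normal group plus a small, tight group far away.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
collective = rng.normal(loc=8.0, scale=0.1, size=(6, 2))   # unusual only as a group
X = np.vstack([normal, collective])

labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(X)

# Flag clusters that are much smaller than the main mass of the data
# (plus DBSCAN noise points) as candidate collective outliers.
unique, counts = np.unique(labels[labels != -1], return_counts=True)
small_clusters = unique[counts < 0.05 * len(X)]
candidate = np.isin(labels, small_clusters) | (labels == -1)
print(f"{candidate.sum()} points flagged as part of a collective outlier group")
```

Each of the six points near (8, 8) looks unremarkable on its own, but as a small isolated cluster the group stands out, which is exactly the collective-outlier pattern described above.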
3. Contextual Outliers
1. Definition: Contextual outliers are data points that deviate significantly from the
expected behavior within a specific context or subgroup.
2. Characteristics: Contextual outliers may not be outliers when considered in the entire
dataset, but they exhibit unusual behavior within a specific context or subgroup.
3. Detection: Techniques for detecting contextual outliers include contextual clustering,
contextual anomaly detection, and context-aware machine learning approaches; a simple
per-context sketch follows this list.
4. Contextual Information: Contextual information such as time, location, or other
relevant factors is crucial in identifying contextual outliers.
5. Impact: Contextual outliers can represent unusual or anomalous behavior within a
specific context, which may require further investigation or attention.
6. Handling: Handling contextual outliers may involve considering the contextual
information, contextual normalization or transformation of data, or using context-specific
models or algorithms.
7. Considerations: Proper understanding of the context and domain-specific knowledge
is crucial for accurate detection and interpretation of contextual outliers, as they may vary
based on the specific context or subgroup being considered.
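A minimal sketch of per-context detection as described in point 3, using pandas to standardise each value within its context (here, the month); the data and the threshold are illustrative assumptions, and the threshold is kept low only because each context group is tiny:

```python
import pandas as pd

# Illustrative data: temperature readings with the month as the context attribute.
# 30 degrees is normal in July but anomalous in January.
df = pd.DataFrame({
    "month": ["Jan"] * 5 + ["Jul"] * 5,
    "temp":  [-2, 0, 1, -1, 30, 29, 31, 30, 28, 32],
})

# Standardise each reading within its own context (month), then flag readings
# that deviate strongly from the typical behaviour of that context.
grouped = df.groupby("month")["temp"]
df["z_in_context"] = (df["temp"] - grouped.transform("mean")) / grouped.transform("std")
df["contextual_outlier"] = df["z_in_context"].abs() > 1.5
print(df)  # only the 30-degree January reading is flagged
```

Note that a global z-score over all ten readings would not flag the January value, since 30 degrees is unremarkable for the dataset as a whole; it is only anomalous within its context.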
Challenges in Outlier Detection
4. Scalability: Outlier detection algorithms must be able to handle large-scale datasets efficiently. As
dataset sizes increase, the computational complexity of outlier detection algorithms can become a
bottleneck, requiring scalable and parallelizable approaches to maintain acceptable performance.
5. High Dimensionality: Many real-world datasets have high dimensionality, meaning they contain
a large number of features or attributes. In high-dimensional spaces, the notion of distance and
similarity becomes less intuitive, making it challenging to define and detect outliers accurately.
6. Data Quality: Outlier detection algorithms are sensitive to noise and errors in the data. Noisy or
corrupted data can lead to false positives (normal data incorrectly classified as outliers) or false
negatives (outliers missed by the algorithm), reducing the effectiveness of outlier detection
methods.
7. Complex Data Patterns: Outliers may not always exhibit simple patterns or deviations from the
norm. In some cases, outliers may be part of complex data patterns or clusters, making them
difficult to detect using traditional statistical methods or distance-based approaches.
8. Imbalanced Data: In datasets where outliers are rare compared to normal data points,
imbalanced class distributions can pose a challenge for outlier detection algorithms. Traditional
statistical methods may struggle to distinguish outliers from the majority class, leading to biased
results.
9. Concept Drift: In dynamic or evolving environments, the underlying data distribution may change
over time, leading to concept drift. Outlier detection models trained on historical data may become
less effective when applied to new data, requiring continuous monitoring and adaptation to detect
emerging outliers.
10. Interpretability: While outlier detection algorithms can effectively identify anomalies in the data,
interpreting the reasons behind outliers and understanding their significance can be challenging.
Domain knowledge and contextual information are often necessary to interpret the implications of
detected outliers accurately.
11. Computational Cost: Some outlier detection algorithms, especially those based on complex
models or iterative optimization techniques, can be computationally expensive. Balancing the
trade-off between computational cost and detection accuracy is essential, especially for real-time
or resource-constrained applications.
Classification-Based Outlier Detection
1. **Labeling Data**: The first step in classification-based outlier detection is to label the data.
This typically involves identifying outliers in the dataset and assigning them a label (e.g., 1 for
outliers, 0 for normal data points). The labeled dataset is then used for training the classifier.
2. **Feature Extraction and Selection**: Next, features are extracted from the data and
selected based on their relevance to outlier detection. Feature engineering techniques may be
applied to transform or combine raw features to improve the performance of the classifier.
3. **Model Training**: A classifier (e.g., a decision tree, random forest, or support vector
machine) is trained on the labeled dataset to learn the boundary between normal data points
and outliers.
4. **Model Evaluation**: The trained classifier is evaluated using evaluation metrics such as
accuracy, precision, recall, F1-score, or area under the ROC curve (AUC). Cross-validation
techniques may be used to assess the generalization performance of the model and identify
potential overfitting.
5. **Predicting Outliers**: Once the classifier is trained and evaluated, it can be used to
predict outliers in new, unseen data. Data points that are classified as outliers by the classifier
are flagged as anomalous and may require further investigation or monitoring.
6. **Model Tuning and Optimization**: The performance of the classifier may be further
improved by tuning hyperparameters, optimizing feature selection, or using ensemble
techniques to combine multiple classifiers. Iterative refinement may be performed to enhance
the robustness and accuracy of the outlier detection model.
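A minimal sketch of this workflow using scikit-learn's RandomForestClassifier; the synthetic labels, features, and hyperparameters are illustrative assumptions rather than a prescribed setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(42)

# Step 1-2 (assumed done): labeled data with 0 = normal, 1 = outlier.
normal = rng.normal(0, 1, size=(500, 3))
outliers = rng.uniform(4, 8, size=(25, 3))
X = np.vstack([normal, outliers])
y = np.array([0] * 500 + [1] * 25)

# Steps 3-4: train a classifier on the labeled data and evaluate on a hold-out split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

# Step 5: flag outliers in new, unseen data.
new_points = np.array([[0.1, -0.2, 0.3], [6.0, 5.5, 7.0]])
print(clf.predict(new_points))  # expected: [0 1]
```

The `class_weight="balanced"` setting is one simple way to address the imbalance discussed earlier, since outliers are rare compared to normal points.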
Outlier Detection Methods for Multidimensional Data
1. **Mahalanobis Distance**:
- Mahalanobis distance measures the distance of a data point from the centroid of the data distribution,
taking into account the covariance structure of the data.
- Points with a Mahalanobis distance exceeding a certain threshold are considered outliers.
3. **Isolation Forest**:
- Isolation Forest is an ensemble method that constructs isolation trees to isolate outliers.
- Outliers are identified as data points that require fewer splits to isolate them from the rest of the data.
5. **One-Class SVM**:
- One-Class Support Vector Machine (SVM) learns a decision boundary around the majority of the data
points and identifies outliers as points lying outside this boundary.
- It is particularly useful when only normal data is available for training.
6. **Clustering-Based Approaches**:
- Clustering algorithms such as k-means or DBSCAN can be used to cluster the data and identify outliers
as points that do not belong to any cluster or form singleton clusters.
8. **Probabilistic Models**:
- Probabilistic models such as Gaussian Mixture Models (GMMs) can be used to model the distribution
of the data and identify outliers as points with low probability under the model.
When applying these methods to multidimensional data, it's essential to consider the characteristics of
the data, such as its distribution, dimensionality, and the presence of correlation between variables.
Additionally, it is often beneficial to use multiple approaches in combination or to customize the method
to the specific properties of the dataset; a combined sketch of several of these methods is given below.
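A combined sketch applying several of the listed methods to the same toy two-dimensional dataset with NumPy, SciPy, and scikit-learn; all thresholds, contamination rates, and data values are illustrative assumptions:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis
from scipy.stats import chi2
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
# Illustrative 2-D data: correlated normal points plus a few scattered anomalies.
normal = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=300)
anomalies = rng.uniform(-6, 6, size=(10, 2)) + np.array([6, -6])
X = np.vstack([normal, anomalies])

# Mahalanobis distance: flag points far from the centroid, accounting for covariance.
mu, cov_inv = X.mean(axis=0), np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.array([mahalanobis(x, mu, cov_inv) ** 2 for x in X])
maha_flags = d2 > chi2.ppf(0.99, df=X.shape[1])  # chi-square cutoff (assumed threshold)

# Isolation Forest: points isolated with few random splits are labeled -1.
iso_flags = IsolationForest(contamination=0.05, random_state=0).fit_predict(X) == -1

# One-Class SVM: learns a boundary around the bulk of the data.
svm_flags = OneClassSVM(nu=0.05, gamma="scale").fit_predict(X) == -1

# Clustering-based: DBSCAN noise points (label -1) treated as outliers.
db_flags = DBSCAN(eps=0.8, min_samples=5).fit_predict(X) == -1

# Probabilistic: Gaussian mixture; flag the lowest-likelihood points.
scores = GaussianMixture(n_components=1, random_state=0).fit(X).score_samples(X)
gmm_flags = scores < np.quantile(scores, 0.05)

for name, flags in [("Mahalanobis", maha_flags), ("IsolationForest", iso_flags),
                    ("One-Class SVM", svm_flags), ("DBSCAN", db_flags), ("GMM", gmm_flags)]:
    print(f"{name:15s} flagged {flags.sum()} points")
```

Comparing the flag counts across methods on the same data illustrates the earlier point that different techniques make different assumptions, and that combining or cross-checking them is often more reliable than relying on any single method.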