Unit 1 Data Mining
Introduction to Data Mining
Definition:
Data mining is the process of discovering patterns, correlations, anomalies, and insights
from large datasets using various techniques from statistics, machine learning, and database
systems. It involves extracting useful information from raw data to help make informed decisions,
predict future trends, and understand complex phenomena.
Or
Data mining is the process of searching and analyzing a large batch of raw data in order to identify
patterns and extract useful information.
Data mining is widely used across various industries including finance, retail, healthcare,
marketing, and telecommunications to gain insights, improve decision-making processes, and
drive business value. However, it's important to ensure that data mining activities are conducted
ethically and in compliance with privacy regulations to protect individuals' rights and data privacy.
UNIT 1
Divya S R, Assistant Professor, Department of Computer Science,
AES National Degree College, Gauribidanur
3. Data Selection and Transformation: Here, relevant data subsets are selected for analysis
based on the mining goals. The selected data may also undergo transformation to better suit
the mining algorithms.
4. Data Mining Engine: This is the core component where various data mining algorithms are
applied to the prepared data to discover patterns, trends, and insights.
5. Pattern Evaluation: Once patterns are discovered, they need to be evaluated for their
relevance, validity, and usefulness. This step often involves statistical techniques and
domain expertise.
6. Knowledge Presentation: Finally, the discovered knowledge is presented to users in a
comprehensible format, such as reports, visualizations, or dashboards, to aid in decision-making.
Throughout this process, feedback loops may exist where insights gained from the data mining results
inform subsequent data selection, cleaning, or mining steps, creating a continuous improvement cycle.
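As a minimal sketch of the selection and transformation step (item 3), the following Python snippet keeps two attributes from a small, made-up customer table and rescales them to the range [0, 1] using min-max normalization. The records, the field names, and the helper names `select_columns` and `min_max_scale` are assumptions made for illustration; min-max scaling is only one of many possible transformations.

```python
# Hypothetical raw records; the field names are assumptions for illustration.
records = [
    {"id": 1, "age": 20, "income": 20000, "city": "A"},
    {"id": 2, "age": 40, "income": 60000, "city": "B"},
    {"id": 3, "age": 60, "income": 100000, "city": "A"},
]


def select_columns(rows, columns):
    """Data selection: keep only the attributes relevant to the mining goal."""
    return [{c: row[c] for c in columns} for row in rows]


def min_max_scale(rows, column):
    """Data transformation: rescale one numeric column to the range [0, 1]."""
    values = [row[column] for row in rows]
    low, high = min(values), max(values)
    span = high - low
    # If every value is identical, map the whole column to 0.0.
    return [
        {**row, column: (row[column] - low) / span if span else 0.0}
        for row in rows
    ]


selected = select_columns(records, ["age", "income"])
scaled = min_max_scale(min_max_scale(selected, "age"), "income")
print(scaled)
```

After scaling, both attributes lie on the same range, so a distance-based mining algorithm is not dominated by the attribute with the larger raw values.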
decision trees, neural networks, support vector machines, or clustering algorithms. The
selected models are then trained on the prepared data.
7. Model Evaluation: Trained models need to be evaluated to assess their performance and
generalization ability. This involves using evaluation metrics such as accuracy, precision, recall, or
F1-score, and techniques such as cross-validation to ensure robustness.
8. Model Deployment: Once a satisfactory model is obtained, it is deployed into production to make
predictions or generate insights on new, unseen data. This may involve integrating the model into
existing systems or workflows.
9. Monitoring and Maintenance: Deployed models should be regularly monitored to ensure they
continue to perform effectively over time. This may involve monitoring for concept drift (changes in
the underlying data distribution) and updating the model or its parameters as necessary.
Throughout the entire data mining process, it's essential to maintain a clear focus on the business
objectives and involve domain experts at each stage to ensure that the insights gained are relevant and
actionable.
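The evaluation metrics named in the Model Evaluation step can be computed directly from the counts of correct and incorrect predictions. The sketch below is a plain-Python illustration for a binary problem, together with a simple index splitter for k-fold cross-validation. The function names and the example labels are assumptions for illustration, not part of any specific library.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Hypothetical true labels and model predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
folds = list(k_fold_indices(len(y_true), k=4))
print(metrics)
print(folds[0])
```

In cross-validation, the model would be trained on each `train` split and scored on the matching `test` split, and the scores averaged to give a more robust estimate than a single split.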
For example, if we classify data mining systems according to the data model of the database being
mined, we obtain relational, transactional, object-relational, or data warehouse mining systems.
7. Knowledge Presentation: The final stage involves presenting the discovered knowledge in a
meaningful and comprehensible manner to stakeholders. This may include visualizations, reports,
dashboards, or interactive tools that facilitate decision-making and action.
Throughout the KDD process, it is essential to ensure that ethical and legal considerations are
addressed, particularly concerning data privacy, security, and confidentiality. Additionally, iterative
refinement and validation of the results are critical to ensure the reliability and robustness of the
discovered knowledge.
In summary, KDD provides the overarching framework and methodology for discovering
knowledge from data, while data mining specifically refers to the application of algorithms to extract
patterns or insights from data as part of the KDD process.
1. Goal identification: Develop an understanding of the application domain and the relevant prior
knowledge, and identify the goal of the KDD process from the customer's perspective.
2. Creating a target data set: Selecting a data set, or focusing on a subset of variables or data
samples, on which discovery is to be performed.
3. Data cleaning and pre-processing: Basic operations include removing noise if appropriate,
collecting the necessary information to model or account for noise, deciding on
strategies for handling missing data fields, and accounting for time sequence information and
known changes.
4. Data reduction and projection: Finding useful features to represent the data depending on
the purpose of the task. The effective number of variables under consideration may be reduced
through dimensionality reduction or transformation methods, or invariant representations for
the data can be found.
5. Matching process objectives: Matching the goals of the KDD process (step 1) to a particular
data mining method, for example summarization, classification, regression, or clustering.
6. Modelling and exploratory analysis and hypothesis selection: Choosing the data mining
algorithm(s) and selecting the method or methods to be used to search for data patterns. This process
includes deciding which models and parameters may be appropriate (e.g., models for categorical data
differ from models for real-valued vectors) and matching a particular data mining method with the
overall criteria of the KDD process (for example, the end-user might be more interested in
understanding the model than in its predictive capabilities).
7. Data Mining: Searching for patterns of interest in a particular representational form or a set of
such representations, including classification rules or trees, regression, and clustering.
The user can significantly aid the data mining method by carrying out the preceding steps properly.
8. Presentation and evaluation: Interpreting the mined patterns, possibly returning to any of
steps 1 to 7 for additional iterations. This step may also involve visualization
of the extracted patterns and models, or visualization of the data given the extracted models.
9. Taking action on the discovered knowledge: Using the knowledge directly, incorporating
the knowledge into another system for further action, or simply documenting it and
reporting it to stakeholders. This step also includes checking for and resolving potential conflicts
with previously believed (or previously extracted) knowledge.
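The data cleaning and pre-processing step (step 3) can be illustrated with a short Python sketch that treats implausible values as noise and then fills missing fields with the mean of the remaining observations. The sensor readings, the plausible range, and the helper names `remove_noise` and `impute_mean` are assumptions for illustration; mean imputation is only one of several strategies for handling missing data.

```python
from statistics import mean

# Hypothetical sensor readings; None marks a missing field.
readings = [21.0, None, 23.0, 250.0, 22.0, None]


def remove_noise(values, low, high):
    """Treat values outside the plausible range [low, high] as noise.

    Noisy values are replaced by None so that they are handled
    by the same strategy as missing fields.
    """
    return [v if v is not None and low <= v <= high else None for v in values]


def impute_mean(values):
    """Fill missing fields with the mean of the observed values.

    Assumes at least one observed (non-None) value is present.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


denoised = remove_noise(readings, low=-50.0, high=60.0)
cleaned = impute_mean(denoised)
print(cleaned)
```

Removing noise before imputing matters here: if the value 250.0 were left in place, it would inflate the mean used to fill the missing fields.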
It facilitates the automated discovery of hidden patterns as well as the prediction of trends and
behaviors.
It can be introduced into new systems as well as existing platforms.
It is a quick process that makes it easy for new users to analyze enormous amounts of data in a
short time.
The main difference between Knowledge Discovery in Databases (KDD) and Data Mining lies in their
scope and focus:
1. Knowledge Discovery in Databases (KDD):
o KDD is a comprehensive process that encompasses all stages of extracting useful
knowledge from data.
o It includes steps such as data selection, pre-processing, transformation, data mining,
interpretation, and evaluation.
o KDD emphasizes the entire process of knowledge discovery, from understanding the
problem domain to interpreting the results in a meaningful way.
2. Data Mining:
o Data mining is a specific step within the KDD process, focusing on applying algorithms
to extract patterns or knowledge from large datasets.
o It is primarily concerned with the analysis of data to identify meaningful patterns,
trends, or relationships that may not be immediately apparent.
o Data mining techniques include supervised learning, unsupervised learning,
clustering, classification, regression, and association rule mining.
In essence, KDD provides the overarching framework and methodology for discovering knowledge
from data, while data mining specifically refers to the application of algorithms to extract patterns or
insights from data as part of the KDD process.
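The relationship between the two terms can be sketched as a small Python pipeline in which data mining is one function among several KDD stages. The stage names, the toy records, and the frequency-counting "mining" step are assumptions chosen for illustration; a real KDD process would use far richer steps at each stage.

```python
from collections import Counter


def select(rows):
    """Selection: keep only records that carry the attribute of interest."""
    return [row["item"] for row in rows if "item" in row]


def preprocess(items):
    """Pre-processing: normalize case and whitespace, drop empty values."""
    cleaned = [item.strip().lower() for item in items]
    return [item for item in cleaned if item]


def mine(items):
    """Data mining: the one stage that searches for a pattern (frequencies)."""
    return Counter(items)


def interpret(counts, min_count=2):
    """Interpretation and evaluation: keep only sufficiently frequent patterns."""
    return {item: n for item, n in counts.items() if n >= min_count}


def kdd(rows):
    """The whole KDD process: data mining is a single stage inside it."""
    return interpret(mine(preprocess(select(rows))))


# Hypothetical raw records with inconsistent formatting and gaps.
raw = [
    {"item": " Milk "},
    {"item": "bread"},
    {"item": ""},
    {"item": "MILK"},
    {"note": "no item recorded"},
    {"item": "milk"},
]
knowledge = kdd(raw)
print(knowledge)
```

Only `mine` corresponds to data mining in the narrow sense; the surrounding functions show why its result depends on the stages before and after it.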
In summary, DBMS is a software system used to manage databases, while data mining is a process
used to analyse data and extract valuable insights from it. DBMS provides the infrastructure for storing
and managing data, while data mining helps in uncovering patterns and knowledge from the stored data.
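The contrast can be sketched with Python's built-in `sqlite3` module. In the snippet below, the DBMS stores the data and answers a precisely stated query, while a separate piece of analysis code looks for a pattern that nobody asked about explicitly (the pair of items bought together most often). The table name, the rows, and the pair-counting analysis are assumptions for illustration only.

```python
import sqlite3
from collections import Counter
from itertools import combinations

# DBMS role: store and manage the data (an in-memory database here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (basket_id INTEGER, item TEXT)")
rows = [
    (1, "bread"), (1, "butter"),
    (2, "bread"), (2, "butter"), (2, "milk"),
    (3, "milk"), (3, "eggs"),
]
conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)

# DBMS-style question: a precise query with a known answer shape.
bread_baskets = conn.execute(
    "SELECT COUNT(DISTINCT basket_id) FROM purchases WHERE item = 'bread'"
).fetchone()[0]

# Data-mining-style question: which items tend to be bought together?
baskets = {}
for basket_id, item in conn.execute(
    "SELECT basket_id, item FROM purchases ORDER BY basket_id, item"
):
    baskets.setdefault(basket_id, []).append(item)
conn.close()

pair_counts = Counter(
    pair
    for items in baskets.values()
    for pair in combinations(sorted(items), 2)
)
top_pair, top_count = pair_counts.most_common(1)[0]
print(bread_baskets, top_pair, top_count)
```

The query retrieves a fact the user already knew to ask for, whereas the pair count surfaces a relationship that was implicit in the stored data.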
1. Classification: This technique is used to categorize data into predefined classes or labels based on
input features. Examples include decision trees, fuzzy logic, and SVM (Support Vector Machine).
2. Clustering: Clustering involves grouping similar data points together based on their
characteristics or features. Common clustering algorithms include k-means clustering,
hierarchical clustering, and DBSCAN (Density-Based Spatial Clustering of Applications
with Noise).
3. Association Rule Mining: Association rule mining discovers relationships or associations
between variables in large datasets. It is commonly used in market basket analysis to
identify patterns in customer purchasing behavior. The Apriori algorithm is a popular
technique for association rule mining.
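4. Regression Analysis: Regression analysis is used to predict the value of a continuous target
variable based on input features. Techniques include linear regression and polynomial
regression. Logistic regression, despite its name, predicts class probabilities and is
therefore usually treated as a classification technique.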
5. Anomaly Detection: Anomaly detection identifies outliers or anomalies in data that deviate from
normal patterns. It is used for detecting fraud, network intrusions, and equipment failures.
Techniques include statistical methods, clustering-based approaches, and machine learning
algorithms such as isolation forests and one-class SVM.
6. Sequential Pattern Mining: Sequential pattern mining discovers patterns that occur
sequentially or temporally in data. It is used in applications such as analyzing customer
behavior over time or identifying patterns in sequences of events.
Examples include the PrefixSpan algorithm and the GSP (Generalized Sequential Pattern)
algorithm.
7. Text Mining: Text mining techniques extract useful information from unstructured text data. This
includes tasks such as sentiment analysis, topic modeling, named entity recognition, and
document classification. Techniques such as natural language processing (NLP) and
machine learning algorithms are commonly used in text mining.
8. Dimensionality Reduction: Dimensionality reduction techniques aim to reduce the number
of variables in the data while preserving its essential structure. Principal Component
Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Singular
Value Decomposition (SVD) are common dimensionality reduction methods.
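To make the clustering technique (item 2) concrete, here is a minimal sketch of k-means on one-dimensional data in plain Python. The function name `k_means_1d`, the data points, and the initial centroids are assumptions for illustration; practical implementations work in many dimensions and choose initial centroids more carefully.

```python
def k_means_1d(points, centroids, max_iter=100):
    """Run k-means on 1-D points, starting from the given centroids.

    Repeats two steps until the centroids stop changing:
    assign each point to its nearest centroid, then move each
    centroid to the mean of the points assigned to it.
    """
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: group points by nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: an empty cluster keeps its previous centroid.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data with two visible groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(points, centroids=[1.0, 12.0])
print(centroids, clusters)
```

No class labels are supplied to the algorithm; the two groups emerge only from the distances between points, which is what distinguishes clustering from classification.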
These are just a few examples of the many data mining techniques available, each with its strengths,
limitations, and suitable applications. Choosing the appropriate technique depends on factors such as
the nature of the data, the problem domain, and the specific objectives of the analysis.
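As one concrete case, association rule mining (item 3) rests on two measures, support and confidence. The sketch below computes both for a handful of made-up transactions. It is not the full Apriori algorithm, only the measures such algorithms rely on; the transactions and function names are assumptions for illustration.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated probability of the consequent given the antecedent.

    confidence(A -> B) = support(A and B together) / support(A).
    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


s = support({"bread", "butter"}, transactions)
c_butter_bread = confidence({"butter"}, {"bread"}, transactions)
c_bread_butter = confidence({"bread"}, {"butter"}, transactions)
print(s, c_butter_bread, c_bread_butter)
```

The rule "butter implies bread" has higher confidence than "bread implies butter" in this data, which shows that confidence is directional even though support is not.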