DATA MINING
Notes By Jayanth
Units
1) Data Mining
2) Association Rule Mining
3) Classification
4) Clustering and Applications
5) Advanced Concepts
UNIT-1
Data Mining
● Data mining is the process of extracting valuable and actionable insights from large
volumes of data. It involves analyzing and exploring vast datasets to discover patterns,
relationships, and trends that are not immediately apparent.
● Data mining utilizes various techniques, such as statistical analysis, machine learning
algorithms, and pattern recognition, to uncover hidden information and make informed
decisions.
Types of Data
Flat files: Flat files are actually the most common data source for data mining algorithms,
especially at the research level. Flat files are simple data files in text or binary format with a
structure known by the data mining algorithm to be applied. The data in these files can be
transactions, time-series data, scientific measurements, etc.
Data Warehouse:
● A data warehouse is a large centralized repository that consolidates data from various
sources within an organization.
● It is designed to support analytical processing and decision-making.
● Data mining can be performed on data warehouses to extract insights and knowledge.
Transactional Data:
● Transactional data captures records of individual transactions or activities, such as
customer purchases, financial transactions, online interactions, and user behavior.
● Data mining techniques can be applied to transactional data to discover patterns, detect
anomalies, and make predictions. For example, analyzing transactional data can help
identify customer behavior patterns, recommend products, or detect fraudulent activities.
Data mining functionalities are used to specify the kind of patterns to be found in data mining
tasks. In general, data mining tasks can be classified into two categories:
• Descriptive
• Predictive
Descriptive mining tasks involve analyzing the data in a database to understand its general
properties and characteristics. This includes summarizing the data, identifying patterns,
and gaining insights into its distribution and relationships.
Descriptive mining aims to provide a comprehensive overview of the data without making any
predictions or inferences about future outcomes.
Predictive mining tasks, on the other hand, focus on using the current data to make predictions or inferences about future outcomes by applying statistical and machine learning techniques.
1) Concept/class description:
Data can be associated with classes or concepts. Data characterization and data discrimination
are two approaches used in data analysis to understand and describe a given set of data.
Data characterization
● Summarizing the general characteristics or features of a specific class of data, often
referred to as the target class.
● The goal is to provide a concise and informative description of the data, highlighting its
interesting properties.
● For example, in a student dataset, data characterization could involve summarizing the
characteristics of students who have obtained more than 75% in every semester. This
characterization could result in a general profile of such students, outlining their common
attributes.
Data discrimination:
● Comparing the general features of the target class with those of one or more contrasting
classes. The objective is to identify the distinguishing properties or patterns that
differentiate the target class from others.
● Continuing with the student dataset example, data discrimination might involve
comparing the general features of students with high GPAs to those with low GPAs. This
comparison could reveal insights such as "75% of the students with high GPAs are
fourth-year computing science students, while 65% of the students with low GPAs are
not."
Association Mining:
● Association mining discovers relationships or co-occurrence patterns among items in a dataset, typically expressed as rules of the form A ⇒ B (for example, customers who buy bread often also buy milk).
Correlation Mining:
● Correlation mining identifies the statistical relationships or dependencies between
different variables in a dataset.
● It measures how changes in one variable relate to changes in another variable.
● For example, in a marketing campaign dataset, correlation mining can reveal if there is
a correlation between the amount spent on advertising and the increase in sales.
Classification:
● Classification is a predictive analysis technique that assigns categorical labels or classes
to instances based on their features.
● It involves learning from labeled training data to build a model that can classify new,
unseen instances into predefined classes.
● For example, in email spam detection, a classification model can be trained on labeled
emails (spam or not spam) to predict the class of incoming emails.
Regression:
● Regression is a predictive analysis technique that predicts continuous numerical values
based on input features.
● It aims to find a mathematical relationship between the input variables and the output
variable.
● For example, in house price prediction, regression can be used to build a model that
predicts the price of a house based on factors like its area, number of rooms, location,
etc.
4) Cluster analysis
● Cluster analysis is a data mining technique used to group similar data points together
based on their characteristics or attributes.
● It helps in identifying patterns and structures within the data that may not be immediately
apparent.
● Similarity measures, such as distance metrics, are used to determine the similarity
between data points and form clusters.
● Example: Imagine you have a dataset of customer data, including information such as
age, income, and purchasing behavior. Cluster analysis can be used to group customers
with similar characteristics into clusters, such as creating a cluster for young, high-
income customers who frequently make online purchases.
5) Outlier Analysis
Outlier analysis is a data mining technique used to identify and analyze data points that deviate significantly from the normal or expected patterns within a dataset.
6) Correlation analysis
Correlation analysis is a technique used to measure the association between two variables. A correlation coefficient (r) is a statistic used for measuring the strength of a supposed linear association between two variables. Correlations range from -1.0 to +1.0 in value.
A correlation coefficient of 1.0 indicates a perfect positive relationship in which high
values of one variable are related perfectly to high values in the other variable, and
conversely, low values on one variable are perfectly related to low values on the other
variable.
A correlation coefficient of 0.0 indicates no relationship between the two variables. That
is, one cannot use the scores on one variable to tell anything about the scores on the
second variable.
A correlation coefficient of -1.0 indicates a perfect negative relationship in which high values of one variable are related perfectly to low values of the other variable, and conversely, low values of one variable are perfectly related to high values of the other variable.
Interestingness of Patterns
Support: The pattern should occur frequently enough in the dataset to be considered
interesting. High support indicates that the pattern is not an isolated occurrence.
Confidence: The pattern should have a high level of confidence or accuracy, indicating that the
observed relationships are reliable and not due to chance.
Novelty: The pattern should reveal new or previously unknown information, providing insights
that were not evident before.
Actionability: The pattern should be actionable, meaning it can be utilized to make informed
decisions or drive actions that lead to desirable outcomes.
Interpretability: The pattern should be easily interpretable and understandable by domain
experts or end-users.
● The user or analyst needs to define what is considered interesting based on the specific
problem, domain knowledge, and the goals of the data mining task.
Data mining systems can be classified based on various criteria, such as the kinds of databases mined, the kinds of knowledge mined, the techniques utilized, and the applications adapted.
Data mining primitives refer to the fundamental components or elements involved in the data
mining process. These primitives provide the necessary instructions and specifications to
perform effective data mining tasks. They include:
Task-relevant data:
● This primitive focuses on identifying the specific data that will be used for mining.
● It involves selecting the relevant database or data warehouse, specifying conditions to
choose the appropriate data, determining the relevant attributes or dimensions for
exploration, and providing instructions for data ordering or grouping.
Background knowledge:
● This primitive allows users to incorporate their domain knowledge or existing knowledge
about the data being mined.
● Users can provide additional information, rules, or constraints to guide the knowledge
discovery process and evaluate the patterns that are discovered.
● For example, incorporating concept hierarchies or domain-specific rules can assist in
pattern interpretation.
Interestingness measures:
● Users specify functions that help differentiate between uninteresting and valuable patterns.
● Interestingness measures can be based on factors such as simplicity, certainty, utility, novelty, or domain-specific requirements.
Integration of a Data Mining System with a Data Warehouse refers to the process of combining
and utilizing data mining techniques within a data warehouse environment.
There are different architectures for integrating a data mining system with a database or data
warehouse system. Here are the differences between these architectures:
No coupling:
● In this architecture, the data mining system operates independently of the database or
data warehouse system.
● The data mining system obtains the initial data set from flat files or other sources,
without utilizing the functionalities of the database or data warehouse system.
● This architecture is considered a poor design choice since it lacks integration and
misses out on the benefits provided by database systems.
Loose coupling:
● In this architecture, the data mining system is not tightly integrated with the database or
data warehouse system.
● The database or data warehouse system is used as the source of the initial data set for
mining and may be used for storing the results.
● While this architecture allows the data mining system to leverage the flexibility and
efficiency of the database system, it may face scalability and performance challenges,
especially with large datasets.
Semitight coupling:
● This architecture involves implementing some data mining primitives or functions within
the database or data warehouse system.
● Operations like aggregation, sorting, or pre-computation of statistical functions are
efficiently performed in the database or data warehouse system.
● Frequently used intermediate mining results can be pre-computed and stored in the
database or data warehouse, improving the performance of the data mining system.
Tight coupling:
● In this architecture, the database or data warehouse system is fully integrated as part of
the data mining system.
● The data mining sub-system is treated as a functional component of the overall
information system.
● Tight coupling enables optimized data mining query processing, leading to efficient
implementations, high system performance, and an integrated information processing
environment.
Major issues in data mining can be categorized into the following types:
5. Performance Issues:
● Efficiency and scalability of data mining algorithms are crucial for effectively processing large volumes of data.
● Parallelization and distributed algorithms can help improve the performance of data mining tasks.
Data Preprocessing
Data preprocessing refers to the process of transforming raw data into a format that is suitable
for further analysis or processing. It is a common practice in data mining to improve the quality
and usability of the data for users.
Data Quality Improvement: Real-world data is often dirty, meaning it can be incomplete, noisy,
or inconsistent. Preprocessing helps improve data quality, which in turn improves the quality of
mining results. High-quality data is crucial for making reliable and accurate decisions based on
the data.
Handling Incomplete Data: Incomplete data refers to missing attribute values or attributes of
interest. Preprocessing techniques can handle missing values by filling them in based on certain
criteria or imputing them using statistical methods.
Dealing with Noisy Data: Noisy data contains errors or outliers, which can negatively impact
the accuracy of mining results. Preprocessing methods can identify and handle noisy data by
smoothing or removing outliers to ensure the data is more reliable.
Resolving Inconsistent Data: Inconsistent data arises when there are discrepancies in codes,
names, or values. Data preprocessing techniques can detect and resolve such inconsistencies,
ensuring data integrity and consistency.
Tasks involved in data preprocessing:
Data Cleaning:
● Handling missing values: Dealing with cases where some data points have no values by
filling them in or removing them.
● Smoothing noisy data: Removing or reducing random errors or outliers in the data.
● Removing outliers: Identifying and eliminating data points that significantly deviate from
the overall pattern.
● Resolving inconsistencies: Correcting discrepancies or conflicts in codes, names, or
values across the data.
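To make the cleaning steps above concrete, here is a minimal Python sketch; the age values, the mean-imputation strategy, and the outlier threshold are illustrative choices, not part of the original notes:

```python
# Minimal data-cleaning sketch (hypothetical ages; None marks a missing value).
ages = [23, 25, None, 27, 24, 150, None, 26]

# 1) Handle missing values: fill them with the mean of the observed values.
observed = [a for a in ages if a is not None]
mean_age = sum(observed) / len(observed)
filled = [a if a is not None else mean_age for a in ages]

# 2) Smooth/remove outliers: drop values far from the mean (threshold is illustrative).
cleaned = [a for a in filled if abs(a - mean_age) <= 50]

print("filled :", filled)
print("cleaned:", cleaned)
```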
Data Integration:
● Combining data from multiple sources: Bringing together data from different databases,
files, or data cubes into a single, unified format for analysis.
Data Transformation:
● Normalizing data: Scaling the values of different attributes to a common range, ensuring
they are on the same scale for accurate analysis.
● Aggregating data: Summarizing or grouping data to a higher level of abstraction, such as
calculating averages or totals.
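A minimal sketch of min-max normalization, one common transformation step mentioned above; the price values and the target range [0, 1] are illustrative assumptions:

```python
# Min-max normalization: rescale values to the range [0, 1].
prices = [120.0, 250.0, 90.0, 310.0, 175.0]  # illustrative values

lo, hi = min(prices), max(prices)
normalized = [(p - lo) / (hi - lo) for p in prices]

print(normalized)  # 90 -> 0.0, 310 -> 1.0, all others in between
```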
Data Reduction:
● Reducing data volume: Applying techniques to reduce the size of the dataset without
losing essential information.
● Preserving important information: Ensuring that the reduced dataset still retains key
patterns, trends, or characteristics present in the original data.
Data Discretization:
● Converting continuous numerical data into categories or intervals: Grouping numerical
data into discrete ranges or classes, making it suitable for certain types of analysis or
algorithms that require categorical input.
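A small sketch of discretization by binning; the age values and bin boundaries are illustrative choices, not prescribed by the notes:

```python
# Discretization: map continuous ages into illustrative categories (bins).
ages = [15, 22, 37, 45, 63, 70, 29, 51]

def discretize(age):
    if age < 30:
        return "young"
    elif age < 55:
        return "middle-aged"
    else:
        return "senior"

print([(a, discretize(a)) for a in ages])
```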
UNIT-2
Association Rule Mining, also known as frequent pattern mining, aims to discover relationships,
associations, or correlations among items or objects in large databases. It involves finding
frequent patterns or itemsets that occur frequently in a given dataset, satisfying a minimum
support and confidence threshold.
Transaction (T):
A transaction, denoted as T, is a subset of the item set I. It represents a single occurrence or
instance of a collection of items that are associated with each other. In other words, a
transaction is a record or entry in the dataset that contains a set of items.
For example, let's consider a dataset containing information about customer purchases
in a supermarket. The set of items (I) could include various products like milk, bread,
eggs, and so on. The dataset (D) would consist of multiple transactions, each
representing a specific customer's purchase. Each transaction (T) would be a set of
items, such as {milk, bread}, {eggs, bread}, {milk, eggs, bread}, and so on. The
transaction identifier (TID) would provide a unique identifier for each transaction,
allowing us to distinguish and refer to specific purchases.
Support:
Support measures the frequency or prevalence of an itemset in a dataset. It indicates the
proportion or percentage of transactions that contain a specific itemset. A higher support value
indicates that the itemset is more frequently occurring in the dataset.
Mathematically, support (s) is calculated as the number of transactions containing the itemset
divided by the total number of transactions in the dataset. It can also be represented as a
percentage.
Confidence:
Confidence measures the reliability or strength of an association rule. It indicates the conditional
probability of the consequent (B) given the antecedent (A). In other words, it measures how
often the items in B appear in transactions that already contain A.
Mathematically, confidence (c) is calculated as the number of transactions containing both A
and B divided by the number of transactions containing A. It can also be represented as a
percentage.
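The following sketch computes support and confidence for a hypothetical rule {milk} ⇒ {bread} over a tiny, made-up transaction set:

```python
# Support and confidence for the rule {milk} -> {bread} over illustrative transactions.
transactions = [
    {"milk", "bread"},
    {"eggs", "bread"},
    {"milk", "eggs", "bread"},
    {"milk", "eggs"},
]

A, B = {"milk"}, {"bread"}

n = len(transactions)
count_A = sum(1 for t in transactions if A <= t)         # transactions containing A
count_AB = sum(1 for t in transactions if (A | B) <= t)  # transactions containing A and B

support = count_AB / n            # fraction of all transactions containing both A and B
confidence = count_AB / count_A   # of the transactions containing A, how many also contain B

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```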
Mining Methods
Level-wise search: The Apriori algorithm employs a level-wise search approach, where it
iteratively explores higher-level itemsets based on the frequent itemsets discovered in the
previous iteration. It starts with finding frequent 1-itemsets and then uses them to find frequent
2-itemsets, which are used to find frequent 3-itemsets, and so on.
Finding frequent 1-itemsets: In the first iteration, the algorithm scans the entire database to
count the occurrences of each item and identifies the items that satisfy a minimum support
threshold. Support is a measure of how frequently an itemset appears in the dataset. The set of
frequent 1-itemsets is denoted as L1.
Generating frequent k-itemsets: The frequent 1-itemsets found in the previous step are used
as a basis to generate candidate itemsets of size k (k > 1). These candidate itemsets are
generated by combining frequent (k-1)-itemsets that share the same prefix. For example, if {A,
B} and {A, C} are frequent 2-itemsets, their combination {A, B, C} is a candidate 3-itemset.
Scanning and pruning: After generating the candidate itemsets of size k, the algorithm scans
the database once again to count their occurrences. Any candidate itemset that does not meet
the minimum support threshold is pruned, as it cannot be a frequent itemset. The remaining
frequent k-itemsets are added to the set Lk.
Iteration until no more frequent itemsets: Steps 3 and 4 are repeated iteratively until no more
frequent k-itemsets can be found. At each iteration, the algorithm generates candidate itemsets,
scans the database, prunes non-frequent itemsets, and adds the frequent itemsets to Lk.
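A compact Python sketch of this level-wise Apriori search; the transaction set and minimum support count are illustrative, and the code is a teaching sketch rather than a reference implementation:

```python
from itertools import combinations

# Transactions and the minimum support count are illustrative.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C", "D"},
]
min_support = 3  # an itemset must appear in at least 3 transactions

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Step 1: frequent 1-itemsets (L1).
items = {i for t in transactions for i in t}
Lk = {frozenset([i]) for i in items if support_count(frozenset([i])) >= min_support}
frequent = set(Lk)

k = 2
while Lk:
    # Candidate generation: join frequent (k-1)-itemsets whose union has size k.
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
    # Prune candidates having an infrequent (k-1)-subset, then check support (one scan).
    Lk = {c for c in candidates
          if all(frozenset(s) in frequent for s in combinations(c, k - 1))
          and support_count(c) >= min_support}
    frequent |= Lk
    k += 1

for itemset in sorted(frequent, key=len):
    print(sorted(itemset), "support =", support_count(itemset))
```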
Improving the Efficiency of Apriori
Hash-based itemset counting: Use hashing techniques to efficiently count the occurrences of itemsets. This allows for early elimination of itemsets that do not meet the support threshold, reducing the number of itemsets that need to be considered.
Transaction reduction: Eliminate transactions that do not contain any frequent k-itemsets.
These transactions do not contribute to the discovery of frequent itemsets and can be ignored in
subsequent scans, reducing the amount of data to process.
Partitioning: Divide the database into partitions and determine the frequent itemsets separately
for each partition. This helps to identify potentially frequent itemsets that are frequent in at least
one partition, reducing the search space.
Sampling: Perform mining on a subset of the given data instead of the entire dataset. By using
a lower support threshold and employing methods to ensure the completeness of the results,
sampling can provide approximate frequent itemsets while reducing the computational cost.
Dynamic itemset counting: Add new candidate itemsets only when all of their subsets are
estimated to be frequent. This avoids generating candidate itemsets that are unlikely to be
frequent, reducing the number of candidates to be considered.
FP-GROWTH Algorithm
The FP-Growth algorithm is a frequent pattern mining algorithm that efficiently discovers
frequent itemsets in a dataset. It avoids the need for generating candidate itemsets like the
Apriori algorithm by using a data structure called the FP-Tree.
Overall, the FP-Growth algorithm is an efficient and effective method for mining frequent
itemsets, allowing for valuable insights and pattern discovery in large datasets.
Benefits of the FP-Tree
Completeness:
The FP-Tree structure ensures completeness by preserving the complete information for
frequent pattern mining.
It never breaks a long pattern of any transaction, meaning it retains the sequential order of items
in transactions without any loss of information.
This completeness allows for accurate analysis and discovery of frequent patterns in the
dataset.
Compactness:
The FP-Tree structure helps in reducing irrelevant information by eliminating infrequent items
from the tree.
Infrequent items are pruned from the tree, resulting in a compact representation of the dataset.
This compactness reduces the memory space required to store the dataset and subsequent
mining operations.
Size Efficiency:
The FP-Tree structure is typically smaller in size compared to the original database.
The tree structure itself, excluding node-links and counts, never exceeds the size of the original
database.
This size efficiency reduces memory consumption and speeds up the mining process, especially
for large datasets.
Mining various kinds of association rules refers to the process of discovering different types of
relationships and patterns within a dataset. Association rule mining aims to find associations,
correlations, or dependencies among items or attributes in the data.
The different kinds of association rules that can be mined include:
Multidimensional association rules involve more than one dimension or attribute (for example, items, region, and time).
● For instance, they may uncover relationships between specific products sold in a particular region during a specific period.
● This method enables a comprehensive understanding of complex relationships across various dimensions.
Correlation Analysis
The Pearson correlation coefficient (r) between two variables X and Y is calculated as:
r = Σ(X − X̄)(Y − Ȳ) / √[ Σ(X − X̄)² · Σ(Y − Ȳ)² ]
Where:
● X and Y are the values of the two variables being analyzed.
● X̄ and Ȳ are the means of X and Y, respectively.
● Σ represents the summation of values across all observations.
A positive value indicates a positive linear relationship, meaning that as one variable increases,
the other tends to increase as well.
A negative value indicates a negative linear relationship, where as one variable increases, the
other tends to decrease.
A value of 0 suggests no linear relationship between the variables.
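A small Python sketch that computes the Pearson correlation coefficient from the formula above; the advertising/sales figures are illustrative:

```python
from math import sqrt

# Pearson correlation coefficient r for two illustrative variables:
# advertising spend (X) and sales (Y).
X = [10, 12, 15, 18, 20]
Y = [100, 110, 130, 150, 158]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
var_x = sum((x - mean_x) ** 2 for x in X)
var_y = sum((y - mean_y) ** 2 for y in Y)

r = cov / sqrt(var_x * var_y)
print(f"r = {r:.3f}")  # close to +1: strong positive linear relationship
```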
Constraint-based association mining allows users to guide the mining process by specifying different kinds of constraints:
Knowledge Type Constraints:
● Knowledge type refers to the type of prior knowledge or domain expertise that is used to
guide the association mining process.
● Example: If we are mining associations in a healthcare dataset, the knowledge type
could include medical domain knowledge about symptoms, diseases, and treatments.
Data Constraints:
● Data constraints are conditions or restrictions applied to the dataset to filter or focus the
association mining process.
● Example: We might apply a data constraint to consider only transactions made within a
specific time period, such as the last month or year.
Dimension/Level Constraints:
● Dimension/level constraints specify which dimensions (attributes) of the data, or which levels of the concept hierarchies, are to be considered, often involving multiple dimensions simultaneously.
● Example: In a retail dataset, we can apply a dimension/level constraint to find associations between products, customer demographics, and geographic locations. This helps identify specific patterns for different customer segments and regions.
Interestingness Constraints:
● Interestingness constraints specify thresholds on statistical measures, such as minimum support, minimum confidence, or correlation, that discovered rules must satisfy.
● Example: We might require that reported rules have a support of at least 5% and a confidence of at least 70%.
Rule Constraints:
● Rule constraints specify additional conditions or requirements that association rules
must satisfy.
● Example: We can set a rule constraint that an association rule should have a certain
item in the antecedent (left-hand side) or consequent (right-hand side). For example, we
might require an association rule to include the item "milk" in the consequent.
Graph Pattern Mining in Data Mining (DM) refers to the process of discovering meaningful
patterns or relationships within graph-structured data. Graphs consist of nodes (vertices) and
edges that represent relationships or connections between the nodes. Graph pattern mining
aims to uncover frequent subgraphs or graph patterns that occur frequently in a given dataset.
Apriori-based Approach:
The Apriori-based approach for graph pattern mining is inspired by the traditional Apriori
algorithm used for association rule mining. It involves a level-wise search strategy where
frequent subgraphs of increasing size are generated and evaluated. The algorithm starts by
identifying frequent individual nodes and edges in the graph, and then uses these frequent
subgraphs to generate larger subgraphs. The process continues iteratively until no more
frequent subgraphs can be found. The support threshold is used to determine the minimum
occurrence frequency required for a subgraph to be considered frequent.
Applications:
Graph pattern mining is applied in social network analysis, biological network analysis, web mining, and fraud detection to uncover important patterns and relationships within complex networks.
Sequential Pattern Mining (SPM) is a data mining technique that focuses on discovering
sequential patterns or temporal relationships in sequential data. It involves analyzing sequences
of events or items over time to identify frequent patterns that occur in a specific order. Here are
the key points to understand about Sequential Pattern Mining:
Frequent Sequential Patterns: SPM aims to find frequent sequential patterns, which are the
patterns that occur frequently above a predefined support threshold. These patterns provide
insights into the regularities or recurring sequences in the data.
Algorithms: There are different algorithms for sequential pattern mining, such as the Apriori-
based algorithm and the PrefixSpan algorithm. These algorithms employ different strategies to
efficiently discover frequent patterns from large sequential datasets.
Example: Let's consider a retail dataset with customer transaction sequences. Each transaction
sequence represents the items purchased by a customer over time. A frequent sequential
pattern could be {A, B, C}, indicating that customers often buy items A, B, and C in that order.
This information can be valuable for market basket analysis and personalized
recommendations.
Applications: Sequential Pattern Mining finds applications in various domains. For instance, it
is used in market basket analysis to understand customer purchasing behavior, in web usage
mining to analyze user navigation patterns, in healthcare for analyzing patient treatment
sequences, and in manufacturing for optimizing process workflows.
UNIT-3
Classification: Classification is a supervised learning task that assigns categorical class labels to new instances based on a model learned from labeled training data. For example, classifying a loan applicant as "safe" or "risky" based on historical applicant data.
Prediction: Prediction, also known as regression, aims to estimate or predict a numerical value
or a continuous outcome based on input variables or features. It is also a supervised learning
task where the model learns from labeled data to predict a numerical value for new instances.
For example, predicting housing prices based on factors like location, size, and number of
rooms.
Applications: Classification and prediction have numerous applications in various fields. They
are used in customer segmentation for targeted marketing, credit scoring for assessing
creditworthiness, disease diagnosis based on patient symptoms, stock market forecasting, and
many other domains where making predictions or classifications is crucial for decision-making.
Decision tree induction is a popular data mining technique that involves constructing a tree-like
model to make decisions or predictions based on input data. Each attribute of a decision tree
serves a specific purpose in the process. Here's an explanation of each attribute:
Root Node: The topmost node of the decision tree represents the entire dataset. It is the
starting point for making decisions and contains the attribute that best splits the data based on
certain criteria.
Internal Nodes: Internal nodes represent decision points in the tree where different attributes
are evaluated to determine the next step. Each internal node contains an attribute and
corresponding splitting criteria.
Branches: Branches emanating from internal nodes represent the possible outcomes or values
of the attribute being evaluated. They lead to subsequent internal nodes or leaf nodes.
Leaf Nodes: Leaf nodes represent the final decision or prediction made by the decision tree.
They do not contain any further attributes. Instead, they indicate the class label or outcome
associated with a specific combination of attribute values.
Information Gain: Information Gain is a measure that quantifies the amount of information
gained by splitting the data based on a particular attribute. It is based on the concept of entropy,
which represents the impurity or disorder in the dataset. The attribute with the highest
Information Gain is selected as the splitting attribute at each node.
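A short sketch of computing entropy and Information Gain for a candidate splitting attribute; the (Outlook, PlayTennis) records are illustrative:

```python
from math import log2
from collections import Counter

# Illustrative (Outlook, PlayTennis) records for computing Information Gain.
data = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"),
        ("Rain", "Yes"), ("Rain", "Yes"), ("Rain", "No"),
        ("Overcast", "Yes"), ("Sunny", "Yes")]

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

base = entropy([cls for _, cls in data])  # impurity before the split

# Weighted entropy after splitting on the attribute (Outlook here).
after = 0.0
for value in {v for v, _ in data}:
    subset = [cls for attr, cls in data if attr == value]
    after += (len(subset) / len(data)) * entropy(subset)

print("Information Gain =", round(base - after, 3))
```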
Bayesian classification
Bayesian classification is a probabilistic approach used in data mining and machine learning for
classifying instances based on their probabilities of belonging to different classes. Here's an
explanation of Bayesian classification in simple points with an example:
Bayesian approach: Bayesian classification is based on Bayes' theorem, which calculates the
probability of a hypothesis given the observed evidence. It uses prior knowledge and updates it
with new evidence to make predictions.
Example: Suppose we have a dataset of emails labeled as "spam" or "non-spam" along with
their attributes like the presence of certain keywords, length of the email, and the number of
exclamation marks. Bayesian classification can be used to classify new emails as spam or non-
spam based on the probabilities of these attributes given each class. For example, if the
probability of an email being spam given the presence of specific keywords is higher than the
probability of it being non-spam, the classifier will classify it as spam.
Naive Bayes Classifier: The Naive Bayes classifier is a machine learning algorithm used for
classification tasks.
Independence Assumption: The classifier assumes that the features or attributes used for
classification are independent of each other. This means that the presence or absence of one
feature does not affect the presence or absence of another feature.
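A small sketch of the Bayes' theorem calculation behind the email example; all counts are made up for illustration:

```python
# Bayes' theorem applied to the email example above (illustrative counts).
total_emails = 100
spam = 30                  # emails labeled spam
keyword_in_spam = 24       # spam emails containing a specific keyword
keyword_in_nonspam = 7     # non-spam emails containing the keyword

p_spam = spam / total_emails
p_nonspam = 1 - p_spam
p_kw_given_spam = keyword_in_spam / spam
p_kw_given_nonspam = keyword_in_nonspam / (total_emails - spam)

# P(spam | keyword) = P(keyword | spam) * P(spam) / P(keyword)
p_kw = p_kw_given_spam * p_spam + p_kw_given_nonspam * p_nonspam
p_spam_given_kw = p_kw_given_spam * p_spam / p_kw

print(f"P(spam | keyword) = {p_spam_given_kw:.2f}")
# If this exceeds P(non-spam | keyword), the email is classified as spam.
```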
● The Rule-Based Classifier is a classification algorithm that makes use of IF-THEN rules
to predict the class of new instances. The rules are structured as IF conditions are met,
THEN predict a certain class.
● Rules are represented as IF-THEN statements, where the IF part specifies the
conditions or criteria based on the input features, and the THEN part indicates the class
or category to which the instance belongs.
● Rule-Based Classifier uses predefined rules to classify instances. It generates rules from
training data, evaluates their quality, selects the most relevant rules, and applies them to
classify new instances
In this example, we have a rule that predicts whether a person will play tennis based on the outlook and temperature. The IF part specifies the conditions, and the THEN part indicates the predicted class:
Rule: IF (Outlook = Sunny) AND (Temperature = Hot) THEN PlayTennis = Yes
Now consider a new instance to classify:
Outlook = Sunny
Temperature = Hot
Humidity = High
Wind = Weak
To classify this instance, we check if the conditions of any of the rules are satisfied. In this case,
the conditions of our example rule are met (Outlook = Sunny and Temperature = Hot).
Therefore, we predict that the person will play tennis (PlayTennis = Yes)
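The same rule can be written directly as code; this is only an illustration of how a rule-based classifier applies an IF-THEN rule, with attribute values taken from the example above:

```python
# The example rule expressed directly in code.
def play_tennis_rule(instance):
    # IF (Outlook = Sunny) AND (Temperature = Hot) THEN PlayTennis = Yes
    if instance["Outlook"] == "Sunny" and instance["Temperature"] == "Hot":
        return "Yes"
    return "Unknown"  # no rule fired; a default class could be used instead

instance = {"Outlook": "Sunny", "Temperature": "Hot",
            "Humidity": "High", "Wind": "Weak"}
print("PlayTennis =", play_tennis_rule(instance))  # PlayTennis = Yes
```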
Lazy learners
Lazy learners, also known as instance-based learners or memory-based learners, are a type of
machine learning algorithm used in data mining. Unlike eager learners that build a model during
the training phase, lazy learners do not construct a specific model. Instead, they memorize the
training instances and make predictions based on the stored instances at the time of testing.
Here are some key points to understand lazy learners:
● Training is very fast because no explicit model is built; the computational cost is shifted to prediction time, when the stored instances must be searched.
● The k-nearest neighbor (k-NN) classifier is the most common lazy learner: a new instance is assigned the majority class among its k closest training instances, measured with a distance metric such as Euclidean distance.
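A minimal pure-Python sketch of the k-NN idea; the training points and the choice k = 3 are illustrative:

```python
from collections import Counter
from math import dist

# A tiny k-nearest-neighbor (lazy) classifier with illustrative 2-D training points.
training = [((1.0, 1.2), "A"), ((1.5, 1.8), "A"),
            ((5.0, 5.2), "B"), ((5.5, 4.8), "B")]

def knn_predict(query, k=3):
    # "Training" is just storing the instances; all work happens at query time.
    neighbors = sorted(training, key=lambda p: dist(p[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

print(knn_predict((1.2, 1.0)))  # -> "A"
```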
UNIT-4
Cluster analysis, also known as clustering, is a technique in data mining used to group similar
objects together based on their inherent characteristics or patterns. It aims to discover natural
groupings within a dataset without any prior knowledge of the class labels or target variables.
Clustering Scalability: Cluster analysis algorithms should be able to handle large datasets
efficiently. Scalability refers to the ability to process and analyze increasingly larger amounts of
data without a significant increase in computational time and resources.
Algorithm Usability with Multiple Types of Data: Clustering algorithms should be applicable
to various types of data, including numerical, categorical, or mixed data. Different algorithms are
designed to handle different types of data, ensuring versatility in accommodating diverse
datasets.
Dealing with Unstructured Data: Cluster analysis is useful for discovering patterns and
structures within unstructured data. Unstructured data refers to information that does not have a
predefined data model or organization. Clustering algorithms can help identify hidden
relationships and groupings in such data.
Interoperability: Cluster analysis algorithms should be interoperable with other data mining
techniques and tools. They should be able to integrate seamlessly into the data analysis
workflow, allowing for the combination of clustering results with other analyses and
visualizations.
Data Structures:
Data Matrix:
● A data matrix is a structured representation of the data used in cluster analysis. It is a
table-like structure where rows represent objects (e.g., observations, samples) and
columns represent variables (e.g., features, attributes). Each cell in the matrix holds the
value of a specific variable for a particular object.
● For example, let's consider a dataset of houses where each row represents a house and
each column represents a variable such as price, size, number of rooms, etc. The data
matrix would have houses as rows and variables as columns, with each cell containing
the corresponding values.
● Data matrices are commonly used in clustering algorithms that rely on the values of
variables to determine similarities or dissimilarities between objects.
Dissimilarity Matrix:
● A dissimilarity matrix, also known as a distance matrix, represents the dissimilarities or
distances between pairs of objects in a dataset. Instead of directly providing the values
of variables, it focuses on capturing the dissimilarity or distance measures between
objects.
● In a dissimilarity matrix, each row and column represent an object, and the cell values
represent the dissimilarity or distance between the corresponding pair of objects. The
values in the dissimilarity matrix are typically calculated using distance metrics such as
Euclidean distance, Manhattan distance, or correlation distance.
● For instance, if we have a set of images and want to cluster them based on visual
similarity, we can create a dissimilarity matrix by calculating the distances between pairs
of images using image comparison techniques.
● Dissimilarity matrices are commonly used in hierarchical clustering and density-based
clustering algorithms, where the focus is on measuring the dissimilarities between
objects rather than analyzing the specific values of variables.
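A small sketch that builds a Euclidean dissimilarity (distance) matrix for a few illustrative 2-D objects:

```python
from math import dist

# Dissimilarity (Euclidean distance) matrix for three illustrative objects.
objects = [(2.0, 3.0), (5.0, 7.0), (1.0, 1.0)]

n = len(objects)
dissimilarity = [[dist(objects[i], objects[j]) for j in range(n)] for i in range(n)]

for row in dissimilarity:
    print([round(d, 2) for d in row])
# The matrix is symmetric with zeros on the diagonal.
```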
Binary Variables:
● Symmetric Binary Variables: Symmetric binary variables are categorical variables that can take on only two distinct values, typically represented as 0 and 1. The two values have equal importance and are interchangeable. Examples of symmetric binary variables include gender (male/female), presence/absence of a certain characteristic, or yes/no responses.
● Asymmetric Binary Variables: Asymmetric binary variables are binary variables whose two outcomes are not equally important, such as the positive and negative results of a disease test. Conventionally, the rarer or more important outcome is coded as 1 and the other as 0.
Categorical Variables:
● Nominal Variables: Nominal variables are categorical variables without any inherent
order or ranking. They represent distinct categories that are not numerically related.
Examples of nominal variables include colors (red, blue, green), different types of fruits,
or categories of products.
● Ordinal Variables: Ordinal variables are categorical variables with an inherent order or
ranking between categories. The categories have a meaningful relationship in terms of
their order but not necessarily in terms of the exact numerical difference between them.
Examples of ordinal variables include ratings (e.g., low, medium, high), educational
levels (e.g., elementary, middle school, high school), or levels of satisfaction (e.g., very
dissatisfied, dissatisfied, neutral, satisfied, very satisfied).
Mixed Variables: Mixed variables refer to datasets that include a combination of different types
of variables, such as a mix of interval-scaled, binary, and categorical variables. In many real-
world datasets, it is common to have variables of different types. For example, a dataset
containing information about customers may include age (interval-scaled), gender (binary), and
occupation (categorical).
Partitioning Methods
Partitioning methods are a class of clustering algorithms used in data analysis to group objects
into distinct partitions or clusters. These algorithms aim to find an optimal partitioning of the
data, where objects within each cluster are more similar to each other than to objects in other
clusters. Here's a simple explanation of partitioning methods:
K-means Clustering: K-means is one of the most widely used partitioning methods. It aims to
partition the data into K clusters, where K is a predefined number chosen by the user. The
algorithm starts by randomly selecting K initial cluster centroids and then iteratively assigns
each object to the nearest centroid and recalculates the centroids based on the mean values of
the assigned objects. This process continues until convergence, where the assignment of
objects to clusters remains unchanged.
For example, if we have a dataset of customer information and want to cluster them into three
groups based on their purchasing behavior, we can use K-means clustering to find three distinct
clusters of customers.
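A short K-means sketch; it assumes scikit-learn is available (the library is not mentioned in the notes) and uses made-up customer data with K = 3:

```python
from sklearn.cluster import KMeans
import numpy as np

# Illustrative customer data: (age, annual spend in $1000s).
X = np.array([[22, 8], [25, 9], [24, 7],
              [45, 30], [48, 28], [50, 33],
              [33, 18], [35, 20]])

# Partition the customers into K = 3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("labels   :", kmeans.labels_)           # cluster index assigned to each customer
print("centroids:", kmeans.cluster_centers_)  # mean point of each cluster
```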
Hierarchical clustering
Hierarchical clustering is a class of clustering algorithms that organizes data objects into a
hierarchical structure, often represented as a tree-like structure called a dendrogram. This
method creates a hierarchical sequence of nested clusters, where clusters at higher levels
encompass smaller clusters at lower levels. There are two main types of hierarchical methods:
agglomerative and divisive.
Agglomerative clustering starts with each data point as an individual cluster and iteratively
merges the closest pairs of clusters until a single cluster containing all the data points is formed.
For example, let's say we have a dataset of animals, and we want to cluster them based on
their characteristics. Agglomerative hierarchical clustering would start by considering each
animal as a separate cluster, and then it would progressively merge the closest pairs of clusters,
grouping similar animals together at each step.
Agglomerative clustering commonly measures the distance between clusters in one of three ways (linkage methods):
Single Linkage:
● Single linkage, also known as the nearest neighbor linkage, measures the similarity or
distance between two clusters by considering the shortest distance between any pair of
objects belonging to the two clusters. In other words, it looks at the closest neighboring
points between clusters. The distance between clusters is defined as the minimum
distance between any two points from different clusters.
● For example, if we have three clusters A, B, and C, and we are using single linkage, the
distance between clusters A and B would be the shortest distance between any point in
A and any point in B.
Complete Linkage:
● Complete linkage, also known as the farthest neighbor linkage, measures the similarity
or distance between two clusters by considering the maximum distance between any
pair of objects belonging to the two clusters. It looks at the farthest neighboring points
between clusters. The distance between clusters is defined as the maximum distance
between any two points from different clusters.
● For example, if we have three clusters A, B, and C, and we are using complete linkage,
the distance between clusters A and B would be the maximum distance between any
point in A and any point in B.
Average Linkage:
● Average linkage measures the similarity or distance between two clusters by considering
the average distance between all pairs of objects belonging to the two clusters. It takes
into account the distances between all points from different clusters and calculates their
average. The distance between clusters is defined as the average distance between any
two points from different clusters.
● For example, if we have three clusters A, B, and C, and we are using average linkage,
the distance between clusters A and B would be the average distance between all points
in A and all points in B.
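A brief sketch comparing the three linkage methods; it assumes SciPy's hierarchical-clustering routines (not mentioned in the notes) and uses illustrative 2-D points:

```python
from scipy.cluster.hierarchy import linkage, fcluster
import numpy as np

# Illustrative 2-D points; compare single, complete, and average linkage.
X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.2, 5.1], [9.0, 1.0]])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                     # agglomerative merge history
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
    print(method, "->", labels)
```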
Divisive clustering takes the opposite approach of agglomerative clustering. It starts with a
single cluster containing all data points and recursively splits it into smaller clusters until each
data point becomes a separate cluster.
For instance, consider a dataset of plants, and we want to cluster them based on their
botanical features. Divisive hierarchical clustering would begin with a single cluster containing
all plants and then successively split it into subclusters, resulting in a hierarchical structure of
smaller and more specific clusters.
Density-Based Methods
Density-based methods are a class of clustering algorithms that group data points based on
their density within the dataset. One popular density-based algorithm is DBSCAN (Density-
Based Spatial Clustering of Applications with Noise). DBSCAN is particularly effective at
identifying clusters of arbitrary shape and handling noise or outliers. Let's delve into DBSCAN
and its inputs and types of data points:
DBSCAN Inputs:
● Dataset: The input to DBSCAN is a dataset consisting of data points. Each data point is
represented by its feature values or coordinates in a multi-dimensional space.
● Epsilon (ε): Epsilon is a parameter in DBSCAN that defines the maximum distance
between two data points for them to be considered as neighbors. It determines the
neighborhood size around each data point.
● Minimum Points (MinPts): MinPts is another parameter that specifies the minimum
number of data points required within the ε-neighborhood for a point to be considered a
core point. Core points play a crucial role in forming clusters.
Types of data points:
● Core Points: Core points are data points within the dataset that have a sufficient
number of neighboring points within the ε-neighborhood (specified by MinPts). These
points are considered as central to their respective clusters.
● Border Points: Border points are data points that have fewer neighboring points than
the required MinPts within the ε-neighborhood. They are not dense enough to be core
points but are within the neighborhood of a core point. Border points can be part of a
cluster but are less central than core points.
● Noise Points: Noise points, also known as outliers, are data points that have fewer
neighboring points than the required MinPts within the ε-neighborhood and are not within
the neighborhood of any core point. They are considered as noise or non-clustered
points.
DBSCAN Algorithm:
● Start with an arbitrary unvisited point and retrieve its ε-neighborhood; if it contains at least MinPts points, the point is a core point and a new cluster is started.
● The cluster is expanded by adding all points that are density-reachable from its core points (points within ε of a core point, and recursively their neighborhoods if they are also core points).
● Points that are not density-reachable from any core point are labeled as noise.
● The process repeats with the next unvisited point until all points have been visited.
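A minimal DBSCAN sketch; it assumes scikit-learn (not mentioned in the notes), and the eps and MinPts (min_samples) values are illustrative:

```python
from sklearn.cluster import DBSCAN
import numpy as np

# Two dense groups plus one isolated point; eps and min_samples are illustrative.
X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1],
              [9.0, 9.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)  # cluster indices per point; -1 marks noise (the isolated point)
```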
Grid-Based Methods
Grid-based methods are a class of clustering algorithms that partition the data space into a grid
or a set of cells. These methods are efficient for handling large datasets by reducing the
computational complexity of clustering. One popular grid-based algorithm is STING (Statistical
Information Grid).
Grid Construction:
● The algorithm starts by dividing the entire data space into a rectangular grid of cells.
● The number and size of the cells can be pre-defined based on the characteristics of the
dataset or adaptively determined.
● Each cell in the grid represents a spatial region in the data space.
Statistical Information:
● For each cell, STING computes statistical information (e.g., mean, standard deviation)
about the data objects contained within that cell.
● The statistical information provides a summary of the data distribution within each cell.
Hierarchical Structure:
● STING constructs a hierarchical structure by recursively partitioning cells into smaller
subcells.
● The partitioning is based on statistical measures such as variance or entropy, aiming to
maximize the homogeneity of the objects within each cell.
● This process continues until a stopping criterion is met, such as reaching a minimum cell
size or a desired level of clustering granularity.
Cluster Extraction:
● At each level of the hierarchy, STING identifies clusters by analyzing the statistical
properties of the cells.
● Clusters can be defined based on thresholds or statistical tests applied to the information
of the cells.
● The hierarchical structure of the grid allows for different levels of clustering granularity,
enabling users to explore clusters at various resolutions.
Outlier Analysis
Outlier analysis, also known as outlier detection, is the process of identifying and examining
data points that deviate significantly from the majority of the dataset. Outliers are data points
that exhibit different characteristics or behaviors compared to the rest of the data. Outlier
analysis is important in various fields, including data mining, statistics, and anomaly detection. It
helps in understanding unusual patterns, detecting errors or anomalies, and making informed
decisions. There are different approaches to outlier analysis, including statistical and proximity-
based methods.
Outlier Detection
● Parametric Methods: Parametric methods assume a specific distribution for the data and
use statistical techniques to identify outliers. These methods estimate the parameters of
the assumed distribution and detect outliers based on their deviation from the expected
values. Examples of parametric methods include Z-score, Grubbs' test, and Dixon's test.
● Non-parametric Methods: Non-parametric methods make minimal assumptions about
the distribution of data and focus on ranking or ordering the data points. These methods
use statistical ranks or order statistics to identify outliers. Examples of non-parametric
methods include the Median Absolute Deviation (MAD), percentile-based methods, and
the box plot.
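A small sketch of Z-score-based (parametric) outlier detection; the exam scores and the |z| > 2 cutoff are illustrative:

```python
from statistics import mean, stdev

# Z-score outlier detection on illustrative exam scores.
scores = [62, 65, 70, 68, 64, 66, 99, 63]

mu, sigma = mean(scores), stdev(scores)
outliers = [x for x in scores if abs(x - mu) / sigma > 2]  # |z| > 2 is a common cutoff

print("mean =", round(mu, 1), "std =", round(sigma, 1), "outliers =", outliers)
```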
Types of Outliers
Global/Point Outliers:
● Global outliers, also known as point outliers, are individual data points that
significantly deviate from the majority of the dataset. These outliers are isolated
and distinct from other data points, and they have a noticeable impact on the
overall distribution. Global outliers can arise due to measurement errors, data
entry mistakes, or rare events. They are typically easy to detect because they
stand out from the rest of the data.
● For example, in a dataset of students' exam scores, a global outlier may
represent a student who achieved an extremely high or low score compared to
other students.
Collective Outliers:
● Collective outliers are a group of related data points that deviate significantly from the rest of the dataset when considered together, even though the individual points may not be outliers on their own.
● For example, in network traffic data, a sudden burst of many near-identical requests may look normal one at a time but is anomalous as a group, indicating a possible denial-of-service attack.
Conditional Outliers:
● Conditional (contextual) outliers are data points that deviate significantly only within a specific context or condition, such as time or location.
● For example, a temperature of 30°C is normal in summer but would be a conditional outlier if recorded in winter.
UNIT-5
Data stream mining is the process of extracting knowledge and valuable insights from a continuous stream of data using stream-processing software. It can be considered a subset of the general concepts of machine learning, knowledge extraction, and data mining. In data stream mining, analysis of large amounts of data must be performed in real time, and the extracted knowledge is represented in the form of models and patterns over effectively unbounded streams of information.
● Mining time-series data involves analyzing data that is recorded over time.
● Time-series data consists of observations or measurements taken at regular intervals, such as stock prices, temperature readings, or sensor data.
● The goal of mining time-series data is to discover patterns, trends, or anomalies that can provide valuable insights for forecasting, prediction, or anomaly detection.
Predictive Analytics: Time-series data mining can be used to build predictive models that
forecast future trends, patterns, or events based on historical data. This is valuable in various
domains, such as finance, weather forecasting, stock market analysis, and energy consumption
prediction.
Anomaly Detection: Time-series data mining techniques can identify unusual patterns or
outliers in the data, indicating potential anomalies or abnormalities. This is beneficial in
detecting fraud, network intrusion, equipment failure, and other abnormal events.
Pattern Recognition: Time-series data mining can uncover recurring patterns, periodicities, or
trends in the data. This is useful in fields like signal processing, sensor data analysis, and
biological signal analysis.
Resource Optimization: Mining time-series data can help optimize resource allocation and
utilization by analyzing patterns and trends in data related to resource consumption, production,
or demand. This is applicable in industries such as manufacturing, logistics, and energy
management.
Time-Dependent: Time-series data is inherently dependent on time, with data points ordered
chronologically. The temporal dimension is a crucial aspect of time-series analysis.
Irregular Sampling: Time-series data can have irregular or uneven sampling intervals, where
data points are not uniformly spaced in time. Dealing with irregular sampling requires
specialized techniques for interpolation or handling missing data.
Seasonality and Trends: Time-series data can exhibit periodic patterns, seasonality, or long-
term trends. Identifying and modeling these patterns are important for accurate analysis and
prediction.
Noise and Outliers: Time-series data can be subject to noise or contain outliers, which can
affect the accuracy of analysis and modeling
Applications of Sequence Pattern Mining
Market Basket Analysis: Sequence pattern mining can be used in market basket analysis to uncover frequently occurring sequences of items purchased together. This information is valuable for cross-selling, product recommendation systems, and optimizing store layouts.
Web Usage Mining: Mining sequence patterns in web usage data can reveal the sequential
navigation patterns of website visitors. This information can be used for personalization,
improving website design, and identifying bottlenecks or anomalies in user behavior.
Fraud Detection: Mining sequence patterns in transactional databases can assist in detecting
fraudulent activities or patterns of suspicious behavior. By identifying abnormal or fraudulent
sequences of events, such as unauthorized transactions or unusual access patterns, fraud can
be detected and prevented.
Text Mining
Text mining is the process of extracting useful information, patterns, and knowledge from unstructured textual data.
Unstructured Data: Text mining deals with unstructured textual data, which lacks a predefined structure or format. Analyzing and extracting insights from unstructured data pose challenges due to the absence of a standardized schema or organization.
Text Preprocessing: Before applying text mining techniques, textual data usually undergoes
preprocessing steps. These steps involve tokenization, removing stop words, stemming or
lemmatization, and handling noise, punctuation, and special characters.
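A minimal sketch of these preprocessing steps in pure Python; the stop-word list and the crude suffix-stripping "stemmer" are illustrative simplifications:

```python
import re

# Minimal text preprocessing: tokenization, lowercasing, stop-word removal,
# and a crude suffix-stripping "stemmer" (all choices are illustrative).
stop_words = {"the", "is", "a", "and", "of", "to"}

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())               # tokenize, drop punctuation
    tokens = [t for t in tokens if t not in stop_words]           # remove stop words
    return [t[:-3] if t.endswith("ing") else t for t in tokens]   # naive stemming

print(preprocess("Mining the data is the process of extracting interesting patterns!"))
```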
Feature Extraction: Text mining involves extracting meaningful features from text, such as
word frequencies, n-grams, part-of-speech tags, or semantic representations. These features
serve as inputs for machine learning algorithms or statistical analysis.
Language and Context: Text mining considers the linguistic and contextual aspects of textual
data. It deals with challenges like word ambiguity, language variations, sarcasm, irony, and
understanding the meaning of words and phrases in different contexts.
Statistical and Machine Learning Techniques: Text mining employs a range of statistical and
machine learning techniques. These include text classification algorithms (e.g., Naive Bayes,
Support Vector Machines), clustering algorithms (e.g., k-means, hierarchical clustering), topic
modeling methods (e.g., LDA), and sentiment analysis models (e.g., lexicon-based or machine
learning-based approaches).
Integration with NLP: Text mining techniques often leverage natural language processing
(NLP) techniques, such as part-of-speech tagging, named entity recognition, parsing, and
dependency analysis, to enhance the analysis and understanding of textual data.
Notes By Jayanth