UNIT3
Data Mining
Data mining is the process of extracting knowledge or insights from large
amounts of data using various statistical and computational techniques.
OR
Data mining is the process of sorting through large data sets to identify patterns
and relationships that can help solve business problems through data analysis.
The data can be structured, semi-structured or unstructured, and can be stored in
various forms such as databases, data warehouses, and data lakes.
Data mining techniques and tools help enterprises to predict future trends and
make more informed business decisions.
Data mining is a key part of data analytics and one of the core disciplines in data
science, which uses advanced analytics techniques to find useful information in
data sets.
At a more granular level, data mining is a step in the knowledge discovery in
databases (KDD) process, a data science methodology for gathering, processing
and analyzing data.
Data mining and KDD are sometimes referred to interchangeably, but they're
more commonly seen as distinct things.
The process of data mining relies on the effective implementation of data
collection, warehousing and processing.
Data mining can be used to describe a target data set, predict outcomes, detect
fraud or security issues, learn more about a user base, or detect bottlenecks and
dependencies.
It can also be performed automatically or semi-automatically. Data mining is
more useful today due to the growth of big data and data warehousing.
Data specialists who use data mining must have coding and programming
language experience, as well as statistical knowledge to clean, process and
interpret data.
Data mining is a crucial component of successful analytics initiatives in
organizations.
Data specialists can use the information it generates in business intelligence (BI)
and advanced analytics applications that involve analysis of historical data, as
well as real-time analytics applications that examine streaming data as it's created
or collected.
Effective data mining aids in various aspects of planning business strategies and
managing operations.
This includes customer-facing functions, such as marketing, advertising, sales and
customer support, as well as manufacturing, supply chain management (SCM),
finance and human resources (HR).
Data mining supports fraud detection, risk management, cyber security planning
and many other critical business use cases. It also plays an important role in other
areas, including healthcare, government, scientific research, mathematics and
sports.
How does Data Mining work?
Data scientists and other skilled BI and analytics professionals typically perform data
mining. But data-savvy business analysts, executives and workers who function as
citizen data scientists in an organization can also perform data mining.
The core elements of data mining include machine learning and statistical analysis,
along with data management tasks done to prepare data for analysis.
The use of machine learning algorithms and artificial intelligence (AI) tools has
automated more of the process.
These tools have also made it easier to mine massive data sets, such as customer
databases, transaction records and log files from web servers, mobile apps and sensors.
Although the number of stages can differ depending on how granular an organization
wants each step to be, the data mining process can generally be broken down into the
following four primary stages:
Data gathering. Identify and assemble relevant data for an analytics application. The
data might be located in different source systems, a data warehouse or a data lake, an
increasingly common repository in big data environments that contains a mix of
structured and unstructured data.
External data sources can also be used. Wherever the data comes from, a data scientist
often moves it to a data lake for the remaining steps in the process.
Data preparation. This stage includes a set of steps to get the data ready to be mined.
Data preparation starts with data exploration, profiling and pre-processing, followed by
data cleansing work to fix errors and other data quality issues, such as duplicate or
missing values.
Data transformation is also done to make data sets consistent, unless a data scientist
wants to analyze unfiltered raw data for a particular application.
Data mining. Once the data is prepared, a data scientist chooses the appropriate data
mining technique and then implements one or more algorithms to do the mining.
These techniques, for example, could analyze data relationships and detect patterns,
associations and correlations.
In machine learning applications, the algorithms typically must be trained on sample
data sets to look for the information being sought before they're run against the full set
of data.
Data analysis and interpretation. The data mining results are used to create analytical
models that can help drive decision-making and other business actions.
The data scientist or another member of a data science team must also communicate the
findings to business executives and users, often through data visualization and the use
of data storytelling techniques.
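For illustration, the four stages can be sketched in a few lines of Python with pandas and scikit-learn. The file name customers.csv, the churned target column, and the choice of a random forest are hypothetical assumptions; the sketch only shows how gathering, preparation, mining, and interpretation typically fit together, assuming the feature columns are numeric.

```python
# A minimal, illustrative sketch of the four data mining stages.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# 1. Data gathering: pull data exported from a source system or data lake (hypothetical file).
df = pd.read_csv("customers.csv")

# 2. Data preparation: basic cleansing of duplicates and missing values.
df = df.drop_duplicates().dropna()

# 3. Data mining: train a classifier on a sample (training) set.
X = df.drop(columns=["churned"])  # hypothetical feature columns (assumed numeric)
y = df["churned"]                 # hypothetical target column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# 4. Data analysis and interpretation: evaluate the model and report the findings.
print(classification_report(y_test, model.predict(X_test)))
```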
1. Define Problem.
Clearly define the objectives and goals of your data mining project. Determine what you
want to achieve and how mining data can help in solving the problem or answering
specific questions.
2. Collect Data.
Gather relevant data from various sources, including databases, files, APIs, or online
platforms. Ensure that the collected data is accurate, complete, and representative of the
problem domain.
Modern analytics and BI tools often have data integration capabilities. Otherwise, you’ll
need someone with expertise in data management to clean, prepare, and integrate the
data.
3. Prepare Data.
Clean and preprocess your collected data to ensure its quality and suitability for
analysis. This step involves tasks such as removing duplicate or irrelevant
records, handling missing values, correcting inconsistencies, and transforming the
data into a suitable format.
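As a small, hedged illustration of this step, the pandas snippet below removes duplicates, corrects an inconsistency, and fills missing values; the toy DataFrame and its column names are made up for the example.

```python
import pandas as pd

# Hypothetical raw data with typical quality issues.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, None, None, 45, 29],
    "country": ["UK", "uk", "uk", "US", "US"],
})

clean = (
    raw.drop_duplicates(subset="customer_id")               # remove duplicate records
       .assign(country=lambda d: d["country"].str.upper())  # correct inconsistent formatting
)
clean["age"] = clean["age"].fillna(clean["age"].median())   # handle missing values
print(clean)
```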
4. Explore Data.
Explore and understand your data through descriptive statistics, visualization
techniques, and exploratory data analysis. This step helps in identifying patterns,
trends, and outliers in the dataset and gaining insights into the underlying data
characteristics.
5. Select predictors.
This step, also called feature selection/engineering, involves identifying the
relevant features (variables) in the dataset that are most informative for the task.
This may involve eliminating irrelevant or redundant features and creating new
features that better represent the problem domain.
6. Select Model.
Choose an appropriate model or algorithm based on the nature of the problem, the
available data, and the desired outcome.
Common techniques include decision trees, regression, clustering, classification,
association rule mining, and neural networks. If you need to understand the
relationship between the input features and the output prediction, you may want a
simpler model like linear regression.
If you need a highly accurate prediction and explainability is less important, a
more complex model such as a deep neural network may be better.
7. Train Model.
Train your selected model using the prepared dataset. This involves feeding the
model with the input data and adjusting its parameters or weights to learn from
the patterns and relationships present in the data.
8. Evaluate Model.
Assess the performance and effectiveness of your trained model using a
validation set or cross-validation. This step helps in determining the model's
accuracy, predictive power, or clustering quality and whether it meets the desired
objectives.
You may need to adjust the hyperparameters to prevent overfitting and improve
the performance of your model.
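Steps 7 and 8 can be illustrated with scikit-learn. The dataset (a built-in sample) and the parameter grid below are illustrative choices, not a prescribed recipe.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Train and evaluate with 5-fold cross-validation.
model = DecisionTreeClassifier(random_state=0)
print("Mean CV accuracy:", cross_val_score(model, X, y, cv=5).mean())

# Tune a hyperparameter (max_depth) to reduce overfitting.
grid = GridSearchCV(model, {"max_depth": [2, 4, 6, None]}, cv=5)
grid.fit(X, y)
print("Best parameters:", grid.best_params_, "best CV score:", grid.best_score_)
```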
9. Deploy Model.
Deploy your trained model into a real-world environment where it can be used
to make predictions, classify new data instances, or generate insights. This may
involve integrating the model into existing systems or creating a user-friendly
interface for interacting with the model.
10. Monitor & Maintain Model.
Continuously monitor your model's performance and ensure its accuracy and
relevance over time. Update the model as new data becomes available, and refine
the data mining process based on feedback and changing requirements.
Data Mining Techniques
Association rule mining focuses on discovering interesting relationships or
patterns among a set of items in transactional or market basket data.
It helps identify frequently co-occurring items and generates rules such as "if X,
then Y" to reveal associations between items. This simple Venn diagram shows
the associations between item sets X and Y of a dataset.
Classification is a technique used to categorize data into predefined classes or
categories based on the features or attributes of the data instances.
It involves training a model on labeled data and using it to predict the class labels
of new, unseen data instances.
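A minimal classification sketch with scikit-learn's built-in Iris dataset is shown below; the choice of logistic regression is just one possible classifier.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train on labeled data, then predict class labels for unseen instances.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Accuracy on unseen data:", accuracy_score(y_test, clf.predict(X_test)))
```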
Data prediction is a two-step process, similar to that of data classification. For prediction, however, we do not use the term "class label attribute," because the attribute whose values are being predicted is continuous-valued (ordered) rather than categorical (discrete-valued and unordered).
The attribute can be referred to simply as the predicted attribute. Prediction can be viewed as
the construction and use of a model to assess the class of an unlabeled object, or to assess the
value or value ranges of an attribute that a given object is likely to have.
Clustering is a technique used to group similar data instances together based on their
intrinsic characteristics or similarities. It aims to discover natural patterns or structures in the
data without any predefined classes or labels.
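For example, k-means clustering can group a handful of made-up 2-D points without any labels; the data below is purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two loose groups.
points = np.array([[1, 2], [1, 4], [2, 3], [8, 8], [9, 9], [8, 10]])

# Group similar instances together without predefined classes or labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("Cluster labels:", kmeans.labels_)
print("Cluster centres:", kmeans.cluster_centers_)
```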
Regression is employed to predict numeric or continuous values based on the
relationship between input variables and a target variable. It aims to find a
mathematical function or model that best fits the data to make accurate
predictions.
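A hedged regression sketch: the advertising-spend and sales numbers below are invented solely to show how a model is fitted and used for a numeric prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: advertising spend (input) vs. sales (continuous target).
spend = np.array([[10], [20], [30], [40], [50]])
sales = np.array([25, 45, 65, 85, 105])

# Fit a function that maps the input variable to the target variable.
model = LinearRegression().fit(spend, sales)
print("Predicted sales for a spend of 60:", model.predict([[60]])[0])
```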
Outlier Detection. A database may contain data objects that do not comply with the
general behavior or model of the data. These data objects are Outliers.
The investigation of OUTLIER data is known as OUTLIER MINING. An outlier may
be detected using statistical tests which assume a distribution or probability model for
the data, or using distance measures where objects having a small fraction of “close”
neighbors in space are considered outliers.
Rather than using statistical or distance measures, deviation-based techniques identify outliers by examining differences in the principal characteristics of objects in a group.
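A simple statistics/distance-style check is the interquartile range (IQR) rule sketched below; the transaction amounts are hypothetical.

```python
import numpy as np

# Hypothetical transaction amounts with one suspicious value.
amounts = np.array([52, 48, 50, 47, 53, 49, 51, 500])

# Flag values far outside the middle 50% of the data as outliers.
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
outliers = amounts[(amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)]
print("Flagged outliers:", outliers)  # expected: [500]
```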
Genetic algorithms are adaptive heuristic search algorithms that belong to the
larger part of evolutionary algorithms. Genetic algorithms are based on the
ideas of natural selection and genetics.
They are an intelligent exploitation of random search, guided by historical data, to direct the search towards regions of better performance in the solution space. They are commonly used to generate high-quality solutions to optimization and search problems.
Genetic algorithms simulate the process of natural selection which means those
species who can adapt to changes in their environment are able to survive and
reproduce and go to the next generation. In simple words, they simulate “survival
of the fittest” among individuals of consecutive generations for solving a
problem.
Each generation consists of a population of individuals, and each individual represents a point in the search space and a possible solution. Each individual is represented as a string of characters/integers/floats/bits. This string is analogous to a chromosome.
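A toy genetic algorithm for the classic "OneMax" problem (evolve a bit string towards all 1s) is sketched below; the population size, mutation rate, and selection scheme are arbitrary illustrative choices.

```python
import random

LENGTH, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 50, 0.02

def fitness(individual):
    # Fitness = number of 1 bits (the quantity being maximized).
    return sum(individual)

def crossover(a, b):
    # Single-point crossover between two parent chromosomes.
    point = random.randint(1, LENGTH - 1)
    return a[:point] + b[point:]

def mutate(individual):
    # Flip each bit with a small probability.
    return [bit ^ 1 if random.random() < MUTATION_RATE else bit for bit in individual]

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    # Selection: keep the fitter half ("survival of the fittest").
    population.sort(key=fitness, reverse=True)
    survivors = population[: POP_SIZE // 2]
    # Reproduction: refill the population with mutated offspring of random survivors.
    offspring = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                 for _ in range(POP_SIZE - len(survivors))]
    population = survivors + offspring

best = max(population, key=fitness)
print("Best individual:", best, "fitness:", fitness(best))
```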
Advantages of Data Mining
Better Decision Making:
Data mining helps to extract useful information from large datasets, which can be
used to make informed and accurate decisions. By analyzing patterns and
relationships in the data, businesses can identify trends and make predictions that
help them make better decisions.
Improved Marketing:
Data mining can help businesses identify their target market and develop effective
marketing strategies. By analyzing customer data, businesses can identify
customer preferences and behavior, which can help them create targeted
advertising campaigns and offer personalized products and services.
Increased Efficiency:
Data mining can help businesses streamline their operations by identifying
inefficiencies and areas for improvement. By analyzing data on production
processes, supply chains, and employee performance, businesses can identify
bottlenecks and implement solutions that improve efficiency and reduce costs.
Fraud Detection:
Data mining can be used to identify fraudulent activities in financial transactions,
insurance claims, and other areas. By analyzing patterns and relationships in the
data, businesses can identify suspicious behavior and take steps to prevent fraud.
Customer Retention:
Data mining can help businesses identify customers who are at risk of leaving and
develop strategies to retain them. By analyzing customer data, businesses can
identify factors that contribute to customer churn and take steps to address those
factors.
Competitive Advantage:
Data mining can help businesses gain a competitive advantage by identifying new
opportunities and emerging trends. By analyzing data on customer behavior,
market trends, and competitor activity, businesses can identify opportunities to
innovate and differentiate themselves from their competitors.
Improved Healthcare:
Data mining can be used to improve healthcare outcomes by analyzing patient
data to identify patterns and relationships. By analyzing medical records and
other patient data, healthcare providers can identify risk factors, diagnose diseases
earlier, and develop more effective treatment plans.
Disadvantages of Data Mining
Data Quality:
Data mining relies heavily on the quality of the data used for analysis. If the data
is incomplete, inaccurate, or inconsistent, the results of the analysis may be
unreliable.
Data Privacy and Security:
Data mining involves analyzing large amounts of data, which may include
sensitive information about individuals or organizations. If this data falls into the
wrong hands, it could be used for malicious purposes, such as identity theft or
corporate espionage.
Ethical Considerations:
Data mining raises ethical questions around privacy, surveillance, and
discrimination. For example, the use of data mining to target specific groups of
individuals for marketing or political purposes could be seen as discriminatory or
manipulative.
Technical Complexity:
Data mining requires expertise in various fields, including statistics, computer
science, and domain knowledge. The technical complexity of the process can be a
barrier to entry for some businesses and organizations.
Cost:
Data mining can be expensive, particularly if large datasets need to be analyzed.
This may be a barrier to entry for small businesses and organizations.
Interpretation of Results:
Data mining algorithms generate large amounts of data, which can be difficult to
interpret. It may be challenging for businesses and organizations to identify
meaningful patterns and relationships in the data.
Dependence on Technology:
Data mining relies heavily on technology, which can be a source of risk.
Technical failures, such as hardware or software crashes, can lead to data loss or
corruption.
Applications of Data Mining
Organizations in the following industries use data mining as part of their analytics
applications:
Retail. Online retailers mine customer data and internet clickstream records to
help them target marketing campaigns, ads and promotional offers to individual
shoppers. Data mining and predictive modeling also power the recommendation
engines that suggest possible purchases to website visitors, as well as inventory
and SCM activities.
Financial services. Banks and credit card companies use data mining tools to
build financial risk models, detect fraudulent transactions, and vet loan and credit
applications. Data mining also plays a key role in marketing and identifying
potential upselling opportunities with existing customers.
Insurance. Insurers rely on data mining to aid in pricing insurance policies and
deciding whether to approve policy applications, as well as for risk modeling and
managing prospective customers.
Manufacturing. Data mining applications for manufacturers include efforts to
improve uptime and operational efficiency in production plants, supply chain
performance and product safety.
Entertainment. Streaming services analyze what users are watching or listening to
and make personalized recommendations based on their viewing and listening
habits. Likewise, individuals might data mine software to learn more about it.
Healthcare. Data mining helps doctors diagnose medical conditions, treat patients,
and analyze X-rays and other medical imaging results. Medical research also
depends heavily on data mining, machine learning and other forms of analytics.
HR. HR departments typically work with large amounts of data. This includes
retention, promotion, salary and benefit data. Data mining can analyze this data to improve HR processes.
Social media. Social media companies use data mining to gather large amounts of
data about users and their online activities. This data is used for targeted advertising or, more controversially, sold to third parties.
Data mining vs. data analytics and data warehousing
Data mining is sometimes considered synonymous with data analytics. But it's
predominantly seen as a specific aspect of data analytics that automates the
analysis of large data sets to discover information that otherwise couldn't be
detected.
That information can then be used in the data science process and in other BI and
analytics applications.
Data warehousing supports data mining efforts by providing repositories for the
data sets.
Traditionally, historical data has been stored in enterprise data warehouses or
smaller data marts built for individual business units or to hold specific subsets of
data.
Now, though, data mining applications are often served by data lakes that store
both historical and streaming data and are based on big data platforms, like
Hadoop and Spark; NoSQL databases; or cloud object storage services.
Data mining history and origins
Data warehousing, BI and analytics technologies began to emerge in the late
1980s and early 1990s, increasing organizations' abilities to analyze the growing
amounts of data that they were creating and collecting.
The term data mining was first used in 1983 by economist Michael Lovell and
saw wider use by 1995 when the First International Conference on Knowledge
Discovery and Data Mining was held in Montreal.
The event was sponsored by the Association for the Advancement of Artificial
Intelligence, which also held the conference annually for the next three years.
Since 1999, the Special Interest Group for Knowledge Discovery and Data
Mining within the Association for Computing Machinery has primarily organized
the ACM SIGKDD conference.
The technical journal, Data Mining and Knowledge Discovery, published its first
issue in 1997. It's published bimonthly and contains peer-reviewed articles on
data mining and knowledge discovery theories, techniques and practices.
Another peer-reviewed publication, American Journal of Data Mining and
Knowledge Discovery, was launched in 2016.
KDD (Knowledge Discovery in Databases) Process
KDD (Knowledge Discovery in Databases) is a process that involves the
extraction of useful, previously unknown, and potentially valuable information
from large datasets.
KDD is an iterative process: the steps below are typically repeated multiple times to extract accurate knowledge from the data.
The KDD process includes the following steps:
Data Cleaning
Data cleaning is defined as the removal of noisy and irrelevant data from the collection. It includes handling missing values, cleaning noisy data (where noise is a random error or variance), and using data discrepancy detection and data transformation tools.
Data Integration
Data integration is defined as combining heterogeneous data from multiple sources into a common store (a data warehouse). It is performed using data migration tools, data synchronization tools, and the ETL (Extract, Transform, Load) process.
Data Selection
Data selection is defined as the process where data relevant to the analysis is
decided and retrieved from the data collection. For this, methods such as neural networks, decision trees, Naive Bayes, clustering, and regression can be used.
Data Transformation
Data transformation is defined as the process of transforming data into the appropriate form required by the mining procedure. Data transformation is a two-step process:
Data Mapping: Assigning elements from source base to destination to capture
transformations.
Code generation: Creation of the actual transformation program.
Data Mining
Data mining is defined as the application of techniques to extract potentially useful patterns. It transforms task-relevant data into patterns and decides the purpose of the model, for example classification or characterization.
Pattern Evaluation
Pattern evaluation is defined as identifying the truly interesting patterns representing knowledge based on given interestingness measures. It finds an interestingness score for each pattern and uses summarization and visualization to make the results understandable to the user.
Knowledge Presentation
In this final step, the discovered knowledge is presented to the user using visualization and knowledge representation techniques such as tables, reports, charts, and trees.
Advantages of KDD
i. Improves decision-making: KDD provides valuable insights and knowledge
that can help organizations make better decisions.
ii. Increased efficiency: KDD automates repetitive and time-consuming tasks
and makes the data ready for analysis, which saves time and money.
iii. Better customer service: KDD helps organizations gain a better understanding
of their customers’ needs and preferences, which can help them provide better
customer service.
iv. Fraud detection: KDD can be used to detect fraudulent activities by
identifying patterns and anomalies in the data that may indicate fraud.
v. Predictive modeling: KDD can be used to build predictive models that can
forecast future trends and patterns.
Disadvantages of KDD
i. Privacy concerns: KDD can raise privacy concerns as it involves collecting and
analyzing large amounts of data, which can include sensitive information about
individuals.
ii. Complexity: KDD can be a complex process that requires specialized skills and
knowledge to implement and interpret the results.
iii. Unintended consequences: KDD can lead to unintended consequences, such as bias
or discrimination, if the data or models are not properly understood or used.
iv. Data quality: The KDD process depends heavily on the quality of the data; if the data is not accurate or consistent, the results can be misleading.
v. High cost: KDD can be an expensive process, requiring significant investments in
hardware, software, and personnel.
vi. Overfitting: KDD process can lead to overfitting, which is a common problem in
machine learning where a model learns the detail and noise in the training data to
the extent that it negatively impacts the performance of the model on new unseen
data.
Difference Between KDD and Data Mining
Stages of Data Mining Process
Data mining is a powerful process that involves extracting valuable information
from large datasets.
It’s a multi-step procedure that includes the following stages:
1. Business Understanding
This is where you define the objectives and requirements of the data mining
project from a business perspective. It’s about understanding the problem and
what you aim to achieve.
2. Data Understanding
At this stage, you collect initial data and get acquainted with it to identify data
quality issues, discover first insights into the data, or detect interesting subsets to
form hypotheses for hidden information.
3. Data Preparation
This involves cleaning and preprocessing the data: selecting, cleaning, constructing, integrating, and formatting it to ensure it's ready for analysis.
4. Modeling
In this phase, you select and apply various modeling techniques using machine
learning algorithms. The goal is to create models that can predict trends or
patterns.
5. Evaluation
You assess the model to ensure it meets the business objectives. This involves
evaluating the results of the modeling process to determine its effectiveness.
6. Deployment
The final step is to deploy the model into operation. This could mean
implementing it into a decision-making process or simply documenting the
results.
Task Primitives
A data mining task can be specified in the form of a data mining query, which is input to the data mining system.
Task primitives are used to construct such queries; through them, the user interacts and communicates with the data mining system to carry out the data mining process.
1. The set of task-relevant data to be mined
This specifies the portions of the database or the set of data in which the user is
interested. This includes the database attributes or data warehouse dimensions of
interest (the relevant attributes or dimensions).
In a relational database, the set of task-relevant data can be collected via a
relational query involving operations like selection, projection, join, and
aggregation.
2. Kind of knowledge to be mined
After preparing the data, it is crucial to identify the type of knowledge to be extracted from the available data, since this determines which data mining functions are performed, such as characterization, discrimination, association and correlation analysis, classification, prediction, and clustering.
Let us understand a few types of knowledge to be mined:
a. Descriptive Knowledge: From the available dataset, a mall owner wants to
find out the characteristics of its customers, like what they buy, what their
expense range is, do they prefer branded or non-branded items, etc. Such
type of information is called Descriptive knowledge.
b. Predictive Knowledge: Based on the application applicant’s past financial
records, the Bank analyses whether the respective person applying for a loan
is capable of repaying their loan and whether the bank accepts or rejects a
loan application. Such information is called predictive knowledge.
c. Associative knowledge: Based on available customer data, the Mall owner wants to find
out which products are best suited to each other so that if kept together in a single
section, there is a high probability that customers will buy both of them. Such type of
information is called Associative knowledge.
3. Background knowledge to be used in the discovery process
During the data mining process, prior knowledge, such as concept hierarchies and user beliefs about relationships in the data, is used to guide the mining process.
Some skills and features used are as follows:
a. Domain Knowledge: While extracting data, the person should have a proper
knowledge of the domain from which data is being extracted. This includes
understanding standard industry terms, its business and working methodology
b. Past Data: Data that is already extracted can serve as a foundation and help in
understanding the type of data and the industry.
c. Existing models and equations: Existing models also help in understanding how
the data works and the expected outcomes of the data.
4. Interestingness measures and thresholds for pattern evaluation
a) Interestingness measures: Multiple factors are considered, such as:
i. Simplicity: A factor contributing to the interestingness of a pattern is the
pattern's overall simplicity for human comprehension. For example, the more
complex the structure of a rule is, the more difficult it is to interpret, and
hence, the less interesting it is likely to be.
ii. Support/utility (how frequently the pattern occurs),
iii. Certainty/confidence (the strength of the association between elements of the data),
iv. Lift (unexpectedly high confidence), and
v. Interest (a combination of support, confidence, and lift) are taken into consideration to identify the patterns in the data set.
vi. Novelty: Novel patterns are those that contribute new information or
increased performance to the given pattern set, for example, a data exception. Another strategy for detecting novelty is to remove redundant patterns.
b). Threshold for pattern evaluation: After identifying the patterns according to
interestingness, thresholds are used as filters to get only the desired ones.
Thresholds are predefined values, which were initially set by data miners through
which the required patterns can be filtered according to use case and industry.
If high thresholds are set, only a few patterns will meet the threshold values, and
fewer patterns can be extracted.
5. Representation for visualizing the discovered pattern: Discovered patterns are
sometimes hard to understand, but if viewed using diagrams, it becomes quite
easy.
Visualization techniques are used to represent data, which helps to understand
important relationships and patterns within data. These visualization techniques
are Rules, tables, reports, charts, graphs, decision trees, and cubes.
Scenario Example of Data mining Task Primitive
Scenario – Mall, sales data, and its analysis.
Set of task-relevant data: The mall, over a period of time, has collected a large
amount of data like data on its employees, data on mall infrastructure,
logistics, Customer information, sales records, Festive offers, etc.
From this data, find out the relevant data which will help to analyze the graph of
sales.
This most important data will be sales done in the last few months and customer
information. This data will be gathered, and unwanted components will be
removed. After this, the Formatting of data will be done.
Kind of knowledge: From the data, experts will find the pattern from customer
priorities with respect to product type, product price range, shopping
frequency, etc.
Background knowledge: Experts will apply their knowledge of how the mall's sales function and what extra measures must be taken to improve sales, along with any past experience with other malls and any specific models through which sales can be improved.
Interestingness of measures and thresholds for pattern evaluation:
Support – Calculating the most and least frequently sold items in the mall.
Confidence – Identifying how often product B is sold along with product A.
Lift – Measuring how much more often product B is sold together with product A than would be expected if the two were independent.
Data visualization: Extracted patterns are visualized using charts and diagrams for a better understanding of the relationships in the data. For example, sales revenue over a year can be presented using a line chart or a bar chart.
Advantages of Data Mining Task Primitives
Gives a broader perspective of data extraction and analysis.
Data is well extracted, formatted, and cleaned to give precise and required data
only.
Experts understand the sales of a particular business easily.
Improves efficiency and gives a long-term improvement path to the industry.
Conclusion
In any industry, data is a crucial component: it allows us to map the state of the industry and its future, so it must be handled with care. Data mining task primitives are used to specify how data is extracted from different sources, formatted, and cleaned for further use.
Knowledge Representation
Knowledge representation is the presentation of knowledge to the user for
visualization in terms of trees, tables, rules, graphs, charts, or matrices.
It refers to the methods and structures used to represent the patterns, insights, and knowledge discovered from data mining processes in a meaningful and interpretable manner.
Knowledge representation enables us to extract meaningful insights from large
and complex datasets.
Features of Knowledge Representation
1. Expressiveness: The representation should be capable of expressing diverse types of
knowledge, including concepts, facts, rules, and relationships.
2. Inference: The representation should support reasoning and inference, allowing for
the derivation of new knowledge from existing knowledge.
3. Efficiency: It should be computationally efficient to store, retrieve, and manipulate
knowledge within the representation.
4. Scalability: The representation should be able to scale to large datasets and handle
complex knowledge structures effectively.
5. Flexibility: It should be flexible enough to adapt to changes and updates in the
knowledge domain.
6. Interoperability: The representation should facilitate interoperability between
different systems and domains, allowing for the integration of knowledge from diverse
sources.
7. Transparency: The representation should be transparent and understandable to
humans, enabling users to interpret and reason about the knowledge effectively.
8. Domain-specificity: Knowledge representations should be tailored to specific
domains to capture the intricacies and nuances of the domain effectively.
Importance of Knowledge Representation
Facilitating Reasoning and Inference: It enables systems to store and manipulate
information in a structured format.
Supporting Problem Solving: Representing knowledge in a structured form
provides a framework for problem-solving algorithms to operate.
Enabling Learning and Adaptation: It is essential for machine learning and other
forms of artificial intelligence to learn from data and adapt to new situations.
Facilitating Communication: Representing knowledge in a structured format
allows for easier communication and sharing of information between different
systems and individuals.
Supporting Natural Language Understanding: Effective knowledge representation
is essential for natural language understanding systems.
Enhancing Decision-Making: Knowledge representation supports decision-
making processes.
elwanzita20s@gmail.com 58
Knowledge Representation Techniques in Data Mining
Graph-based Representations
Graphs are often used to represent complex relationships and dependencies in the
data.
Nodes represent entities, and edges represent connections between them.
Graph-based representations are particularly useful for tasks like social network
analysis, where relationships between individuals or entities are of interest. For
instance, a graph representation of a social network can reveal clusters of friends
or influencers within a community.
Graph-based representations include decision trees as well as graphs and networks.
Decision trees
A decision tree is a structure that includes a root node, branches, and leaf nodes.
Each internal node denotes a test on an attribute, each branch denotes the
outcome of a test, and each leaf node holds a class label.
Benefits of having decision trees:
It does not require any domain knowledge.
It is easy to comprehend.
The learning and classification steps of a decision tree are simple and fast.
For example, a decision tree for the concept buy_computer indicates whether a customer at a company is likely to buy a computer. Each internal node represents a test on an attribute, and each leaf node represents a class.
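The buy_computer tree itself appears as a figure in the original slides; as a hedged stand-in, the snippet below learns a similar tree from a tiny hypothetical table and prints its structure with scikit-learn.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical buy_computer-style training data (not the original figure's data).
data = pd.DataFrame({
    "age":     ["youth", "youth", "middle", "senior", "senior", "middle"],
    "student": ["no",    "yes",   "no",     "yes",    "no",     "yes"],
    "buys":    ["no",    "yes",   "yes",    "yes",    "no",     "yes"],
})

# One-hot encode the categorical attributes so the tree can split on them.
X = pd.get_dummies(data[["age", "student"]])
y = data["buys"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```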
Graphs & Networks
A graph is a non-linear kind of data structure made up of nodes or vertices and
edges. The edges connect any two nodes in the graph, and the nodes are also
known as vertices.
Represent relationships between entities using nodes and edges, which can
capture complex interconnections.
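One common way to work with such structures in Python is the networkx library; the tiny friendship graph below is hypothetical.

```python
import networkx as nx

# Nodes are people; edges are friendships.
g = nx.Graph()
g.add_edges_from([("Alice", "Bob"), ("Bob", "Carol"),
                  ("Carol", "Alice"), ("Dave", "Carol")])

print("Friends of Carol:", list(g.neighbors("Carol")))
print("Degree of each node:", dict(g.degree()))
```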
Features of Graph-Based Representations
Expressiveness: Graphs can represent complex relationships and structures.
Flexibility: Graphs can model various types of relationships, making them highly
flexible for representing different kinds of information.
Efficiency: Graph algorithms enable efficient querying, traversal, and
manipulation of the graph structure.
Scalability: Graph-based representations can scale to large datasets and handle
dynamic information effectively.
Visualization: Graphs can be visualized graphically, making it easier for users to
understand and interpret the relationships and structures within the knowledge
graph.
Importance of Graph-Based Representations
Knowledge Integration: Graph-based representations facilitate the integration of
diverse types of knowledge from different sources.
Reasoning and Inference: Graph-based representations support reasoning and
inference.
Recommendation Systems: Graph-based representations are widely used in
recommendation systems.
Network Analysis: Graph-based representations are essential for network analysis
tasks.
Semantic Web: Graph-based representations are fundamental to the Semantic
Web initiative.
Logical Representations
One common approach is to use logical representations, such as propositional
logic or first-order logic, to encode the knowledge.
These logical representations allow for precise and formal descriptions of the
data, enabling reasoning and inference to be applied to the mined knowledge.
For example, if we have a logical representation of customer purchase behavior,
we can infer patterns like "Customers who bought A also tend to buy B."
Logical Representations include Association Rules and Classification Rules
Association Rules
Represent relationships between variables in the form of "If X, then Y."
Association rules are if-then statements that show the probability of relationships
between data items within large data sets in various types of databases.
For example, if 75% of people who buy cereal also buy milk, then there is a
discernible pattern in transactional data that customers who buy cereal often buy
milk.
How association rules work
An association rule has two parts: an antecedent (if) and a consequent (then). An
antecedent is an item found within the data. A consequent is an item found in
combination with the antecedent. The if-then statements form item sets, which are
the basis for calculating association rules made up of two or more items in a data
set.
Data pros search data for frequently occurring if-then statements. They then look
for support for the statement in terms of how often it appears and confidence in it
from the number of times it's found to be true.
Association rules are typically created from item sets that include many items and
are well represented in data sets. However, if rules are built from analyzing all
possible item sets or too many sets of items, too many rules result, and they have
little meaning.
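Support, confidence, and lift can be computed directly from a small set of transactions. The basket data below is hypothetical, and the snippet is a plain-Python sketch of the measures rather than a full rule-mining algorithm such as Apriori.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"cereal", "milk"},
    {"cereal", "milk", "bread"},
    {"cereal"},
    {"milk", "bread"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule: "if cereal (antecedent), then milk (consequent)".
antecedent, consequent = {"cereal"}, {"milk"}
confidence = support(antecedent | consequent) / support(antecedent)
lift = confidence / support(consequent)

print("support =", support(antecedent | consequent))   # 0.5
print("confidence =", round(confidence, 2))            # 0.67
print("lift =", round(lift, 2))                        # 0.89
```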
Features Of Logical Representations
Formalism: Logical representations provide a formal and rigorous framework for
representing knowledge.
Expressiveness: Logical representations can express complex relationships, rules,
and constraints using logical formulas or rules.
Inference: Logical representations support deductive reasoning and inference.
Consistency: Logical representations ensure consistency in knowledge
representation.
Importance Of Logical Representations
Reasoning: Logical representations enable automated reasoning and inference.
Expert Systems: Logical representations are fundamental to expert systems and
rule-based systems.
Semantic Web: Logical representations play a crucial role in the Semantic Web
initiative.
Natural Language Processing: Logical representations are used in natural
language processing (NLP).
Semantic Networks
Semantic networks are a way of representing knowledge or information in a
structured form using nodes (or vertices) to represent concepts or entities, and
links (or edges) to represent relationships between these concepts.
Each node typically represents a specific concept, while the links represent
various types of relationships, such as "is-a," "part-of," "located-in," or any other
relevant connection between concepts.
These networks enable computers to understand the meaning of words and
concepts by organizing them in a way that reflects their semantic relationships,
making it easier to process and manipulate information for tasks like natural
language understanding, reasoning, and knowledge representation.
Features of Semantic Networks
Nodes and Edges: Semantic networks consist of nodes representing concepts or
entities and edges representing relationships between them.
Hierarchy: Semantic networks often incorporate hierarchical relationships, where nodes can be organized from more general to more specific concepts.
Directed or Undirected: Edges in semantic networks can be directed or
undirected. Directed edges indicate specific relationships between concepts, such
as "is-a" or "part-of," while undirected edges can represent symmetrical
relationships, such as similarity or association.
Multiple Relationships: Semantic networks can accommodate multiple types of
relationships between nodes.
Scalability: Semantic networks can scale to represent large amounts of knowledge
while maintaining their structural integrity and semantic coherence.
Importance Of Semantic Networks
Knowledge Representation: They provide a structured way to represent
knowledge by organizing concepts and their relationships.
Natural Language Processing (NLP): Semantic networks play a crucial role in
NLP tasks.
Artificial Intelligence (AI): Semantic networks are fundamental to many AI
applications.
Information Retrieval: Semantic networks enhance information retrieval systems
by organizing information in a meaningful way.
Semantic Web: Semantic networks are foundational to the Semantic Web
initiative.
Ontologies
Represent knowledge as a set of concepts and the relationships between those
concepts within a specific domain.
Ontologies are used to facilitate data integration, sharing, and analysis by offering
a clear and consistent schema for representing domain knowledge.
Applications of ontologies include data integration, semantic search and retrieval,
knowledge discovery and data interoperability.
Key features of Ontologies
Concepts (Classes): Fundamental entities or categories in a domain, such as "Customer," "Product," and "Order" in an e-commerce ontology.
Relationships (Properties): Associations between concepts, such as "Customer places Order" or "Product belongs to Category."
Hierarchies: Concept hierarchies (taxonomies) showing how more general concepts encompass more specific ones. For example, "Vehicle" might encompass "Car," "Truck," and "Motorcycle."
Instances: Specific examples of concepts, such as a particular customer named "John Doe" or a specific product called "Laptop."
Features Of Ontologies
Conceptual Clarity: Ontologies provide a clear and well-defined
conceptualization of a domain.
Formalism: Ontologies are typically represented using formal languages such as RDF (Resource Description Framework), OWL (Web Ontology Language), or Description Logics.
Expressiveness: Ontologies support rich and expressive representations.
Modularity: Ontologies can be modularized into smaller, reusable components,
making it easier to manage and maintain.
Importance Of Ontologies
Semantic Interoperability: Ontologies enable semantic interoperability.
Knowledge Sharing and Reuse: Ontologies support knowledge sharing and
reuse.
Semantic Search and Retrieval: Ontologies enhance search and retrieval
processes.
Intelligent Reasoning: Ontologies support intelligent reasoning and inference.
Semantic Web: Ontologies are fundamental to the Semantic Web initiative, where
they serve as the building blocks for organizing and structuring knowledge on the
web.
Attribute value Representations
Attribute-value representation is a method used in data mining knowledge
representation that involves describing data objects using a set of attributes (also
known as features or variables) and their corresponding values.
Each attribute represents a characteristic or property of the data object, and the
value denotes the specific instance of that attribute for the given object.
Data objects are typically represented as rows in a table (or records in a database),
where each row corresponds to a single data object, and each column corresponds
to an attribute. This tabular format is common in datasets used for data mining.
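A pandas DataFrame is a convenient way to hold this tabular attribute-value form; the rows and column names below are illustrative only.

```python
import pandas as pd

# Each row is a data object; each column is an attribute with its value.
objects = pd.DataFrame([
    {"customer_id": 1, "age": 34, "city": "Kampala", "spend": 120.5},
    {"customer_id": 2, "age": 28, "city": "Nairobi", "spend": 75.0},
])
print(objects)
```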
Features Of Attribute Value Representations
Flexibility: Attribute-value representations allow for flexible and extensible
modeling of knowledge.
Simplicity: Attribute-value representations are relatively simple and intuitive.
Expressiveness: Attribute-value representations can express a wide range of
knowledge.
Efficiency: Attribute-value representations can be efficiently stored and processed
using data structures, making them suitable for managing and manipulating
structured data.
Importance Of Attribute Value Representations
Data Modeling: Attribute-value representations are widely used for modeling
structured data.
Information Retrieval: Attribute-value representations facilitate information
retrieval by providing a structured framework for indexing and querying data.
Machine Learning: Attribute-value representations are commonly used in
machine learning algorithms.
Knowledge Representation: Attribute-value representations can be used for
representing domain knowledge in knowledge-based systems.
Data Visualization Representations
Visualization means using graphical representations such as heat maps, scatter
plots, and bar charts to visualize data distributions, correlations, and trends.
These techniques help in understanding complex data sets, identifying trends,
spotting anomalies, and making data-driven decisions.
Features Of Data Visualization Representations
Visual Encoding: Data visualization representations use visual elements to encode
data attributes, making it easier for users to perceive and interpret patterns in the
data.
Multidimensionality: Data visualization representations can visualize data with
multiple dimensions or attributes.
Scalability: Data visualization representations can scale to large datasets.
Customization: Users can often customize data visualization representations.
Visualization Techniques
Some common visualization techniques used in data mining are :
Histograms
Bar Charts
Line Graphs
Scatter Plots
Pie Charts
Box Plots
Histogram
A histogram is a graph that shows the frequency of numerical data using
rectangles. The height of a rectangle (the vertical axis) represents the distribution
frequency of a variable (the amount, or how often that variable appears).
It is a graphical representation of the distribution of numerical data, showing the
frequency of data points in successive intervals (bins).
Use Case: Understanding the distribution of a single variable, such as the age
distribution of a population.
Bar Chart
A bar chart or bar graph is a chart or graph that presents categorical data with
rectangular bars with heights or lengths proportional to the values that they
represent.
The bars can be plotted vertically or horizontally.
A chart with rectangular bars representing the frequency or value of different
categories.
Use Case: Comparing quantities across different categories, such as sales figures
for different products.
Line Graph
A line graph also known as a line plot or a line chart is a graph that uses lines to
connect individual data points. A line graph displays quantitative values over a
specified time interval.
The graph represents quantitative data between two changing variables with a line
or curve that joins a series of successive data points.
Use Case: Visualizing trends over time, such as stock price movements or
temperature changes.
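The three techniques above can be produced with matplotlib as sketched below; the numbers are invented purely for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))

# Histogram: distribution of a single variable (e.g., ages).
ages = np.random.default_rng(0).normal(35, 10, 500)
ax1.hist(ages, bins=20)
ax1.set_title("Age distribution")

# Bar chart: comparing quantities across categories (e.g., product sales).
ax2.bar(["A", "B", "C"], [120, 90, 150])
ax2.set_title("Sales by product")

# Line graph: a trend over time (e.g., monthly revenue).
ax3.plot(range(1, 13), [10, 12, 9, 14, 15, 18, 17, 20, 22, 21, 25, 28])
ax3.set_title("Monthly revenue")

plt.tight_layout()
plt.show()
```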
Importance Of Data Visualization Representations
Insight Discovery: Data visualization representations enable users to explore and
discover insights in the data that might not be apparent from raw data tables or
numerical summaries alone.
Communication: Data visualization representations provide an effective means of
communicating complex data and insights.
Decision Making: Data visualization representations support data-driven decision-making.
User Engagement: Interactive data visualization representations engage users
more actively in the data analysis process.
Applications Of Knowledge Representation
Artificial Intelligence (AI) and Expert Systems: KR is foundational in AI for
building expert systems that emulate
human expertise in specific domains.
Natural Language Processing (NLP): KR plays a crucial role in NLP for
representing the meaning of natural language text.
Semantic Web: KR is fundamental to the Semantic Web initiative.
Database Systems: KR techniques are employed in database systems for representing complex data structures, relationships, and constraints, and for supporting data management and retrieval.
Robotics and Autonomous Systems: KR is used in robotics and autonomous
systems to represent knowledge about the environment, tasks, and actions.
Healthcare Informatics: KR is applied in healthcare informatics for organizing
and managing medical knowledge, patient data, and clinical guidelines.
Education and E-Learning: KR techniques are utilized in educational systems and
e-learning platforms.
Information Retrieval and Recommender Systems: KR is employed in
information retrieval systems and recommender systems to represent user
preferences, item attributes, and semantic relationships.
Cybersecurity and Fraud Detection: KR techniques are applied in cybersecurity
and fraud detection systems.
Environmental Modeling and Sustainability: KR is used in environmental
modeling and sustainability initiatives.
Data Processing
Data processing occurs when data is collected and translated into usable
information. Usually performed by a data scientist or a team of data scientists, data processing must be done correctly so as not to negatively affect the end product, or data output.
Data processing starts with data in its raw form and converts it into a more
readable format (graphs, documents, etc.), giving it the form and context
necessary to be interpreted by computers and utilized by employees throughout an
organization.
Stages of Data Processing
1. Data collection
Collecting data is the first step in data processing. Data is pulled from available
sources, including data lakes and data warehouses.
It is important that the data sources available are trustworthy and well-built so the
data collected (and later used as information) is of the highest possible quality.
2. Data preparation
Once the data is collected, it then enters the data preparation stage. Data
preparation, often referred to as "pre-processing," is the stage at which raw data is cleaned up and organized for the following stage of data processing.
During preparation, raw data is diligently checked for any errors. The purpose of
this step is to eliminate bad data (redundant, incomplete, or incorrect data) and
begin to create high-quality data for the best business intelligence.
3. Data input
The clean data is then entered into its destination (perhaps a CRM like Salesforce
or a data warehouse like Redshift), and translated into a language that it can
understand.
Data input is the first stage in which raw data begins to take the form of usable
information.
4. Processing
During this stage, the data inputted to the computer in the previous stage is
actually processed for interpretation.
Processing is done using machine learning algorithms, though the process itself
may vary slightly depending on the source of data being processed (data lakes,
social networks, connected devices etc.) and its intended use (examining
advertising patterns, medical diagnosis from connected devices, determining
customer needs, etc.).
5. Data output/interpretation
The output/interpretation stage is the stage at which data is finally usable to non-data scientists. It is translated, readable, and often in the form of graphs, videos, images, plain text, etc.
Members of the company or institution can now begin to self-serve the data for
their own data analytics projects.
6. Data storage
The final stage of data processing is storage. After all of the data is processed, it is
then stored for future use. While some information may be put to use
immediately, much of it will serve a purpose later on.
Plus, properly stored data is a necessity for compliance with data protection legislation.
Data Cleaning
Data cleaning is the process of fixing or removing incorrect, corrupted,
incorrectly formatted, duplicate, or incomplete data within a dataset.
When combining multiple data sources, there are many opportunities for data to
be duplicated or mislabeled.
If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. There is no single prescribed set of steps in the data cleaning process, because the process varies from dataset to dataset, but it is crucial to establish a template for your data cleaning process so you know you are doing it the right way every time.
How to Clean Data
Step 1: Remove duplicate or irrelevant observations
Remove unwanted observations from your dataset, including duplicate
observations or irrelevant observations.
Duplicate observations will happen most often during data collection. When you
combine data sets from multiple places, scrape data, or receive data from clients
or multiple departments, there are opportunities to create duplicate data.
De-duplication is one of the largest areas to be considered in this process.
Irrelevant observations are those that do not fit into the specific problem you are trying to analyze.
Step 2: Fix structural errors
Structural errors are when you measure or transfer data and notice strange naming
conventions or incorrect capitalization. These inconsistencies can cause
mislabeled categories or classes.
Step 3: Filter unwanted outliers
Often there will be one-off observations that, at a glance, do not appear to fit within the data you are analyzing.
If you have a legitimate reason to remove an outlier, such as improper data entry, doing so will improve the quality of the data you are working with. However, sometimes it is the appearance of an outlier that proves a theory you are working on.
Note: just because an outlier exists doesn't mean it is incorrect. This step is needed to determine the validity of that value. If an outlier proves to be irrelevant for analysis or is a mistake, consider removing it.
Step 4: Handle missing data
As a first option, you can drop observations that have missing values, but doing
this will drop or lose information.
As a second option, you can impute missing values based on other observations.
Note: there is a risk of losing the integrity of the data, because you may be
operating from assumptions rather than actual observations.
As a third option, you might alter the way the data is used so that it can
effectively navigate null values.
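The three options above might look roughly as follows in pandas; the column names and the median fill strategy are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Invented records with gaps in both columns.
df = pd.DataFrame({"age": [25, np.nan, 31, 40],
                   "income": [300, 450, np.nan, 520]})

# Option 1: drop observations that have missing values (loses information).
dropped = df.dropna()

# Option 2: impute missing values from the other observations, e.g. the median.
imputed = df.fillna(df.median(numeric_only=True))

# Option 3: keep the nulls but make them explicit so downstream analysis can
# handle them deliberately, e.g. with an indicator column.
flagged = df.assign(age_missing=df["age"].isna())

print(dropped, imputed, flagged, sep="\n\n")
```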
Step 5: Validate and QA
At the end of the data cleaning process you should be able to answer these
questions as a part of basic validation:
Does the data make sense?
Does the data follow the appropriate rules for its field?
Does it prove or disprove your working theory or bring any insight to light?
Can you find trends in the data to help you form your next theory?
If not, is that because of a data quality issue?
Characteristics of Quality Data
1. Validity.
- The degree to which your data conforms to defined business rules or constraints.
2. Accuracy.
- Ensure your data is close to the true values.
3. Completeness.
- The degree to which all required data is known.
4. Consistency.
- Ensure your data is consistent within the same dataset and/or across multiple
data sets.
5. Uniformity.
- The degree to which the data is specified using the same unit of measure.
Data Transformation
Data transformation is the process of converting, cleaning, and structuring data
into a usable format that can be analyzed to support decision making processes
and to propel the growth of an organization.
Transformation is an essential step in many processes, such as data
integration, migration and warehousing. The process of data transformation is
typically carried out in the stages described below.
How is Data Transformation Used
1.Discovery
The first step is to identify and understand the data in its original source format
with the help of data profiling tools, finding all the sources and data types that
need to be transformed.
This step helps in understanding how the data needs to be transformed to fit into
the desired format.
2. Mapping
The transformation is planned during the data mapping phase. This includes
determining the current structure and the transformation that is required, then
mapping the data to understand, at a basic level, how individual fields will be
modified, joined or aggregated.
3.Code generation
The code, which is required to run the transformation process, is created in this
step using a data transformation platform or tool.
4. Execution
The data is finally converted into the selected format with the help of the code.
The data is extracted from the source(s), which can vary from structured to
streaming, telemetry to log files.
Next, transformations are carried out on data, such as aggregation, format
conversion or merging, as planned in the mapping stage.
The transformed data is then sent to the destination system which could be a
dataset or a data warehouse.
5. Review
The transformed data is evaluated to ensure the conversion has had the desired
results in terms of the format of the data.
It must also be noted that not all data will need transformation; at times it can be
used as is.
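As a small, hedged illustration of the mapping-and-execution idea, the sketch below parses a date field into a uniform format and aggregates hypothetical transaction records before loading; every column name here is an assumption:

```python
import pandas as pd

# Invented source records; every column name is an assumption.
raw = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-01-05", "2024-01-07"],
    "store":      ["A", "A", "B"],
    "amount":     [100.0, 40.0, 75.0],
})

# Format conversion planned in the mapping stage: parse dates uniformly.
raw["order_date"] = pd.to_datetime(raw["order_date"])

# Aggregation: total sales per store per day, ready to load into the
# destination dataset or warehouse.
transformed = (
    raw.groupby(["store", "order_date"], as_index=False)["amount"]
       .sum()
       .rename(columns={"amount": "daily_sales"})
)
print(transformed)
```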
Feature Selection
It is the process of automatically or manually selecting the subset of the most
appropriate and relevant features to be used in model building.
Feature selection is performed by either including the important features or
excluding the irrelevant features in the dataset, without changing them.
Types of feature selection techniques
There are mainly three types of feature selection techniques:
i. Wrapper method
ii. Filter method
iii. Embedded method
Wrapper method
In the wrapper method, feature selection is treated as a search problem, in which
different combinations of features are made, evaluated, and compared with other
combinations.
Filter method
In the filter method, features are selected on the basis of statistical measures. This
method does not depend on the learning algorithm and chooses the features as a
pre-processing step (a short sketch of this approach follows the three methods).
Embedded method
Embedded methods combine the advantages of both filter and wrapper methods
by considering the interaction of features along with low computational cost.
These are fast processing methods similar to the filter method but more accurate
than the filter method.
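As an illustration of the filter approach, the sketch below scores features with a chi-square test in scikit-learn; the dataset and the value of k are assumptions, and a wrapper or embedded method would instead use, for example, recursive feature elimination or Lasso:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# The iris data and the choice of k are assumptions made for illustration.
X, y = load_iris(return_X_y=True)

# Filter method: score each feature against the target with a chi-square
# statistic and keep the k best, independently of any learning algorithm.
selector = SelectKBest(score_func=chi2, k=2)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)      # (150, 4) -> (150, 2)
print(selector.get_support(indices=True))  # indices of the retained features
```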
Benefits of feature selection
It helps in the simplification of the model so that it can be easily interpreted by
the researchers.
It reduces the training time.
It reduces overfitting and hence enhances generalization.
Dimensionality Reduction
Dimensionality reduction is a process used in data analysis and machine
learning to reduce the number of variables under consideration. It simplifies the
dataset while retaining as much information as possible.
OR
Dimensionality reduction is a technique used to reduce the number of features in
a dataset while retaining as much of the important information as possible.
In other words, it is a process of transforming high-dimensional data into a lower-
dimensional space that still preserves the essence of the original data.
Dimensionality reduction can help mitigate problems such as overfitting and high
computational cost by reducing the complexity of the model and improving its
generalization performance.
There are two main approaches to dimensionality reduction:
1. Feature Selection
2. Feature Extraction
Feature Selection:
Feature selection involves selecting a subset of the original features that are most
relevant to the problem at hand. The goal is to reduce the dimensionality of the
dataset while retaining the most important features.
This involves selecting a subset of the most important features based on some
criteria, such as:
Filter Methods: Use statistical measures to score and select features (e.g., chi-
square test, correlation coefficient).
Wrapper Methods: Use a predictive model to score feature subsets and select the
best-performing combination (e.g., recursive feature elimination).
Embedded Methods: Perform feature selection as part of the model training
process (e.g., Lasso regression).
Feature Extraction:
Feature extraction involves creating new features by combining or transforming
the original features. The goal is to create a set of features that captures the
essence of the original data in a lower-dimensional space.
This transforms the data from a high-dimensional space to a lower-dimensional
space. Key techniques include:
Principal Component Analysis (PCA): A linear technique that transforms the data
to a new coordinate system where the greatest variance lies on the first axis, the
second greatest on the second axis, and so on.
Linear Discriminant Analysis (LDA): Similar to PCA but also considers class
labels and aims to maximize the separation between multiple classes.
t-Distributed Stochastic Neighbor Embedding (t-SNE): A non-linear technique
primarily used for visualization in 2 or 3 dimensions.
Autoencoders: Neural networks that learn to compress data into a lower-
dimensional representation and then reconstruct it back to the original form.
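As a short, hedged illustration of feature extraction, the sketch below applies PCA with scikit-learn; the dataset and the choice of two components are assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# The iris data and the 2-component setting are assumptions for illustration.
X, _ = load_iris(return_X_y=True)

# Standardise first so that no single feature dominates the variance.
X_scaled = StandardScaler().fit_transform(X)

# Project the 4-dimensional data onto the 2 directions of greatest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```

The explained variance ratio is one common way of deciding how many components to keep, which relates to the "rules of thumb" point listed under the disadvantages below.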
Advantages
1.It helps in data compression and hence reduces the required storage space.
2.It reduces computation time.
3.It also helps remove redundant features, if any.
4.Improved Visualization: High dimensional data is difficult to visualize, and
dimensionality reduction techniques can help in visualizing the data in 2D or 3D,
which can help in better understanding and analysis.
5.Overfitting Prevention: High dimensional data may lead to overfitting in
machine learning models, which can lead to poor generalization performance.
Dimensionality reduction can help in reducing the complexity of the data, and
hence prevent overfitting.
6.Feature Extraction: Dimensionality reduction can help in extracting important
features from high dimensional data, which can be useful in feature selection for
machine learning models.
7.Data Preprocessing: Dimensionality reduction can be used as a preprocessing
step before applying machine learning algorithms to reduce the dimensionality of
the data and hence improve the performance of the model.
Disadvantages
1.It may lead to some amount of data loss.
2.PCA tends to find linear correlations between variables, which is sometimes
undesirable.
3.PCA fails in cases where mean and covariance are not enough to define datasets.
4.We may not know how many principal components to keep; in practice, some rules of
thumb are applied.
5.Interpretability: The reduced dimensions may not be easily interpretable, and it may
be difficult to understand the relationship between the original features and the reduced
dimensions.
6.Overfitting: In some cases, dimensionality reduction may lead to overfitting,
especially when the number of components is chosen based on the training data.
7.Sensitivity to outliers: Some dimensionality reduction techniques are sensitive to
outliers, which can result in a biased representation of the data.
8.Computational complexity: Some dimensionality reduction techniques, such as
manifold learning, can be computationally intensive, especially when dealing with large
datasets.
Frequent Pattern Mining in Data Mining
Frequent pattern mining in data mining is the process of identifying patterns or
associations within a dataset that occur frequently. This is typically done by
analyzing large datasets to find items or sets of items that appear together
frequently.
There are several different algorithms used for frequent pattern mining, including:
Apriori algorithm: This is one of the most commonly used algorithms for frequent
pattern mining. It uses a “bottom-up” approach to identify frequent itemsets and
then generates association rules from those itemsets.
Apriori is one of the most widely used algorithms for association rule mining. It
generates frequent item sets from a given dataset by pruning infrequent item sets
iteratively.
The Apriori algorithm is based on the concept that if an item set is frequent, then
all of its subsets must also be frequent.
The algorithm first identifies the frequent items in the dataset, then generates
candidate item sets of length two from the frequent items, and so on until no more
frequent item sets can be generated.
The Apriori algorithm is computationally expensive, especially for large datasets
with many items.
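A simplified, pure-Python sketch of the level-wise Apriori idea; the transactions and support threshold are invented, and the full subset-pruning step is omitted for brevity:

```python
# A simplified, level-wise Apriori-style sketch with made-up example values.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]
min_support = 2  # an itemset must appear in at least 2 transactions


def frequent_itemsets(transactions, min_support):
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in items]  # level-1 candidates
    frequent = {}
    k = 1
    while current:
        # Count the support of each candidate itemset in one pass.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Build (k+1)-item candidates only from frequent k-itemsets -- a
        # simplified use of the Apriori property (full subset pruning omitted).
        k += 1
        current = list({a | b for a in level for b in level if len(a | b) == k})
    return frequent


for itemset, support in sorted(frequent_itemsets(transactions, min_support).items(),
                               key=lambda x: -x[1]):
    print(sorted(itemset), support)
```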
ECLAT algorithm:
This algorithm uses a “depth-first search” approach to identify frequent itemsets.
It is particularly efficient for datasets with a large number of items.
Eclat (Equivalence Class Clustering and Bottom-up Lattice Traversal) is a
frequent itemset mining algorithm based on the vertical data format. The
algorithm first converts the dataset into a vertical data format, where each item
is stored together with the IDs of the transactions in which it appears.
Eclat then performs a depth-first search on a tree-like structure representing the
dataset's frequent itemsets. The algorithm is efficient in terms of both memory
usage and runtime, especially for sparse datasets.
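A minimal sketch of Eclat's vertical (tid-list) representation, using invented toy transactions; the support of an itemset is obtained by intersecting tid-lists rather than rescanning the data:

```python
# Toy transactions, invented for illustration, keyed by transaction ID.
transactions = {
    1: {"bread", "butter", "milk"},
    2: {"bread", "butter"},
    3: {"milk", "eggs"},
    4: {"bread", "butter", "eggs"},
}

# Vertical format: each item maps to the set of transaction IDs containing it.
tidlists = {}
for tid, items in transactions.items():
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

# The support of {bread, butter} is the size of the intersection of their
# tid-lists: here it is 3 (transactions 1, 2 and 4).
support = len(tidlists["bread"] & tidlists["butter"])
print(support)
```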
FP-growth algorithm:
This algorithm uses a “compression” technique to find frequent patterns
efficiently. It is particularly efficient for datasets with a large number of
transactions.
It is based on the concept of frequent pattern growth and is faster than the Apriori
algorithm, especially for large datasets.
The FP-Growth algorithm builds a compact representation of the dataset called a
frequent pattern tree (FP-tree), which is used to mine frequent item sets. The
algorithm scans the dataset only twice: first to build the FP-tree and then to mine
the frequent itemsets.
The FP-Growth algorithm can handle datasets with both discrete and continuous
attributes.
Frequent pattern mining has many applications, such as market basket analysis,
recommender systems, fraud detection, and many more.
Applications of Frequent pattern mining
Frequent pattern mining has several applications in different areas, including:
1.Market Basket Analysis: This is the process of analyzing customer purchasing
patterns in order to identify items that are frequently bought together.
This information can be used to optimize product placement, create targeted
marketing campaigns, and make other business decisions.
2. Recommender Systems: Frequent pattern mining can be used to identify
patterns in user behavior and preferences in order to make personalized
recommendations.
3.Fraud Detection: Frequent pattern mining can be used to identify abnormal
patterns of behavior that may indicate fraudulent activity.
4.Network Intrusion Detection: Network administrators can use frequent pattern
mining to detect patterns of network activity that may indicate a security threat.
5.Medical Analysis: Frequent pattern mining can be used to identify patterns in
medical data that may indicate a particular disease or condition.
6.Text Mining: Frequent pattern mining can be used to identify patterns in text
data, such as keywords or phrases that appear frequently together in a document.
7.Web usage mining: Frequent pattern mining can be used to analyze patterns of
user behavior on a website, such as which pages are visited most frequently or
which links are clicked on most often.
8.Gene Expression: Frequent pattern mining can be used to analyze patterns of
gene expression in order to identify potential biomarkers for different diseases.
Association:
Definition: Association in data mining refers to finding relationships between
variables in a dataset. It helps us understand how one variable is related to
another. This is particularly useful in fields like market analysis, where
understanding customer behavior is crucial for business success.
Example: In a grocery store dataset, association analysis might reveal that
customers who buy bread often also buy butter. This insight can lead to strategic
decisions like placing butter next to bread in the store to increase sales.
In an online retail dataset, association analysis could show that customers who
purchase a laptop often also buy a laptop bag and a mouse. This information can
be used to bundle products together or provide targeted recommendations to
customers during their shopping experience.
Importance:
Understanding associations helps businesses make strategic decisions, such as
product placement and marketing strategies. It enables businesses to identify
cross-selling opportunities and optimize their operations based on customer
behavior.
Technique: Association rule mining, like the Apriori algorithm, is commonly used
to discover these relationships. This algorithm scans transaction data to find
patterns and associations between items that frequently co-occur in transactions.
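Continuing the bread-and-butter example, the sketch below computes the support and confidence of the rule bread => butter from invented transactions (the data and thresholds are assumptions made purely for illustration):

```python
# Made-up transactions echoing the bread-and-butter example in the text.
transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]
n = len(transactions)

bread = sum(1 for t in transactions if "bread" in t)
both = sum(1 for t in transactions if {"bread", "butter"} <= t)

support = both / n          # how often bread and butter occur together: 3/5
confidence = both / bread   # how often butter is bought given bread: 3/4

print(f"support={support:.2f}, confidence={confidence:.2f}")
```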
Types of Associations
Here are the most common types of associations used in data mining:
Itemset Associations: Itemset association is the most common type of association
analysis, which is used to discover relationships between items in a dataset. In
this type of association, a collection of one or more items that frequently co-occur
together is called an itemset.
For example, in a supermarket dataset, itemset association can be used to identify
items that are frequently purchased together, such as bread and butter.
Sequential Associations: Sequential association is used to identify patterns that
occur in a specific sequence or order.
This type of association analysis is commonly used in applications such as
analyzing customer behaviour on e-commerce websites or studying weblogs.
For example, in a weblog dataset, sequential association can be used to identify
the sequence of pages that users visit before making a purchase.
Graph-based Associations: Graph-based association is a type of association
analysis that involves representing the relationships between items in a dataset as
a graph.
In this type of association, each item is represented as a node in the graph, and the
edges between nodes represent the co-occurrence or relationship between items.
Graph-based association is used in various applications, such as social network
analysis, recommendation systems, and fraud detection.
For example, in a social network dataset, graph-based association can be used to
identify groups of users with similar interests or behaviours.
Association Rule Mining
Here are the most commonly used algorithms to implement association rule
mining in data mining:
Apriori Algorithm - Apriori is one of the most widely used algorithms for
association rule mining. It generates frequent item sets from a given dataset
by pruning infrequent item sets iteratively. The Apriori algorithm is based
on the concept that if an item set is frequent, then all of its subsets must also
be frequent.
The algorithm first identifies the frequent items in the dataset, then
generates candidate itemsets of length two from the frequent items, and so
on until no more frequent itemsets can be generated. The Apriori algorithm
is computationally expensive, especially for large datasets with many items.
FP-Growth Algorithm - FP-Growth is another popular algorithm for
association rule mining that is based on the concept of frequent pattern
growth. It is faster than the Apriori algorithm, especially for large datasets.
The FP-Growth algorithm builds a compact representation of the dataset
called a frequent pattern tree (FP-tree), which is used to mine frequent item
sets.
The algorithm scans the dataset only twice, first to build the FP-tree and
then to mine the frequent itemsets. The FP-Growth algorithm can handle
datasets with both discrete and continuous attributes.
Eclat Algorithm - Eclat (Equivalence Class Clustering and Bottom-up
Lattice Traversal) is a frequent itemset mining algorithm based on the
vertical data format.
The algorithm first converts the dataset into a vertical data format, where
each item is stored together with the IDs of the transactions in which it appears.
Eclat then performs a depth-first search on a tree-like structure representing
the dataset's frequent itemsets. The algorithm is efficient in terms of both
memory usage and runtime, especially for sparse datasets.
Correlation Analysis In Data Mining
Correlation Analysis is a data mining technique used to identify the degree to
which two or more variables are related or associated with each other. Correlation
refers to the statistical relationship between two or more variables, where the
variation in one variable is associated with the variation in another variable.
In other words, it measures how changes in one variable are related to changes in
another variable. Correlation can be positive, negative, or zero, depending on the
direction and strength of the relationship between the variables.
For example, suppose we are studying the relationship between the hours of study
and the grades obtained by students.
If we find that as the number of hours of study increases, the grades obtained also
increase, then there is a positive correlation between the two variables. On the
other hand, if we find that as the number of hours of study increases, the grades
obtained decrease, then there is a negative correlation between the two variables.
If there is no relationship between the two variables, we would say that there is
zero correlation.
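A quick illustration of the hours-of-study example with NumPy; the numbers are invented purely to show a positive correlation:

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6])         # hours studied (invented values)
grades = np.array([52, 58, 61, 70, 74, 80])  # grades obtained (invented values)

# Pearson correlation coefficient: near +1 means a strong positive relationship,
# near -1 a strong negative one, and near 0 no linear relationship.
r = np.corrcoef(hours, grades)[0, 1]
print(round(r, 3))
```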
Why is Correlation Analysis Important?
Correlation analysis is important because it allows us to measure the strength and
direction of the relationship between two or more variables.
This information can help identify patterns and trends in the data, make
predictions, and select relevant variables for analysis.
By understanding the relationships between different variables, we can gain
valuable insights into complex systems and make informed decisions based on
data-driven analysis.
Benefits of correlation analysis
1.Identifying Relationships - Correlation analysis helps identify the relationships
between different variables in a dataset. By quantifying the degree and direction
of the relationship, we can gain insights into how changes in one variable are
likely to affect the other.
2.Prediction - Correlation analysis can help predict one variable's values based on
another variable's values. Models built on these correlations can be used to predict
future outcomes and support informed decisions.
3.Feature Selection - Correlation analysis can also help select the most relevant
features for a particular analysis or model. By identifying the features that are
highly correlated with the outcome variable, we can focus on those features and
exclude the irrelevant ones, improving the accuracy and efficiency of the analysis
or model.
4.Quality Control - Correlation analysis is useful in quality control applications,
where it can be used to identify correlations between different process variables
and identify potential sources of quality problems.
Here are some examples of the most common use cases for association and correlation
in data mining -
1.Market Basket Analysis - Association mining is commonly used in retail and e-
commerce industries to identify patterns in customer purchase behavior. By analyzing
transaction data, businesses can uncover product associations and make informed
decisions about product placement, pricing, and marketing strategies.
2.Medical Research - Correlation analysis is often used in medical research to explore
relationships between different variables, such as the correlation between smoking and
lung cancer risk or the correlation between blood pressure and heart disease.
3.Financial Analysis - Correlation analysis is frequently used in financial analysis to
measure the strength of relationships between different financial variables, such as the
correlation between stock prices and interest rates.
4.Fraud Detection - Association mining can be used to identify behavior patterns
associated with fraudulent activity, such as multiple failed login attempts or unusual
purchase patterns.