
DATA WAREHOUSING

AND DATA MINING


COMPLETE GUIDE FOR BCA/MCA EXAMINATION

By Dr. Anil Kumar, Ph.D.


CONTENTS

1. Introduction to Data Warehouse
2. Introduction to Data Mining
3. Introduction to Data Mart
4. Association Rule Mining
5. Clustering


Chapter

1
Introduction to Data Warehouse
1.1 Definition of Data Warehouse

A data warehouse is a central storage system that collects and manages large amounts of data from different sources. It is designed for fast data retrieval, analysis, and reporting. Organizations use data warehouses to make informed decisions because they provide a single, reliable source of information.

A data warehouse works by combining and organizing data from various sources into a consistent and structured format. This ensures that the data is accurate and easy to analyze. A well-managed data warehouse helps users understand trends and patterns in specific areas.

A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.

1.2 Characteristics of Data Warehouse

A data warehouse has the following characteristics:

1. Subject-Oriented
A data warehouse focuses on specific topics or themes rather than daily operations. It organizes data based on a particular subject, such as sales, marketing, or distribution. Unlike operational databases, which handle real-time transactions, a data warehouse helps analyze data and make better business decisions. It filters out unnecessary details and provides a clear and concise view of relevant information.

2. Integrated
Integration means that a data warehouse combines data from multiple sources into a standardized format. This ensures that all data follows the same naming conventions, formats, and coding standards.
For example, a data warehouse may collect information from mainframe systems, relational databases, and cloud storage. By integrating data from different sources, businesses can perform effective analysis and decision-making.

3. Time-Variant
A data warehouse stores information over a long period to analyze trends and changes over time. Data is recorded at different intervals (daily, weekly, monthly, or yearly), making it possible to study historical trends.
Unlike operational databases, which focus on current data, a data warehouse allows organizations to track changes and make long-term strategic decisions. Once stored, data cannot be modified—it remains as a historical record for future analysis.

4. Non-Volatile
Non-volatile means that data in a data warehouse is never deleted or changed after being stored. When new data is added, the existing data remains unchanged.
This feature ensures that businesses always have access to past information for historical analysis and trend identification. The data warehouse does not require transaction processing, concurrency control, or frequent updates like operational databases.

There are two key types of data operations in a data warehouse:
• Data Loading – Storing new data into the system.
• Data Access – Retrieving stored data for analysis.

1.3 Functions of a Data Warehouse

A data warehouse serves multiple purposes, including storing, organizing, and analyzing data. Below are some of its main functions:

1. Data Consolidation
Merging data from different sources into a single central repository ensures accuracy and consistency.

2. Data Cleaning
Identifying and removing errors, duplicates, and inconsistencies before storing data improves data quality and reliability.

3. Data Integration
Combining structured and unstructured data from multiple sources into a unified format allows businesses to perform better analysis.

4. Data Storage
Data warehouses can store large amounts of historical data, making it easy to access and analyze when needed.

5. Data Transformation
Data is converted into a standardized and structured format by removing duplicates and unnecessary details.

6. Data Analysis
Stored data is processed and analyzed to generate insights and reports for business decision-making.

7. Data Reporting
Data warehouses provide dashboards and reports that help organizations track performance and identify trends.

8. Data Mining
Advanced techniques such as machine learning and pattern recognition are used to discover useful insights from large datasets.

9. Performance Optimization
A data warehouse is optimized for fast querying and efficient analysis, ensuring quick access to data when needed.

By using a data warehouse, organizations can make data-driven decisions, improve efficiency, and gain a competitive advantage.

1.4 Purpose of a Data Warehouse

A data warehouse is designed to collect, store, and analyze large amounts of data from various sources. Its main purpose is to help businesses make better, data-driven decisions by providing a centralized, structured, and reliable source of information. Below are some key purposes of a data warehouse:

1. Centralized Data Storage
A data warehouse consolidates data from multiple sources (such as databases, cloud storage, and applications) into a single location. This eliminates data silos and makes it easier to access and manage information.

2. Improved Decision-Making
By providing historical and current data in an organized manner, a data warehouse enables businesses to analyze trends, predict future outcomes, and make informed strategic decisions.

3. Faster Query Performance
Data warehouses are optimized for fast data retrieval and reporting. Unlike traditional databases, which handle frequent updates, data warehouses focus on efficient analysis and complex queries.

4. Data Integration from Multiple Sources
A data warehouse combines and standardizes data from various sources, ensuring consistency in naming conventions, formats, and codes. This makes analysis more accurate and reliable.

5. Historical Data Analysis
Since data warehouses store long-term historical data, organizations can track performance over time, identify patterns, and compare past and present trends.

6. Business Intelligence and Reporting
A data warehouse supports business intelligence (BI) tools, dashboards, and reporting systems, helping organizations generate insights through charts, graphs, and summary reports.

7. Enhanced Data Quality and Accuracy
By implementing data cleaning and transformation processes, a data warehouse ensures that stored data is consistent, error-free, and reliable for analysis.

8. Scalability and Performance Optimization
Data warehouses are designed to handle large volumes of data efficiently, making them suitable for growing businesses with increasing data needs.

9. Security and Data Governance
A data warehouse provides controlled access to data, ensuring that only authorized users can retrieve and analyze specific datasets. This enhances data security and compliance with regulations.

1.5 Data Warehouse Design Process

A data warehouse can be built using a top-down approach, a bottom-up approach, or a combination of both.

1. Top-Down approach
The top-down approach starts with the overall design and planning. It is useful in cases where the technology is mature and well known, and where the business problems that must be solved are clear and well understood.

2. Bottom-up approach
The bottom-up approach starts with experiments and prototypes. This is useful in the early stage of business modeling and technology development. It allows an organization to move forward at considerably less expense and to evaluate the benefits of the technology before making significant commitments.

3. Hybrid approach
In the hybrid or combined approach, an organization can exploit the planned and strategic nature of the top-down approach while retaining the rapid implementation and opportunistic application of the bottom-up approach.

The warehouse design process consists of the following steps (a short sketch of the resulting design choices follows this list):
• Choose a business process to model, for example, orders, invoices, shipments, inventory, account administration, sales, or the general ledger. If the business process is organizational and involves multiple complex object collections, a data warehouse model should be followed. However, if the process is departmental and focuses on the analysis of one kind of business process, a data mart model should be chosen.
• Choose the grain of the business process. The grain is the fundamental, atomic level of data to be represented in the fact table for this process, for example, individual transactions, individual daily snapshots, and so on.
• Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status.
• Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like dollars sold and units sold.
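To make the four design steps concrete, the sketch below records one possible set of design decisions for a hypothetical retail sales process. The process name, grain, dimensions, and measures are illustrative assumptions, not taken from the text, and they correspond to the kind of star schema described later in Section 1.9.

# Illustrative only: one possible outcome of the four design steps
# for a hypothetical retail "sales" business process.
sales_design = {
    "business_process": "sales",                                # step 1: process to model
    "grain": "one row per individual sales transaction",        # step 2: grain of the fact table
    "dimensions": ["time", "item", "branch", "location"],       # step 3: dimensions
    "measures": ["dollars_sold", "units_sold"],                 # step 4: additive measures
}

print(sales_design["grain"])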

1.6 Data Warehouse Architecture

A data warehouse is like a big storage system that helps businesses analyze large amounts of data to make informed decisions.
To organize and process data efficiently, a Three-Tier Architecture is used, which divides the system into three layers:
1. Bottom Tier – Data Storage and Processing
2. Middle Tier – Data Analysis (OLAP Engine)
3. Top Tier – User Interface and Reporting
Each of these layers plays a key role in storing, processing, and analyzing data to make it useful for businesses.

1. Bottom Tier – Data Storage and ETL Process
This is the foundation of the data warehouse where all data is collected, cleaned, stored, and prepared for analysis. It contains a relational database where data from different sources is combined.
The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (such as customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse.
The data are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server. Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity). This tier also contains a metadata repository, which stores information about the data warehouse and its contents.

The ETL process (Extract, Transform, Load) is used to:
• Extract data from different sources like databases, files, or web services.
• Transform data by cleaning, filtering, and organizing it to match business needs.
• Load the processed data into a structured storage system for easy access.

Tools Used in the ETL Process:
• IBM Infosphere
• Informatica
• Microsoft SSIS
• SnapLogic

Common Challenges & Solutions
• Inconsistent data – Use data cleaning techniques to fix errors before storing.
• Different data formats – Standardize data formats before loading into the warehouse.
• Growing data volume – Design a scalable storage system that can handle more data over time.
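As an illustration of the extract–transform–load flow described above, here is a minimal Python sketch. The file name, table name, and cleaning rules are hypothetical, and a real ETL job would typically be built with one of the dedicated tools listed above rather than hand-written code.

import csv
import sqlite3

# Extract: read raw rows from a (hypothetical) operational export file.
def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# Transform: clean and standardize the rows (drop incomplete records,
# normalize the date format, cast amounts to numbers).
def transform(rows):
    cleaned = []
    for r in rows:
        if not r.get("customer_id") or not r.get("amount"):
            continue  # skip incomplete records
        cleaned.append({
            "customer_id": r["customer_id"].strip(),
            "sale_date": r["date"].replace("/", "-"),
            "amount": float(r["amount"]),
        })
    return cleaned

# Load: write the cleaned rows into the warehouse's relational store.
def load(rows, conn):
    conn.execute("""CREATE TABLE IF NOT EXISTS sales_fact (
                        customer_id TEXT, sale_date TEXT, amount REAL)""")
    conn.executemany(
        "INSERT INTO sales_fact VALUES (:customer_id, :sale_date, :amount)", rows)
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")
    load(transform(extract("daily_sales_export.csv")), conn)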

2. Middle Tier – OLAP Server for Data Analysis
This is where complex data analysis and calculations take place. The OLAP (Online Analytical Processing) server is used to process large datasets efficiently.
OLAP model is an extended relational DBMS that maps operations on multidimensional data to standard relational operations.

Types of OLAP Models:
• ROLAP (Relational OLAP) – Works with relational databases, great for handling large datasets.
• MOLAP (Multidimensional OLAP) – Uses a special data storage format that makes queries much faster.
• HOLAP (Hybrid OLAP) – A mix of ROLAP and MOLAP, balancing flexibility and speed.

Common Challenges & Solutions
• Data processing takes too long – Use query optimization techniques like indexing.
• Delays in updating data – Use real-time processing to keep data fresh.
• Merging data from different sources – Use tools like Talend or Informatica to standardize data formats.

3. Top Tier – User Interface and Reporting
The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
This is the front-end layer where users view and analyze data through dashboards, reports, and charts.

Popular BI Tools Used for Reporting:
• IBM Cognos – Advanced reporting and analytics.
• Microsoft BI – Works well with Excel and other Microsoft tools.
• SAP BW – Great for businesses using SAP software.

Common Challenges & Solutions
• Users find tools too complex – Provide training and support for better adoption.
• Integration issues with other data sources – Choose BI tools that integrate easily with existing systems.

1.7 Data Warehouse Models

A data warehouse is a system that stores and manages large volumes of data for analysis and decision-making. To handle data efficiently, three main types of data warehouse models exist:
1. Enterprise Data Warehouse
2. Data Mart
3. Virtual Warehouse
Each model serves a different purpose depending on the scale, functionality, and usage within an organization. Let's explore them in detail.

1. Enterprise Data Warehouse (EDW)
An Enterprise Data Warehouse (EDW) is a large-scale system designed to store and manage data for an entire organization. It integrates data from multiple sources, making it useful for company-wide decision-making.

Key Characteristics:
• Centralized Repository: Stores all data across the organization, providing a single source of truth for different departments.
• Comprehensive Data Integration: Collects and processes detailed and summarized data from multiple operational systems and external sources (e.g., customer records, financial data, sales transactions).
• Large Storage Capacity: Can store data ranging from gigabytes to terabytes or even petabytes, depending on the organization's needs.
• Cross-Functional Usage: Used by different teams like finance, marketing, sales, HR, and supply chain.
• Complex and Time-Consuming to Build: Requires years of planning, modeling, and implementation, often using high-performance computing platforms like mainframes, super servers, or parallel processing architectures.

2. Data Mart
A data mart is a smaller, specialized version of a data warehouse, designed to serve the needs of a specific department or user group. It extracts a subset of the data from an enterprise warehouse or other sources and focuses on a particular business function.

Key Characteristics:
• Narrow Scope: Stores data relevant to a particular department (e.g., marketing, sales, finance, HR).
• Faster Implementation: Can be developed in weeks instead of years, making it more cost-effective than an enterprise warehouse.
• Summarized Data: Contains pre-processed and aggregated data for quick analysis.
• Lower Infrastructure Cost: Usually deployed on departmental servers (Windows, UNIX, or Linux systems).
• Can Be Independent or Dependent:
  o Independent Data Mart: Created separately from an enterprise data warehouse, using data from different sources like operational databases or external data providers.
  o Dependent Data Mart: Extracted directly from an enterprise data warehouse to maintain data consistency.

Challenges of Data Marts:
• Limited Scope: Since data marts focus on a specific domain, they do not provide a holistic view of the organization's data.
• Integration Issues: If data marts are created without enterprise-wide planning, integrating them later with other data sources becomes complex.

3. Virtual Warehouse
A virtual warehouse is a logical view of data that does not store information in a separate system but instead creates on-the-fly summaries from existing operational databases.

Key Characteristics:
• No Physical Storage: Instead of storing copies of data, it retrieves information directly from operational systems using SQL views.
• Quick and Easy to Set Up: Does not require long implementation cycles or massive storage infrastructure.
• Flexible and Cost-Effective: Uses existing resources without needing a separate data warehouse system.
• Limited Performance for Large Queries: Since it pulls data directly from operational systems, complex queries can put a heavy load on live databases, affecting performance.

Challenges of Virtual Warehouses:
• High Dependency on Operational Systems: If the live database is slow or overloaded, queries in the virtual warehouse also slow down.
• Limited Query Optimization: Since data is not pre-processed, some analytical operations take longer than in a traditional warehouse.

1.8 Metadata Repository

Metadata are data about data. When used in a data warehouse, metadata are the data that define warehouse objects. Metadata are created for the data names and definitions of the given warehouse. Additional metadata are created and captured for time stamping any extracted data, the source of the extracted data, and missing fields that have been added by data cleaning or integration processes. A metadata repository should contain the following:
• A description of the structure of the data warehouse, which includes the warehouse schema, view, dimensions, hierarchies, and derived data definitions, as well as data mart locations and contents.
• Operational metadata, which include data lineage (history of migrated data and the sequence of transformations applied to it), currency of data (active, archived, or purged), and monitoring information (warehouse usage statistics, error reports, and audit trails).
• The algorithms used for summarization, which include measure and dimension definition algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and predefined queries and reports.
• The mapping from the operational environment to the data warehouse, which includes source databases and their contents, gateway descriptions, data partitions, data extraction, cleaning, transformation rules and defaults, data refresh and purging rules, and security (user authorization and access control).

• Data related to system performance, which include indices and profiles that improve data access and retrieval performance, in addition to rules for the timing and scheduling of refresh, update, and replication cycles.
• Business metadata, which include business terms and definitions, data ownership information, and charging policies.

1.9 Schema Design

Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases
The entity-relationship data model is commonly used in the design of relational databases, where a database schema consists of a set of entities and the relationships between them. Such a data model is appropriate for on-line transaction processing. A data warehouse, however, requires a concise, subject-oriented schema that facilitates on-line data analysis. The most popular data model for a data warehouse is a multidimensional model. Such a model can exist in the form of a star schema, a snowflake schema, or a fact constellation schema. Let's look at each of these schema types.
The most common modeling paradigm is the star schema, in which the data warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.

1. Star schema
A star schema for All Electronics sales is shown in Figure. Sales are considered along four dimensions, namely, time, item, branch, and location. The schema contains a central fact table for sales that contains keys to each of the four dimensions, along with two measures: dollars sold and units sold. To minimize the size of the fact table, dimension identifiers (such as time key and item key) are system-generated identifiers. Notice that in the star schema, each dimension is represented by only one table, and each table contains a set of attributes. For example, the location dimension table contains the attribute set {location key, street, city, province or state, country}. This constraint may introduce some redundancy.
For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia. Entries for such cities in the location dimension table will create redundancy among the attributes province or state and country, that is, (..., Vancouver, British Columbia, Canada) and (..., Victoria, British Columbia, Canada). Moreover, the attributes within a dimension table may form either a hierarchy (total order) or a lattice (partial order).

2. Snowflake schema
A snowflake schema for All Electronics sales is given in Figure. Here, the sales fact table is identical to that of the star schema. The main difference between the two schemas is in the definition of dimension tables.
The single dimension table for item in the star schema is normalized in the snowflake schema, resulting in new item and supplier tables. For example, the item dimension table now contains the attributes item key, item name, brand, type, and supplier key, where supplier key is linked to the supplier dimension table, containing supplier key and supplier type information. Similarly, the single dimension table for location in the star schema can be normalized into two new tables: location and city. The city key in the new location table links to the city dimension. Notice that further normalization can be performed on province or state and country in the snowflake schema.
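Since the figures themselves are not reproduced here, the following sketch shows how the star schema just described might be declared as relational tables. The exact column names are illustrative assumptions based on the attributes listed in the text, not a definitive schema.

import sqlite3

# A minimal star schema: one central sales fact table plus four
# dimension tables, keyed by system-generated identifiers.
STAR_SCHEMA = """
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, province_or_state TEXT, country TEXT);

CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,    -- measure
    units_sold   INTEGER  -- measure
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(STAR_SCHEMA)
print("star schema created")

In a snowflake schema, the item_dim and location_dim tables above would be further normalized, for example by moving the supplier attributes into a separate supplier table referenced through a supplier key.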

3. Fact constellation
A fact constellation schema is shown in Figure. This schema specifies two fact tables, sales and shipping. The sales table definition is identical to that of the star schema. The shipping table has five dimensions, or keys: item key, time key, shipper key, from location, and to location, and two measures: dollars cost and units shipped.
A fact constellation schema allows dimension tables to be shared between fact tables. For example, the dimension tables for time, item, and location are shared between both the sales and shipping fact tables.
In data warehousing, there is a distinction between a data warehouse and a data mart.
A data warehouse collects information about subjects that span the entire organization, such as customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide. For data warehouses, the fact constellation schema is commonly used, since it can model multiple, interrelated subjects. A data mart, on the other hand, is a department subset of the data warehouse that focuses on selected subjects, and thus its scope is department-wide. For data marts, the star or snowflake schema are commonly used, since both are geared toward modeling single subjects, although the star schema is more popular and efficient.

1.10 Online Analytical Processing

• OLAP is an approach to answering multi-dimensional analytical (MDA) queries swiftly.
• OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining.
• OLAP tools enable users to analyze multidimensional data interactively from multiple perspectives.

OLAP consists of three basic analytical operations (a short sketch of these operations appears at the end of this chapter):
• Consolidation (Roll-Up)
• Drill-Down
• Slicing and Dicing

• Consolidation involves the aggregation of data that can be accumulated and computed in one or more dimensions. For example, all sales offices are rolled up to the sales department or sales division to anticipate sales trends.
• The drill-down is a technique that allows users to navigate through the details. For instance, users can view the sales by individual products that make up a region's sales.
• Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data of the OLAP cube and view (dicing) the slices from different viewpoints.

Types of OLAP
There are the following types of OLAP:

1. Relational OLAP (ROLAP)
• ROLAP works directly with relational databases. The base data and the dimension tables are stored as relational tables and new tables are created to hold the aggregated information. It depends on a specialized schema design.
• This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement. ROLAP tools do not use pre-calculated data cubes but instead pose the query to the standard relational database and its tables in order to bring back the data required to answer the question.

• ROLAP tools feature the ability to ask any question because the methodology is not limited to the contents of a cube. ROLAP also has the ability to drill down to the lowest level of detail in the database.

2. Multidimensional OLAP (MOLAP)
• MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.
• MOLAP stores data in an optimized multi-dimensional array storage, rather than in a relational database. Therefore, it requires the pre-computation and storage of information in the cube – the operation known as processing.
• MOLAP tools generally utilize a pre-calculated data set referred to as a data cube. The data cube contains all the possible answers to a given range of questions.
• MOLAP tools have a very fast response time and the ability to quickly write back data into the data set.

3. Hybrid OLAP (HOLAP)
• There is no clear agreement across the industry as to what constitutes Hybrid OLAP, except that a database will divide data between relational and specialized storage.
• For example, for some vendors, a HOLAP database will use relational tables to hold the larger quantities of detailed data, and use specialized storage for at least some aspects of the smaller quantities of more-aggregate or less-detailed data.
• HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the capabilities of both approaches.
• HOLAP tools can utilize both pre-calculated cubes and relational data sources.
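To tie the three OLAP operations of Section 1.10 to something concrete, here is a minimal sketch using pandas. The sample sales data and column names are invented for illustration; in a ROLAP system, each of these operations would correspond to a GROUP BY or WHERE clause in SQL, as noted above.

import pandas as pd

# Invented sample data: one row per (region, product, quarter) sales figure.
sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "product": ["PC", "Camera", "PC", "Camera", "PC", "PC"],
    "quarter": ["Q1", "Q1", "Q1", "Q1", "Q2", "Q2"],
    "dollars_sold": [1000, 400, 800, 300, 1200, 900],
})

# Roll-up (consolidation): aggregate away the product dimension.
rollup = sales.groupby(["region", "quarter"])["dollars_sold"].sum()

# Drill-down: go back to a finer level of detail (per product).
drilldown = sales.groupby(["region", "quarter", "product"])["dollars_sold"].sum()

# Slice: fix one dimension to a single value (only Q1 data).
q1_slice = sales[sales["quarter"] == "Q1"]

# Dice: select a sub-cube on several dimensions at once.
dice = sales[(sales["quarter"] == "Q1") & (sales["region"] == "East")]

print(rollup)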
Chapter

2
Introduction to Data Mining
2.1 Fundamentals of Data Mining
Data mining refers to extracting or mining knowledge from large amounts of data. The term is actually a misnomer. Thus, data mining should have been more appropriately named "knowledge mining", which emphasizes mining knowledge from large amounts of data.
It is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

The key properties of data mining are:
• Automatic discovery of patterns
• Prediction of likely outcomes
• Creation of actionable information
• Focus on large datasets and databases

2.2 The Scope of Data Mining
Data mining derives its name from the similarities between searching for valuable business information in a large database, for example, finding linked products in gigabytes of store scanner data — and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find exactly where the value resides.
Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:

Automated prediction of trends and behaviours. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data — quickly.
A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.

Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.

2.3 Data Mining Functionalities
We have seen different types of databases and information sources where data mining can be applied. Now, let's explore the types of patterns that can be discovered through data mining. Data mining helps identify useful patterns in large datasets, which can be broadly classified into two categories:
• Descriptive data mining: focuses on understanding the general characteristics of data.
• Predictive data mining: analyzes current data to make future predictions.
Sometimes, users may not know what patterns are useful in their data. In such cases, they may need to search for multiple patterns at the same time. That's why data mining systems should be capable of finding different types of patterns to meet various needs. These systems should also allow users to refine their searches and explore patterns at different levels of detail. Since not all patterns apply to the entire dataset, data mining also includes a measure of how reliable or "trustworthy" each pattern is.
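One common way to quantify how "trustworthy" a discovered pattern is — for example, an association such as "customers who buy milk also buy bread" — is to compute its support and confidence. The tiny transaction list below is invented purely for illustration.

# Invented market-basket transactions, one set of items per customer visit.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]

def support(itemset):
    """Fraction of all transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"milk", "bread"}))        # 3/5 = 0.6
print(confidence({"milk"}, {"bread"}))   # 3/4 = 0.75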

Frequent Patterns in Data Mining
Data mining can uncover different types of patterns, including frequent patterns, associations, and correlations. Frequent patterns are data trends that appear often. Some common types include:
• Frequent itemsets – Groups of items that frequently appear together in transactions, such as milk and bread in a supermarket.
• Sequential patterns – Recurring sequences of events, like customers buying a PC first, then a digital camera, and later a memory card.
• Structured patterns – Complex patterns found in data structures like graphs, trees, or networks.

2.4 Data Mining Tasks

Data mining involves six common classes of tasks (a small clustering sketch follows this list):
• Anomaly detection (outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or data errors that require further investigation.
• Association rule learning (dependency modelling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
• Clustering – The task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
• Classification – The task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
• Regression – Attempts to find a function which models the data with the least error.
• Summarization – Providing a more compact representation of the data set, including visualization and report generation.
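As a toy illustration of the clustering task, the sketch below runs a tiny k-means-style loop on one-dimensional values. The data points and the choice of two clusters are arbitrary assumptions made only to show the idea of grouping "similar" records without predefined labels.

# Toy k-means on 1-D points: group values around two centers.
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9, 1.1, 8.1]
centers = [points[0], points[3]]          # naive initial guesses

for _ in range(10):                       # a few refinement iterations
    clusters = [[], []]
    for p in points:
        # assign each point to its nearest center
        nearest = min(range(2), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    # recompute each center as the mean of its assigned points
    centers = [sum(c) / len(c) for c in clusters]

print(centers)    # roughly [1.025, 8.075]
print(clusters)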

2.5 Architecture of Data Mining

A typical data mining system may have the following major components.

1. Knowledge Base
This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.
Knowledge such as user beliefs, which can be used to assess a pattern's interestingness based on its unexpectedness, may also be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).

2. Data Mining Engine
This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

3. Pattern Evaluation Module
This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns.
Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process so as to confine the search to only the interesting patterns.

4. User Interface
This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.

2.6 Classification of Data Mining Systems
Data mining is a field that brings together knowledge from various disciplines, such as databases, statistics, machine learning, visualization, and information science. Depending on the approach used, data mining may also involve techniques from other fields like neural networks, fuzzy logic, pattern recognition, image analysis, and high-performance computing. Additionally, data mining can be applied in various domains, including economics, business, bioinformatics, web technology, and psychology.
Because data mining includes so many different methods and applications, many different types of data mining systems exist. To help users choose the right system for their needs, data mining systems can be classified into different categories based on specific criteria.

1. Classification Based on the Type of Database Mined
Data mining systems can be classified based on the type of database they analyze. Since databases vary by structure and content, each type may require different data mining techniques.
• By data model: Relational, transactional, object-relational, or data warehouse mining systems.
• By data type: Spatial, time-series, text, multimedia, stream data, or web mining systems.

2. Classification Based on the Type of Knowledge Mined
Different data mining systems extract different kinds of knowledge, including:
• Basic functions: Characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier detection, and evolution analysis.
• Granularity levels: Systems can mine high-level generalized knowledge, low-level raw data insights, or multi-level knowledge combining both.
• Regular vs. irregular patterns: Some systems focus on common patterns, while others detect exceptions or outliers in the data.

3. Classification Based on the Techniques Used
Data mining systems can also be categorized by the techniques they use:
• User interaction level: Autonomous (fully automated), interactive (user-guided), or query-driven (based on specific user queries).
• Methods used: Database-oriented, machine learning, statistical analysis, neural networks, pattern recognition, and visualization.
• Hybrid systems: Many advanced systems combine multiple techniques for better results.

4. Classification Based on the Application Domain
Some data mining systems are designed for specific industries or fields, such as:
• Finance (e.g., stock market predictions, fraud detection)
• Telecommunications (e.g., call pattern analysis, network optimization)
• Healthcare and bioinformatics (e.g., DNA analysis, disease prediction)
• Retail and e-commerce (e.g., customer behavior analysis, recommendation systems)
• Cybersecurity (e.g., fraud detection, intrusion detection systems)
Since different applications may need customized approaches, a one-size-fits-all data mining system may not be effective. Instead, industry-specific data mining solutions are often preferred.
By classifying data mining systems based on these criteria, users can better understand which system suits their needs and how to apply data mining effectively in their respective fields.

2.7 Data Mining Process
Data mining is a process of discovering various models, summaries, and derived values from a given collection of data. The general experimental procedure adapted to data-mining problems involves the following steps:

1. State the problem and formulate the hypothesis
Most data-based modeling studies are performed in a particular application domain. Hence, domain-specific knowledge and experience are usually necessary in order to come up with a meaningful problem statement. Unfortunately, many application studies tend to focus on the data-mining technique at the expense of a clear problem statement. In this step, a modeler usually specifies a set of variables for the unknown dependency and, if possible, a general form of this dependency as an initial hypothesis. There may be several hypotheses formulated for a single problem at this stage.
The first step requires the combined expertise of an application domain and a data-mining model. In practice, it usually means a close interaction between the data-mining expert and the application expert. In successful data-mining applications, this cooperation does not stop in the initial phase; it continues during the entire data-mining process.

2. Collect the data
This step is concerned with how the data are generated and collected. In general, there are two distinct possibilities. The first is when the data-generation process is under the control of an expert (modeler): this approach is known as a designed experiment.
The second possibility is when the expert cannot influence the data-generation process: this is known as the observational approach. An observational setting, namely, random data generation, is assumed in most data-mining applications.
Typically, the sampling distribution is completely unknown after data are collected, or it is partially and implicitly given in the data-collection procedure. It is very important, however, to understand how data collection affects its theoretical distribution, since such a priori knowledge can be very useful for modeling and, later, for the final interpretation of results. Also, it is important to make sure that the data used for estimating a model and the data used later for testing and applying a model come from the same, unknown, sampling distribution. If this is not the case, the estimated model cannot be successfully used in a final application of the results.

2.8 Major Issues in Data Mining
• Mining different kinds of knowledge in databases – The needs of different users are not the same, and different users may be interested in different kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction – The data mining process needs to be interactive because it allows users to focus the search for patterns, providing and refining data mining requests based on returned results.
• Incorporation of background knowledge – Background knowledge can be used to guide the discovery process and to express the discovered patterns. It may be used to express the discovered patterns not only in concise terms but at multiple levels of abstraction.
• Data mining query languages and ad hoc data mining – A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results – Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable by the users.
• Handling noisy or incomplete data – Data cleaning methods are required that can handle noise and incomplete objects while mining the data regularities. If data cleaning methods are not available, the accuracy of the discovered patterns will be poor.
• Pattern evaluation – This refers to the interestingness of the problem. The patterns discovered should be interesting; patterns that merely represent common knowledge or lack novelty are not considered interesting.

• Efficiency and scalability of data mining algorithms – In order to effectively extract information from a huge amount of data in databases, data mining algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms – Factors such as the huge size of databases, wide distribution of data, and complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions which are further processed in parallel. Then the results from the partitions are merged. Incremental algorithms update the databases without having to mine the data again from scratch.

2.9 Data Integration
It combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files.
Data integration systems are formally defined as a triple <G, S, M>, where
G: the global schema,
S: the heterogeneous source of schemas,
M: the mapping between queries of the source and global schema.

Issues in Data Integration:
1. Schema integration and object matching: How can the data analyst or the computer be sure that customer id in one database and customer number in another refer to the same attribute?
2. Redundancy: An attribute (such as annual revenue, for instance) may be redundant if it can be derived from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
3. Detection and resolution of data value conflicts: For the same real-world entity, attribute values from different sources may differ.

2.10 Data Transformation
In data transformation, the data are transformed or consolidated into forms appropriate for mining.

Data transformation can involve the following:
• Smoothing, which works to remove noise from the data. Such techniques include binning, regression, and clustering.
• Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for analysis of the data at multiple granularities.
• Generalization of the data, where low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country.
• Normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
• Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.

2.11 Data Reduction

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.

Strategies for data reduction include the following:
• Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.

• Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
• Dimensionality reduction, where encoding mechanisms are used to reduce the dataset size.
• Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms.
• Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies. Discretization and concept hierarchy generation are powerful tools for data mining, in that they allow the mining of data at multiple levels of abstraction.

2.12 Data Preprocessing

In the observational setting, data are usually "collected" from the existing databases, data warehouses, and data marts. Data preprocessing usually includes at least two common tasks:

1. Outlier detection and removal
Outliers are unusual data values that are not consistent with most observations. Commonly, outliers result from measurement errors, coding and recording errors, and, sometimes, are natural, abnormal values. Such non-representative samples can seriously affect the model produced later. There are two strategies for dealing with outliers:
• Detect and eventually remove outliers as a part of the preprocessing phase, or
• Develop robust modeling methods that are insensitive to outliers.

2. Scaling, encoding, and selecting features
Data preprocessing includes several steps such as variable scaling and different types of encoding. For example, one feature with the range [0, 1] and the other with the range [−100, 1000] will not have the same weights in the applied technique; they will also influence the final data-mining results differently. Therefore, it is recommended to scale them and bring both features to the same weight for further analysis (a short scaling sketch is given at the end of this chapter).
Also, application-specific encoding methods usually achieve dimensionality reduction by providing a smaller number of informative features for subsequent data modeling. These two classes of preprocessing tasks are only illustrative examples of a large spectrum of preprocessing activities in a data-mining process.
Data-preprocessing steps should not be considered completely independent from other data-mining phases. In every iteration of the data-mining process, all activities, together, could define new and improved data sets for subsequent iterations.
Generally, a good preprocessing method provides an optimal representation for a data-mining technique by incorporating a priori knowledge in the form of application-specific scaling and encoding.

3. Estimate the model
The selection and implementation of the appropriate data-mining technique is the main task in this phase. This process is not straightforward; usually, in practice, the implementation is based on several models, and selecting the best one is an additional task.

4. Interpret the model and draw conclusions
In most cases, data-mining models should help in decision making. Hence, such models need to be interpretable in order to be useful because humans are not likely to base their decisions on complex "black-box" models. Note that the goals of accuracy of the model and accuracy of its interpretation are somewhat contradictory.
Usually, simple models are more interpretable, but they are also less accurate. Modern data-mining methods are expected to yield highly accurate results using high-dimensional models. The problem of interpreting these models, also very important, is considered a separate task, with specific techniques to validate the results.
A user does not want hundreds of pages of numeric results. He does not understand them; he cannot summarize, interpret, and use them for successful decision making.
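As a concrete version of the scaling step just described (and of the normalization transformation from Section 2.10), the sketch below applies min-max scaling, v' = (v - min) / (max - min), so that a feature originally spread over roughly [-100, 1000] ends up on the same [0.0, 1.0] scale as a feature that is already in [0, 1]. The sample values are invented for illustration.

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so their minimum maps to new_min and their maximum to new_max."""
    old_min, old_max = min(values), max(values)
    return [new_min + (v - old_min) * (new_max - new_min) / (old_max - old_min)
            for v in values]

# An invented feature originally spread over roughly [-100, 1000].
income_like = [-100, 250, 400, 1000]
print(min_max_scale(income_like))   # [0.0, 0.318..., 0.454..., 1.0]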
Chapter

3
Introduction to Data Mart
3.1 Data Mart
A data mart is a simple form of a data warehouse that is focused on a single subject (or functional area), such as sales, finance or marketing. Data marts are often built and controlled by a single department within an organization.
Given their single-subject focus, data marts usually draw data from only a few sources. The sources could be internal operational systems, a central data warehouse, or external data.

3.2 Types of Data Marts
A data mart is a smaller, focused part of a data warehouse designed to store and analyze data for a specific department or business function. There are two main types of data marts:

1. Dependent Data Marts
A dependent data mart gets its data from a central data warehouse that already exists. This means that the data is already collected, cleaned, and organized in the warehouse before it is sent to the data mart.
The process of moving data into a dependent data mart is simpler because the data is already formatted and summarized.
These data marts are typically used to improve performance, ensure data consistency, and reduce costs by making relevant data easily accessible for a specific department.

2. Independent Data Marts
An independent data mart is a standalone system that collects data directly from operational sources or external sources, without relying on a central data warehouse.
Since there is no data warehouse, the data must be collected, cleaned, and formatted from scratch, making the process more complex.
Independent data marts are often created when a quick solution is needed and there is no time or resources to build a full data warehouse first.

The Key Difference: ETL Process
The main difference between these two types of data marts is how the data is collected and prepared. This process is called ETL (Extraction, Transformation, and Loading):
• In a dependent data mart, the data is already prepared in the central warehouse, so the ETL process mostly involves selecting and copying the relevant data.
• In an independent data mart, all steps of ETL (extracting data, cleaning it, and formatting it) must be done separately, similar to how data is handled in a central warehouse.

Why Choose One Over the Other?
• Dependent data marts are preferred when a business already has a central data warehouse and wants to make data more accessible, improve efficiency, and reduce costs.
• Independent data marts are used when a company needs a quick solution without building a full warehouse, even though it requires more effort to manage data.
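To illustrate the difference just described, here is a minimal sketch of populating a dependent data mart: because the central warehouse already holds clean, integrated data, the "ETL" amounts to selecting the relevant subset and copying it. The table and column names are hypothetical; an independent data mart would instead have to extract and clean raw operational data first, similar in spirit to the ETL sketch in Chapter 1.

import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for the central warehouse: a cleaned, integrated sales table.
conn.executescript("""
CREATE TABLE warehouse_sales (sale_date TEXT, department TEXT, product TEXT, dollars_sold REAL);
INSERT INTO warehouse_sales VALUES
  ('2024-01-05', 'marketing', 'ad-campaign-A', 500.0),
  ('2024-01-06', 'finance',   'service-plan',  900.0),
  ('2024-01-07', 'marketing', 'ad-campaign-B', 750.0);
""")

# Dependent data mart for the marketing department: just select and copy
# the already-prepared subset that the department needs.
conn.execute("""
CREATE TABLE marketing_mart AS
SELECT sale_date, product, dollars_sold
FROM warehouse_sales
WHERE department = 'marketing'
""")

print(conn.execute("SELECT COUNT(*) FROM marketing_mart").fetchone()[0])  # 2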

3.3 Steps in Implementing a Data Mart
Simply stated, the major steps in implementing a data mart are to design the schema, construct the physical storage, populate the data mart with data from source systems, access it to make informed decisions, and manage it over time.

1. Designing
The design step is first in the data mart process. This step covers all of the tasks from initiating the request for a data mart through gathering information about requirements, and developing the logical and physical design of the data mart. The design step involves the following tasks:
• Gathering the business and technical requirements
• Identifying data sources
• Selecting the appropriate subset of data
• Designing the logical and physical structure of the data mart

2. Constructing
This step includes creating the physical database and the logical structures associated with the data mart to provide fast and efficient access to the data. This step involves the following tasks:
• Creating the physical database and storage structures, such as tablespaces, associated with the data mart
• Creating the schema objects, such as tables and indexes defined in the design step
• Determining how best to set up the tables and the access structures

3. Populating
The populating step covers all of the tasks related to getting the data from the source, cleaning it up, modifying it to the right format and level of detail, and moving it into the data mart. More formally stated, the populating step involves the following tasks:
• Mapping data sources to target data structures
• Extracting data
• Cleansing and transforming the data
• Loading data into the data mart
• Creating and storing metadata

4. Accessing
The accessing step involves putting the data to use: querying the data, analyzing it, creating reports, charts, and graphs, and publishing these. Typically, the end user uses a graphical front-end tool to submit queries to the database and display the results of the queries. The accessing step requires that you perform the following tasks:
• Set up an intermediate layer for the front-end tool to use. This layer, the meta layer, translates database structures and object names into business terms, so that the end user can interact with the data mart using terms that relate to the business function.
• Maintain and manage these business interfaces.
• Set up and manage database structures, like summarized tables, that help queries submitted through the front-end tool execute quickly and efficiently.

5. Managing
This step involves managing the data mart over its lifetime. In this step, you perform management tasks such as the following:
• Providing secure access to the data
• Managing the growth of the data
• Optimizing the system for better performance
• Ensuring the availability of data even with system failures

3.4 Data Mart Issues
1. Functionality – As data marts become more popular, their capabilities have expanded.
2. Size Management – As data marts grow, performance can slow down. To maintain efficiency, reducing their size is important.
3. Load Performance – Two key factors affect performance:
  o End-user response time – How quickly users can access data.
  o Data loading performance – Instead of updating the entire database, only the affected parts should be updated to improve efficiency.

3.5 Advantages and Disadvantages

Advantages of Data Marts
• Business-Specific – Designed to meet the unique needs of a particular business unit or department.
• Faster Query Performance – Since data marts contain smaller datasets than a full data warehouse, they can process queries more quickly.
• User-Friendly – Built with end-users in mind, making it easier for non-technical users to access and analyze relevant data.
• Quick Implementation – Data marts can be set up faster compared to large-scale data warehouses.
• Better Data Quality – By focusing on a specific subject area, data marts help maintain high-quality, well-organized data.
• Independent Operation – Each data mart can function separately, giving departments more control over their data and analysis.
Introduction of Data Mart| 3.3

• Data Silos – If not properly integrated, different data marts may lead to isolated data that is not easily shared across the organization.
• Inconsistency Issues – Independently developed data marts may use different data definitions and metrics, causing discrepancies.
• Limited Scope – While focusing on a specific business area is useful, it can also limit access to broader organizational insights.
• Data Redundancy – Storing similar data in multiple data marts can increase storage costs and inefficiencies.
• Integration Complexity – Connecting multiple data marts or linking them to a central data warehouse can be challenging.
• Duplicate Data – Some data marts may store information that already exists in the central data warehouse, leading to unnecessary duplication.

3.6 Applications of Data Mart
Data marts are specialized subsets of data warehouses that focus on the analytical needs of specific business units or departments. They help organizations analyze data efficiently and make informed decisions. Here are some common applications of data marts across different industries:
1. Sales and Marketing Analytics
• Understand customer behavior, preferences, and purchasing patterns.
• Measure the effectiveness of marketing campaigns.
• Monitor sales performance and forecast future trends.
2. Finance and Accounting
• Generate financial reports and conduct in-depth analysis.
• Support budgeting and financial forecasting.
• Ensure compliance with regulations and facilitate auditing.
3. Human Resources (HR)
• Analyze workforce performance and employee retention rates.
• Track recruitment, training, and employee satisfaction.
• Manage benefits and compensation plans effectively.
4. Supply Chain and Operations
• Optimize inventory management.
• Improve supply chain efficiency and reduce operational costs.
• Monitor production and distribution processes.
5. Customer Relationship Management (CRM)
• Segment and profile customers for targeted marketing.
• Track customer interactions and feedback.
• Analyze customer lifetime value and churn rates.
6. Healthcare Analytics
• Analyze patient outcomes and conduct medical research.
• Optimize healthcare costs and resource allocation.
• Monitor and improve patient care quality.
7. Retail Analytics
• Manage inventory and forecast demand.
• Analyze sales trends using point-of-sale (POS) data.
• Enhance customer experience and satisfaction.
8. Risk Management
• Detect and prevent fraud in financial transactions.
• Assess credit risks in banks and financial institutions.
• Generate compliance reports for regulatory authorities.
9. E-Commerce
• Analyze website traffic and user behavior.
• Provide personalized product recommendations.
• Optimize order fulfillment and logistics.
10. Education Analytics
• Assess student performance and predict academic outcomes.
• Manage enrolment and admissions processes.
• Allocate educational resources effectively.
11. Government and Public Sector
• Use analytics for public safety and law enforcement.
• Analyze government budgets and expenditures.
• Evaluate the effectiveness of social programs.

12. Manufacturing Analytics


• Monitor product quality and detect defects.
• Track equipment maintenance and performance.
• Improve production efficiency and yield.
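The populating step described in Section 3.3 is essentially a small ETL job. The sketch below is a minimal, illustrative example of that step for a sales and marketing data mart: the table layout, column names, and aggregation level are assumptions chosen for the illustration, not a prescribed design.

```python
import pandas as pd

# Extract: raw sales records as they might arrive from an operational source.
source = pd.DataFrame({
    "order_id":   [101, 102, 103, 104],
    "region":     ["North", "North", "South", None],
    "product":    ["Laptop", "Printer", "Laptop", "Laptop"],
    "amount":     [55000, 8000, 52000, 51000],
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-07"],
})

# Cleanse and transform: drop incomplete rows and standardise the date type.
clean = (
    source.dropna(subset=["region"])
          .assign(order_date=lambda df: pd.to_datetime(df["order_date"]))
)

# Load: aggregate to the level of detail the data mart needs
# (sales by region and product), ready to be written to the mart's table.
sales_mart = (
    clean.groupby(["region", "product"], as_index=False)
         .agg(total_sales=("amount", "sum"), orders=("order_id", "count"))
)
print(sales_mart)
```

In a real project the summary table would then be loaded into the data mart's database, and the mapping and transformation rules used above would be recorded as metadata, as listed among the populating tasks.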
Chapter

4
Association Rule Mining

4.1 Association Rule Mining
Association rule mining is a technique used in data analysis to find relationships between items in large databases. It helps businesses discover patterns, such as which products are often bought together.
Why is it Important?
It helps in making better decisions, such as:
• Recommending products to customers (e.g., Amazon, Flipkart).
• Improving store layout in supermarkets.
• Detecting fraud in financial transactions.
How Does it Work?
1. Understanding Transactions and Items:
A database consists of multiple transactions. Each transaction contains a set of items. Example: In a supermarket, a transaction is a customer's purchase, and items are the products they buy.
2. Defining Rules:
A rule looks like this: X → Y, i.e. if a customer buys X, they are likely to buy Y as well. Example: If people buy bread and butter, they often buy milk too.
Example: Supermarket Transactions

Transaction ID   Milk   Bread   Butter   Beer
1                1      1       0        0
2                0      0       1        0
3                0      0       0        1
4                1      1       1        0
5                0      1       0        0

Here, 1 means the item was bought, and 0 means it was not.
Rule: {Bread, Butter} → {Milk}
Meaning: If a customer buys bread and butter, they are likely to buy milk.

4.2 Important Concepts of Association Rule Mining
When analyzing relationships between items in a dataset, we use certain key measures to determine how strong and useful these relationships are. Here are four important concepts in association rule mining:
1. Support
Support measures how often an item or itemset appears in the dataset. It is calculated as:
Support = (Number of transactions containing the itemset) / (Total number of transactions)
Example: If there are 5 transactions in a supermarket database, and "Bread & Butter" appears in 1 transaction, then:
Support(Bread, Butter) = 1/5 = 20%
Higher support means the itemset is more common in the dataset.
2. Confidence
Confidence measures how often the rule X → Y is correct. It is calculated as:
Confidence(X → Y) = Support of (X and Y) / Support of X
Example: If 100% of the customers who buy Bread & Butter also buy Milk, then:
Confidence(Bread, Butter → Milk) = 100%
Higher confidence means the rule is more reliable.
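The two measures defined so far can be checked directly against the supermarket table in Section 4.1. The following short Python sketch (illustrative only) recomputes the support of {Bread, Butter} and the confidence of {Bread, Butter} → {Milk}:

```python
# Transactions from the supermarket table in Section 4.1 (1 = bought).
transactions = [
    {"Milk", "Bread"},            # 1
    {"Butter"},                   # 2
    {"Beer"},                     # 3
    {"Milk", "Bread", "Butter"},  # 4
    {"Bread"},                    # 5
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Support of the whole rule divided by support of the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"Bread", "Butter"}))               # 0.2 -> 20%
print(confidence({"Bread", "Butter"}, {"Milk"}))  # 1.0 -> 100%
```

The printed values, 0.2 and 1.0, match the 20% support and 100% confidence worked out above.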

3. Lift
Lift shows how much stronger the relationship between X and Y is compared to what we would expect by chance. It is calculated as:
Lift(X → Y) = Confidence(X → Y) / Support(Y)
Example: If the lift is greater than 1, it means X and Y are positively correlated (they occur together more often than random chance). If the lift is equal to 1, there is no correlation between X and Y. If the lift is less than 1, it means X and Y are negatively correlated (they rarely appear together).
4. Conviction
Conviction measures how often the rule X → Y would be wrong if X and Y were unrelated. It is calculated as:
Conviction(X → Y) = (1 − Support(Y)) / (1 − Confidence(X → Y))
Example: If a rule has a conviction of 2, it means the rule is twice as likely to be correct as incorrect. Higher conviction values indicate stronger rules.

4.3 Frequent Pattern Mining
Frequent pattern mining can be classified in various ways, based on the following criteria:
1. Based on the completeness of patterns to be mined
We can mine the complete set of frequent itemsets, the closed frequent itemsets, and the maximal frequent itemsets, given a minimum support threshold. We can also mine constrained frequent itemsets, approximate frequent itemsets, near-match frequent itemsets, top-k frequent itemsets and so on.
2. Based on the levels of abstraction involved in the rule set
Some methods for association rule mining can find rules at differing levels of abstraction. For example, suppose that a set of association rules mined includes the following rules, where X is a variable representing a customer:
buys(X, "computer") => buys(X, "HP printer")         (1)
buys(X, "laptop computer") => buys(X, "HP printer")  (2)
In rules (1) and (2), the items bought are referenced at different levels of abstraction (e.g., "computer" is a higher-level abstraction of "laptop computer").
3. Based on the number of data dimensions involved in the rule
If the items or attributes in an association rule reference only one dimension, then it is a single-dimensional association rule:
buys(X, "computer") => buys(X, "antivirus software")
If a rule references two or more dimensions, such as the dimensions age, income, and buys, then it is a multidimensional association rule. The following rule is an example of a multidimensional rule:
age(X, "30, 31, ..., 39") ^ income(X, "42K, ..., 48K") => buys(X, "high resolution TV")
4. Based on the types of values handled in the rule
If a rule involves associations between the presence or absence of items, it is a Boolean association rule. If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule.
5. Based on the kinds of rules to be mined
Frequent pattern analysis can generate various kinds of rules and other interesting relationships. Association rule mining can generate a large number of rules, many of which are redundant or do not indicate a correlation relationship among itemsets. The discovered associations can be further analyzed to uncover statistical correlations, leading to correlation rules.
6. Based on the kinds of patterns to be mined
Many kinds of frequent patterns can be mined from different kinds of data sets. Sequential pattern mining searches for frequent subsequences in a sequence data set, where a sequence records an ordering of events.

For example, with sequential pattern mining, we can study the order in which items are frequently purchased. For instance, customers may tend to first buy a PC, followed by a digital camera, and then a memory card.
Structured pattern mining searches for frequent substructures in a structured data set. Single items are the simplest form of structure. Each element of an itemset may contain a subsequence, a subtree, and so on. Therefore, structured pattern mining can be considered the most general form of frequent pattern mining.

4.4 Apriori Algorithm
The Apriori Algorithm is a data mining technique used to discover frequent itemsets and generate association rules from large databases. It helps businesses and researchers analyze patterns in transactional data, such as market basket analysis.
Why is Apriori Important?
• Helps in market basket analysis (e.g., finding which products are often bought together).
• Used in recommendation systems (e.g., Amazon's "Customers who bought this also bought that").
• Helps businesses make data-driven decisions about promotions and inventory management.
Key Concepts
1. Itemset: A set of items in a transaction (e.g., {Milk, Bread, Butter}).
2. Frequent Itemset: An itemset that appears in at least a minimum number of transactions (called minimum support).
3. Support: The proportion of transactions that contain an itemset.
Support(X) = (Number of transactions containing X) / (Total number of transactions)
4. Confidence: The probability of item B being purchased given that item A is purchased.
Confidence(A ⇒ B) = Support(A ∪ B) / Support(A)
5. Lift: Measures how much more likely A and B occur together than if they were independent.
Lift(A ⇒ B) = Confidence(A ⇒ B) / Support(B)

Apriori Algorithm
1. Find frequent 1-itemsets (L1)
• Count the occurrences of each item in the transactions.
• Remove items that do not meet the minimum support threshold.
2. Find frequent 2-itemsets (L2)
• Combine frequent 1-itemsets to form candidate 2-itemsets.
• Count the occurrences of each candidate itemset.
• Remove those that do not meet the minimum support.
3. Find frequent 3-itemsets (L3), and so on...
• Use frequent (k-1)-itemsets to generate candidate k-itemsets.
• Count occurrences and apply the minimum support filter.
4. Stop when no more frequent itemsets can be found.
5. Generate association rules
o For each frequent itemset, generate rules using confidence and lift.

Algorithm (Step-by-Step)
Input:
• Database D containing transactions.
• Min_Support (minimum support threshold).
• Min_Confidence (minimum confidence threshold).

Output:
• Frequent itemsets.
• Association rules.
Steps:
1. Find all frequent 1-itemsets:
o Scan the database.
o Count occurrences of each item.
o Remove items below Min_Support.
2. Generate candidate k-itemsets (Ck) using frequent (k-1)-itemsets (Lk-1).
3. Prune candidates (remove those with infrequent subsets).
4. Count the occurrences of remaining candidate itemsets in the database.
5. Filter out itemsets that do not meet Min_Support.
6. Repeat steps 2-5 until no new frequent itemsets are found.
7. Generate association rules from frequent itemsets:
o Calculate confidence for each rule.
o Keep rules that satisfy Min_Confidence.

Example
Consider a small database of transactions:

Transaction ID   Items Bought
T1               Milk, Bread, Butter
T2               Bread, Butter
T3               Milk, Bread
T4               Milk, Butter
T5               Bread, Butter, Milk

Step 1: Find Frequent 1-Itemsets

Item     Count   Support (%)
Milk     4       80%
Bread    4       80%
Butter   4       80%

All items meet Min_Support (assume Min_Support = 40%, i.e., an itemset must appear in at least 2 of the 5 transactions).

Step 2: Find Frequent 2-Itemsets

Itemset           Count   Support (%)
{Milk, Bread}     3       60%
{Milk, Butter}    3       60%
{Bread, Butter}   3       60%

All 2-itemsets meet Min_Support.

Step 3: Find Frequent 3-Itemsets

Itemset                  Count   Support (%)
{Milk, Bread, Butter}    2       40%

Meets Min_Support → keep it.

Step 4: Generate Association Rules

Rule                      Confidence (%)   Lift
{Milk, Bread} → Butter    67%              0.83
{Milk, Butter} → Bread    67%              0.83
{Bread, Butter} → Milk    67%              0.83

Since all rules meet Min_Confidence (assume Min_Confidence = 60%), they are retained. Note that the lift values fall slightly below 1 here: each single item already appears in 80% of these five transactions, so the rules pass on confidence even though the items are not positively correlated.

Advantages of Apriori Algorithm
• Simple and easy to understand.
• Useful for market basket analysis and recommendation systems.
• Can be applied to large databases.
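Before turning to its drawbacks, here is a compact, self-contained Python sketch of the level-wise procedure described above. It is illustrative rather than optimized; run on the five-transaction example with Min_Support = 40% and Min_Confidence = 60%, it reproduces the frequent itemsets of the worked example and prints every rule that clears both thresholds (including the pairwise rules):

```python
from itertools import combinations

transactions = [
    {"Milk", "Bread", "Butter"},  # T1
    {"Bread", "Butter"},          # T2
    {"Milk", "Bread"},            # T3
    {"Milk", "Butter"},           # T4
    {"Bread", "Butter", "Milk"},  # T5
]

def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset (level-wise search)."""
    n = len(transactions)
    support = lambda items: sum(1 for t in transactions if items <= t) / n

    # L1: frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]): support(frozenset([i]))
                for i in items if support(frozenset([i])) >= min_support}
    result, k = dict(frequent), 2
    while frequent:
        # Candidate k-itemsets: unions of frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        frequent = {c: support(c) for c in candidates if support(c) >= min_support}
        result.update(frequent)
        k += 1
    return result

def rules(frequent, min_confidence):
    """Generate rules X -> Y from each frequent itemset of size >= 2."""
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[antecedent]
                if conf >= min_confidence:
                    yield set(antecedent), set(itemset - antecedent), round(conf, 2)

freq = apriori(transactions, min_support=0.4)
for lhs, rhs, conf in rules(freq, min_confidence=0.6):
    print(lhs, "->", rhs, "confidence:", conf)
```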

Disadvantages of Apriori Algorithm
• Requires multiple database scans, making it slow for large datasets.
• Can generate many candidate itemsets, leading to high memory usage.
Real-Life Applications of Apriori
• Retail & E-commerce: Identifying products often bought together.
• Healthcare: Analyzing patient symptoms and disease correlations.
• Banking: Detecting fraudulent transactions.
• Online Recommendations: Amazon, Netflix, and YouTube use association rules to recommend products and content.

4.5 Frequent Pattern Growth Algorithm
FP-Growth addresses the inefficiencies of the Apriori algorithm by using a more efficient approach to mine frequent itemsets, eliminating the need for multiple database scans and speeding up the overall process.
Discussing the Drawbacks of the Apriori Algorithm
The two primary drawbacks of the Apriori Algorithm are:
1. At each step, candidate sets have to be built.
2. To build the candidate sets, the algorithm has to repeatedly scan the database.
These two properties inevitably make the algorithm slower. To overcome these redundant steps, a new association-rule mining algorithm was developed, named the Frequent Pattern Growth Algorithm. It overcomes the disadvantages of the Apriori algorithm by storing all the transactions in a Trie data structure.
Understanding the Frequent Pattern Growth Algorithm
The FP-Growth algorithm is a method used to find frequent patterns in large datasets. It is faster and more efficient than the Apriori algorithm because it avoids repeatedly scanning the entire database.
Here's how it works in simple terms:
1. Data Compression: First, FP-Growth compresses the dataset into a smaller structure called the Frequent Pattern Tree (FP-Tree). This tree stores information about itemsets (collections of items) and their frequencies, without needing to generate candidate sets like Apriori does.
2. Mining the Tree: The algorithm then examines this tree to identify patterns that appear frequently, based on a minimum support threshold. It does this by breaking the tree down into smaller "conditional" trees for each item, making the process more efficient.
3. Generating Patterns: Once the tree is built and analyzed, the algorithm generates the frequent patterns (itemsets) and the rules that describe relationships between items.
Let's understand this with the help of a real-life analogy:
Imagine you're organizing a large family reunion, and you want to know which food items are most popular among the guests. Instead of asking everyone individually and writing down their answers one by one, you decide to use a more efficient method.
Step 1: Create a List of Items People Bring
Instead of asking every person what they like to eat, you ask them to write down what foods they brought. You then create a list of all the food items brought to the party. This is like scanning the entire database once to get an overview and insights of the data.
Step 2: Group Similar Items Together
Now, you group the food items that were brought most frequently. You might end up with groups like "Pizza" (which was brought by 10 people), "Cake" (by 4 people), "Pasta" (by 3 people), and others. This is similar to creating the Frequent Pattern Tree (FP-Tree) in FP-Growth, where you only keep track of the items that are common enough.

Step 3: Look for Hidden Patterns
Next, instead of going back to every person to ask again about their preferences, you simply look at your list of items and patterns. You notice that people who brought pizza also often brought pasta, and those who brought cake also brought pasta. These hidden relationships (e.g., pizza + pasta, cake + pasta) are like the "frequent patterns" you find in FP-Growth.
Step 4: Simplify the Process
With FP-Growth, instead of scanning the entire party list multiple times to look for combinations of items, you've condensed all the information into a smaller, more manageable tree structure. You can now quickly see the most common combinations, like "pizza and pasta" or "cake and pasta", without the need to revisit every single detail.

Working of the FP-Growth Algorithm
Let's jump to the usage of the FP-Growth Algorithm and see how it works with real-life data. Consider the following data:

Transaction ID   Items
T1               {E, K, M, N, O, Y}
T2               {D, E, K, N, O, Y}
T3               {A, E, K, M}
T4               {C, K, M, U, Y}
T5               {C, E, I, K, O, O}

The above-given data is a hypothetical dataset of transactions, with each letter representing an item. The frequency of each individual item is computed:

Item   Frequency
A      1
C      2
D      1
E      4
I      1
K      5
M      3
N      2
O      4
U      1
Y      3

Let the minimum support be 3. A Frequent Pattern set is built which will contain all the elements whose frequency is greater than or equal to the minimum support. These elements are stored in descending order of their respective frequencies. After insertion of the relevant items, the set L looks like this:

L = {K : 5, E : 4, M : 3, O : 4, Y : 3}

Now, for each transaction, the respective Ordered-Item set is built. It is done by iterating the Frequent Pattern set and checking if the current item is contained in the transaction in question. If the current item is contained, the item is inserted in the Ordered-Item set for the current transaction. The following table is built for all the transactions:

Transaction ID   Items                 Ordered-Item-Set
T1               {E, K, M, N, O, Y}    {K, E, M, O, Y}
T2               {D, E, K, N, O, Y}    {K, E, O, Y}
T3               {A, E, K, M}          {K, E, M}
T4               {C, K, M, U, Y}       {K, M, Y}
T5               {C, E, I, K, O, O}    {K, E, O}

Now, all the Ordered-Item sets are inserted into a Trie data structure.

a) Inserting the set {K, E, M, O, Y}:
Here, all the items are simply linked one after the other in the order of occurrence in the set, and the support count for each item is initialized as 1.

b) Inserting the set {K, E, O, Y}:
Till the insertion of the elements K and E, the support count is simply increased by 1. On inserting O we can see that there is no direct link between E and O, therefore a new node for the item O is initialized with the support count as 1 and item E is linked to this new node.

On inserting Y, we first initialize a new node for the item Y with support count as 1 and link the new node of O with the new node of Y.

c) Inserting the set {K, E, M}:
Here simply the support count of each element is increased by 1.

d) Inserting the set {K, M, Y}:
Similar to step b), first the support count of K is increased, then new nodes for M and Y are initialized and linked accordingly.

e) Inserting the set {K, E, O}:
Here simply the support counts of the respective elements are increased. Note that the support count of the new node of item O is increased.

Now, for each item, the Conditional Pattern Base is computed, which is the set of path labels of all the paths that lead to any node of the given item in the frequent-pattern tree. Note that the items in the below table are arranged in the ascending order of their frequencies.

Item   Conditional Pattern Base
Y      {{K, E, M, O : 1}, {K, E, O : 1}, {K, M : 1}}
O      {{K, E, M : 1}, {K, E : 2}}
M      {{K, E : 2}, {K : 1}}
E      {{K : 4}}

Now, for each item, the Conditional Frequent Pattern Tree is built. It is done by taking the set of elements that is common to all the paths in the Conditional Pattern Base of that item and calculating its support count by summing the support counts of all the paths in the Conditional Pattern Base.

Item   Conditional Frequent Pattern Tree
Y      {K : 3}
O      {K, E : 3}
M      {K : 3}
E      {K : 4}

From the Conditional Frequent Pattern Tree, the Frequent Pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set with the corresponding item, as given in the below table.

Item   Frequent Patterns Generated
Y      {K, Y : 3}
O      {K, O : 3}, {E, O : 3}, {K, E, O : 3}
M      {K, M : 3}
E      {K, E : 4}

For each row, two types of association rules can be inferred; for example, for the first row, which contains the element Y, the rules K → Y and Y → K can be inferred. To determine the valid rule, the confidence of both rules is calculated, and the one with confidence greater than or equal to the minimum confidence value is retained.

4.6 Multilevel Association Rule
Association rules generated from mining data at different levels of abstraction are called multilevel (or multiple-level) association rules. Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework. Rules at a high concept level may add to common sense, while rules at a low concept level may not always be useful.

Using uniform minimum support for all levels:
• When a uniform minimum support threshold is used, the search procedure is simplified.
• The method is also simple, in that users are required to specify only a single minimum support threshold.
• The same minimum support threshold is used when mining at each level of abstraction (for example, for mining from "computer" down to "laptop computer"). Both "computer" and "laptop computer" may be found to be frequent, while a less common lower-level item is not.

Need for multilevel association rules:
• Sometimes, at a low data level, the data does not show any significant pattern, but there is useful information hiding behind it.
• The aim is to find the hidden information in or between levels of abstraction.

Approaches to multilevel association rule mining:
1. Uniform Support (using a uniform minimum support for all levels)
2. Reduced Support (using a reduced minimum support at lower levels)
3. Group-based Support (using item or group based support)
Let's discuss them one by one.

1. Uniform Support
When a uniform minimum support threshold is used, the search procedure is simplified. The method is also simple in that users are required to specify only a single minimum support threshold. An optimization technique can be adopted, based on the knowledge that an ancestor is a superset of its descendants: the search avoids examining itemsets containing any item whose ancestors do not have minimum support. The uniform support approach, however, has some difficulties. It is unlikely that items at lower levels of abstraction will occur as frequently as those at higher levels of abstraction. If the minimum support threshold is set too high, it could miss several meaningful associations occurring at low abstraction levels. This provides the motivation for the following approach.

2. Reduced Support
For mining multilevel associations with reduced support, there are various alternative search strategies, as follows.
• Level-by-level independent: This is a full-breadth search, where no background knowledge of frequent itemsets is used for pruning. Each node is examined, regardless of whether or not its parent node is found to be frequent.
• Level-cross filtering by single item: An item at the i-th level is examined if and only if its parent node at the (i−1)-th level is frequent. In other words, we investigate a more specific association from a more general one. If a node is frequent, its children will be examined; otherwise, its descendants are pruned from the search.

• Level-cross filtering by k-itemset: A k-itemset at the i-th level is examined if and only if its corresponding parent k-itemset at the (i−1)-th level is frequent.

3. Group-based Support
The group-wise threshold values for support and confidence are input by the user or an expert. The group is selected based on a product price or an itemset, because the expert often has insight as to which groups are more important than others.
Example: Experts are interested in the purchase patterns of laptops or clothes in the electronics and non-electronics categories. Therefore, a low support threshold is set for these groups to give attention to these items' purchase patterns.
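How the three approaches differ can be made concrete with a small sketch. The hierarchy, level numbers, and threshold values below are invented for illustration only; the point is that uniform support uses one threshold everywhere, reduced support lowers it at deeper levels, and group-based support lets an expert attach thresholds to chosen groups of items:

```python
# Toy concept hierarchy: level 1 is the most general, level 3 the most specific.
hierarchy = {
    "computer": 1,
    "laptop computer": 2,
    "gaming laptop": 3,
}

# 1. Uniform support: one threshold for every level.
uniform_min_sup = {item: 0.05 for item in hierarchy}

# 2. Reduced support: the threshold shrinks at lower (more specific) levels.
level_min_sup = {1: 0.05, 2: 0.03, 3: 0.01}
reduced_min_sup = {item: level_min_sup[level] for item, level in hierarchy.items()}

# 3. Group-based support: an expert pins thresholds to groups of interest.
group_min_sup = {"laptop computer": 0.01}   # a group the expert wants to watch closely
default_min_sup = 0.05
group_based = {item: group_min_sup.get(item, default_min_sup) for item in hierarchy}

print(uniform_min_sup, reduced_min_sup, group_based, sep="\n")
```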
Chapter

5
Clustering
5.1 Clustering
A cluster is a group of similar objects. When we analyze data, we divide it into different groups based on similarities. Each group is called a cluster.
Clustering is a technique used to group similar data together without any predefined labels. It is called unsupervised learning because we don't have labeled categories in advance. Instead, we find patterns based on similarities in the data.
A good clustering method aims for:
• High similarity within a cluster – Objects in the same group should be very similar.
• Low similarity between clusters – Different groups should have distinct data.

5.2 Cluster Analysis
Cluster analysis, or clustering, is the process of dividing a dataset into meaningful groups. Each group contains similar objects but is different from other groups.
Example: Imagine we have a dataset with different types of vehicles (cars, buses, bicycles, etc.). Since clustering is unsupervised learning, we don't assign labels like "Car" or "Bike." Instead, the algorithm groups similar vehicles based on shared features.
Clustering is also used in data segmentation (dividing large datasets into smaller groups). It can also help in outlier detection, which finds unusual data points, such as fraudulent credit card transactions.

5.3 Applications of Clustering
• Fraud detection (identifying unusual credit card transactions).
• Customer segmentation (grouping customers based on purchasing behavior).
• Medical diagnosis (grouping patients with similar symptoms).
• Image analysis (grouping similar images).

5.4 Requirements for Clustering Algorithms
1. Scalability: The algorithm should work well for large datasets (millions of data points). If we use only a small sample, results may be biased.
2. Handling Different Data Types: Some clustering algorithms work best with numbers, but data can also be text, categories, or images.
3. Identifying Different Shapes of Clusters: Some clusters are circular, but real-world clusters can be irregular (like the spread of a wildfire).
4. Noisy Data Handling: Data often has errors or missing values. Good clustering algorithms should be robust to such issues.
5. Handling New Data Incrementally: If new data arrives, the algorithm should update clusters without starting from scratch.
6. High-Dimensional Data Support: Some datasets have thousands of attributes (like text documents with thousands of keywords). Clustering should work efficiently even in such cases.
7. Flexibility with Constraints: Sometimes, clustering must follow specific rules (e.g., placing ATMs in locations with high customer demand).
8. Easy to Understand and Use: The results should be clear and useful for decision-making.

5.5 Working of Clustering
How Clustering Works:
• Some algorithms divide data into a fixed number of non-overlapping groups (partitioning).
• Some allow overlapping clusters (objects can belong to multiple groups).
• Clustering uses different distance measures (like Euclidean distance) to find similar data points.
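The last bullet above is easy to make concrete. The sketch below (the two centres and the data point are made-up values) uses Euclidean distance to decide which cluster centre a point is most similar to — the same comparison that the partitioning methods in the next section repeat for every object:

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

centroids = {"A": (2.0, 3.0), "B": (8.0, 1.0)}
point = (3.0, 4.0)

# Distance from the point to each cluster centre.
distances = {name: dist(point, c) for name, c in centroids.items()}
nearest = min(distances, key=distances.get)

print(distances)  # {'A': 1.41..., 'B': 5.83...}
print(nearest)    # 'A' - the point is grouped with the closer centre
```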

5.6 Clustering Methods
Clustering is a technique used to group similar data points together. There are different clustering methods, each with its own advantages and use cases. Below are the five major types of clustering methods explained in detail.

1. Partitioning Methods
Partitioning methods divide a dataset into k clusters, where k is a predefined number chosen by the user. Each data point belongs to exactly one cluster, and the method aims to optimize the grouping so that similar points stay together while different ones remain apart. The algorithm first creates an initial random partition and then iteratively refines it by shifting data points between clusters to improve the overall grouping.

i. K-Means (A Centroid-Based Technique)
The most commonly used partitioning method is K-Means, which minimizes the distance between points within the same cluster.
The K-Means algorithm takes the input parameter K from the user and partitions the dataset containing N objects into K clusters so that the resulting similarity among the data objects inside a group (intra-cluster) is high, but the similarity of data objects with data objects from outside the cluster is low (inter-cluster). The similarity of a cluster is determined with respect to the mean value of the cluster. It is a type of square-error algorithm.
At the start, K objects are chosen randomly from the dataset, each of which represents a cluster mean (centre). The remaining data objects are assigned to the nearest cluster based on their distance from the cluster mean. The new mean of each cluster is then calculated with the added data objects.
Algorithm:
Input:
K: The number of clusters into which the dataset has to be divided
D: A dataset containing N objects
Output:
A dataset of K clusters
Method:
1. Randomly assign K objects from the dataset (D) as cluster centres (C).
2. (Re)assign each object to the cluster whose mean value it is most similar to.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated values.
4. Repeat Step 2 until no change occurs.
Flowchart:

Example: Suppose we want to group the visitors to a website using just their age, as follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66
Initial Cluster:
K = 2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]

Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
There is no change between Iteration 3 and Iteration 4, so we stop. Therefore, we obtain (16–29) and (36–66) as the 2 clusters produced by the K-Means algorithm.

ii. K-Medoids
Another variant, K-Medoids, is similar but chooses actual data points as cluster centers instead of averages, making it more resistant to outliers. These methods work well for spherical clusters but struggle with irregular shapes or clusters of different densities.

2. Hierarchical Methods
Hierarchical clustering builds a tree-like structure of clusters. This method does not require the number of clusters to be specified in advance. There are two main types of hierarchical clustering: Agglomerative (bottom-up) and Divisive (top-down).
The Agglomerative approach starts with each data point as an individual cluster and merges the closest ones step by step until only one cluster remains. The Divisive approach, on the other hand, starts with all data points in a single large cluster and then splits them into smaller groups iteratively.
A key drawback of hierarchical clustering is that once a merge or split occurs, it cannot be undone, which may lead to suboptimal results. However, it is useful for finding relationships between data and can be combined with other clustering methods for better results. Algorithms such as Chameleon and BIRCH improve hierarchical clustering by carefully analyzing how objects are linked.

3. Density-Based Methods
Unlike partitioning and hierarchical methods that rely on distance measurements, density-based methods group data points based on how dense their neighborhoods are. These methods continue expanding a cluster as long as there are enough nearby points. This approach allows density-based clustering to find clusters of arbitrary shapes, unlike K-Means, which assumes clusters are spherical. The most well-known algorithm in this category is DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which defines clusters using two parameters: the minimum number of points per cluster and the radius of the neighborhood. DBSCAN can also identify outliers, which are data points that do not belong to any cluster.
Other density-based algorithms include OPTICS (which improves on DBSCAN for varying densities) and DENCLUE (which uses density functions for clustering). These methods are useful in applications like geospatial analysis, image segmentation, and fraud detection, where clusters may have irregular shapes.

4. Grid-Based Methods
Grid-based methods divide the data space into small grid cells and perform clustering based on these cells rather than individual data points. This significantly speeds up clustering, as the number of grid cells is usually much smaller than the number of data points. The clustering operations are applied to the grid structure, making them efficient and scalable. One well-known algorithm, STING (Statistical Information Grid), organizes data into hierarchical grids, making queries and clustering extremely fast. Another grid-based algorithm, WaveCluster, uses wavelet transformations to analyze data at different levels of detail.
Grid-based methods are ideal for large datasets where processing speed is a major concern, such as in spatial data analysis and real-time applications. However, they may not be as effective in detecting complex relationships between data points.

5. Model-Based Methods
Model-based clustering assumes that each cluster follows a specific statistical model and tries to find the best fit for the data. These methods automatically determine the optimal number of clusters using statistical techniques, making them useful when the correct number of clusters is unknown.
One common model-based approach is Gaussian Mixture Models (GMMs), which assumes data points are generated from a mixture of multiple normal distributions. Another approach, the Expectation-Maximization (EM) algorithm, iteratively refines clusters by adjusting probability distributions. Model-based clustering is effective in applications where clusters have complex distributions, such as genetics, speech recognition, and market segmentation.
However, these methods can be computationally expensive and require a good understanding of probability models to work effectively.
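As a closing illustration of the partitioning approach from Section 5.6, here is a plain Python sketch of the K-Means loop applied to the website-visitor ages from the worked example (K = 2, starting means 16 and 22 as above). It is a teaching sketch, not a production implementation — for instance, it assumes no cluster ever becomes empty, which holds for this data:

```python
ages = [16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
centres = [16.0, 22.0]  # the two starting means chosen in the example

while True:
    # Step 2: assign each age to the cluster whose mean is closest.
    clusters = [[], []]
    for age in ages:
        nearest = min(range(len(centres)), key=lambda i: abs(age - centres[i]))
        clusters[nearest].append(age)
    # Step 3: recalculate the mean of each cluster.
    new_centres = [sum(c) / len(c) for c in clusters]
    # Step 4: stop when the means no longer change.
    if new_centres == centres:
        break
    centres = new_centres

print(centres)   # approximately [20.5, 48.89]
print(clusters)  # [[16, ..., 29], [36, ..., 66]]
```

Running it ends with means of about 20.5 and 48.89 and the same two clusters, (16–29) and (36–66), that were found by hand in the example.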
