Data Warehousing and Mining - Notes
1
Introduction to Data Warehouse
1.1 Definition of Data Warehouse
A data warehouse is a central storage system that collects and manages large amounts of data from different sources. It is designed for fast data retrieval, analysis, and reporting. Organizations use data warehouses to make informed decisions because they provide a single, reliable source of information.
A data warehouse works by combining and organizing data from various sources into a consistent and structured format. This ensures that the data is accurate and easy to analyze. A well-managed data warehouse helps users understand trends and patterns in specific areas.
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.

1. Subject-Oriented
A data warehouse focuses on specific topics or themes rather than daily operations. It organizes data based on a particular subject, such as sales or marketing. Unlike operational databases, which handle real-time transactions, a data warehouse helps analyze data and make better business decisions. It filters out unnecessary details and provides a clear and concise view of relevant information.

2. Integrated
Integration means that a data warehouse combines data from multiple sources into a standardized format. This ensures that all data follows the same naming conventions, formats, and coding standards. For example, a data warehouse may collect information from mainframe systems, relational databases, and cloud storage. By integrating data from different sources, businesses can perform effective analysis and decision-making.

3. Time-Variant
Time-variant means that a data warehouse keeps historical data, allowing users to analyze trends and changes over time. Data is recorded at different intervals (daily, weekly, monthly, or yearly), making it possible to study historical trends. By keeping this history, a data warehouse allows organizations to track changes and make long-term strategic decisions. Once stored, data cannot be modified—it remains as a historical record for future analysis.

4. Non-Volatile
Non-volatile means that data in a data warehouse is never
deleted or changed after being stored. When new data is
added, the existing data remains unchanged.
This feature ensures that businesses always have access to
past information for historical analysis and trend
identification. The data warehouse does not require
transaction processing, concurrency control, or frequent
updates like operational databases.
There are two key types of data operations in a data warehouse:
• Data Loading – Storing new data into the system.
• Data Access – Retrieving stored data for analysis.

1.3 Functions of a Data Warehouse
1. Data Consolidation
Merging data from different sources into a single central repository ensures accuracy and consistency.
2. Data Cleaning
Identifying and removing errors, duplicates, and inconsistencies before storing data improves data quality and reliability.
3. Data Integration
Combining structured and unstructured data from multiple sources into a unified format allows businesses to perform better analysis.
4. Data Storage
Data warehouses can store large amounts of historical data, making it easy to access and analyze when needed.
5. Data Transformation
Data is converted into a standardized and structured format by removing duplicates and unnecessary details.
7. Data Reporting
Data warehouses provide dashboards and reports that help organizations track performance and identify trends.
8. Data Mining
Advanced techniques such as machine learning and pattern recognition are used to discover useful insights from large datasets.
9. Performance Optimization
A data warehouse is optimized for fast querying and efficient analysis, ensuring quick access to data when needed.

1.4 Purpose of a Data Warehouse
A data warehouse is designed to collect, store, and analyze large amounts of data from various sources. Its main purpose is to help businesses make better, data-driven decisions by providing a centralized, structured, and reliable source of information. Below are some key purposes of a data warehouse:
1. Centralized Data Storage
A data warehouse consolidates data from multiple sources (such as databases, cloud storage, and applications) into a single location. This eliminates data silos and makes it easier to access and manage information.
2. Improved Decision-Making
By providing historical and current data in an organized manner, a data warehouse enables businesses to analyze trends, predict future outcomes, and make informed strategic decisions.
4. Data Integration from Multiple Sources
A data warehouse combines and standardizes data from various sources, ensuring consistency in naming conventions, formats, and codes. This makes analysis more accurate and reliable.
5. Historical Data Analysis
Since data warehouses store long-term historical data, organizations can track performance over time, identify patterns, and compare past and present trends.
6. Business Intelligence and Reporting
A data warehouse supports business intelligence (BI) tools, dashboards, and reporting systems, helping organizations generate insights through charts, graphs, and summary reports.
7. Enhanced Data Quality and Accuracy
By implementing data cleaning and transformation processes, a data warehouse ensures that stored data is consistent, error-free, and reliable for analysis.
8. Scalability and Performance Optimization
Data warehouses are designed to handle large volumes of data efficiently, making them suitable for growing businesses with increasing data needs.
9. Security and Data Governance
A data warehouse provides controlled access to data, ensuring that only authorized users can retrieve and analyze specific datasets. This enhances data security and compliance with regulations.

1.5 Data Warehouse Design Process
A data warehouse can be built using a top-down approach, a bottom-up approach, or a combination of both.
1. Top-down approach
The top-down approach starts with the overall design and planning. It is useful where the technology is mature and well known, and where the business problems that must be solved are clear and well understood.
2. Bottom-up approach
The bottom-up approach starts with experiments and prototypes. This is useful in the early stage of business modeling and technology development. It allows an organization to move forward at considerably less expense and to evaluate the benefits of the technology before making significant commitments.
3. Hybrid approach
In the hybrid or combined approach, an organization can exploit the planned and strategic nature of the top-down approach while retaining the rapid implementation and opportunistic application of the bottom-up approach.

The warehouse design process consists of the following steps (a small sketch of these four decisions follows this list):
• Choose a business process to model, for example, orders, invoices, shipments, inventory, account administration, sales, or the general ledger. If the business process is organizational and involves multiple complex object collections, a data warehouse model should be followed. However, if the process is departmental and focuses on the analysis of one kind of business process, a data mart model should be chosen.
• Choose the grain of the business process. The grain is the fundamental, atomic level of data to be represented in the fact table for this process, for example, individual transactions, individual daily snapshots, and so on.
• Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status.
• Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like dollars sold and units sold.
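As an illustration only, the four design decisions above can be written down as a small design specification before any tables are built. The process, grain, dimension, and measure names below are hypothetical examples for a sales process, not prescribed by the notes.

# A minimal sketch of a dimensional design specification (assumed names).
sales_design = {
    "business_process": "sales",                           # step 1: process to model
    "grain": "one row per individual sales transaction",   # step 2: atomic level of the fact table
    "dimensions": ["time", "item", "branch", "location"],  # step 3: dimensions on each fact record
    "measures": ["dollars_sold", "units_sold"],            # step 4: additive numeric measures
}

if __name__ == "__main__":
    for decision, choice in sales_design.items():
        print(f"{decision}: {choice}")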
To organize and process data efficiently, a Three-Tier Architecture is used, which divides the system into three layers:
1. Bottom Tier – Data Storage and Processing
2. Middle Tier – Data Analysis (OLAP Engine)
3. Top Tier – User Interface and Reporting
Each of these layers plays a key role in storing, processing, and analyzing data to make it useful for businesses.

1. Bottom Tier – Data Storage and Processing
Data is fed into the bottom tier from operational databases and external sources through application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server. Examples of gateways include ODBC (Open Database Connection) and OLEDB (Open Linking and Embedding for Databases) by Microsoft and JDBC (Java Database Connection). This tier also contains a metadata repository, which stores information about the data warehouse and its contents.

The ETL process (Extract, Transform, Load) is used to:
• Extract data from different sources like databases, files, or web services.
• Transform data by cleaning, filtering, and organizing it to match business needs.
• Load the processed data into a structured storage system for easy access.
Popular ETL Tools:
• IBM Infosphere
• Informatica
• Microsoft SSIS
• SnapLogic
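To make the Extract–Transform–Load steps above concrete, here is a minimal, self-contained sketch in Python. The file name, column names, and table name are hypothetical examples, not part of the original notes.

import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a CSV source file (hypothetical file name).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: clean and standardize the raw rows to match business needs.
    cleaned = []
    for row in rows:
        if not row.get("customer_id"):          # drop incomplete records
            continue
        cleaned.append({
            "customer_id": row["customer_id"].strip(),
            "amount": round(float(row["amount"]), 2),   # standardize numeric format
        })
    return cleaned

def load(rows, db_path="warehouse.db"):
    # Load: store the processed rows in a structured table for easy access.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (:customer_id, :amount)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("daily_sales.csv")))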
2. Middle Tier – Data Analysis (OLAP Engine)
The middle tier is an OLAP engine, which can be implemented in different ways:
• ROLAP (Relational OLAP) – Works with relational databases, great for handling large datasets.
• MOLAP (Multidimensional OLAP) – Uses a special data storage format that makes queries much faster.
• HOLAP (Hybrid OLAP) – A mix of ROLAP and MOLAP, balancing flexibility and speed.

Common Challenges & Solutions
• Challenge: Data processing takes too long. Solution: Use query optimization techniques like indexing.
• Challenge: Delays in updating data. Solution: Use real-time processing to keep data fresh.
• Challenge: Merging data from different sources. Solution: Use tools like Talend or Informatica to standardize data formats.

3. Top Tier – User Interface and Reporting
The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on). This is the front-end layer where users view and analyze data through dashboards, reports, and charts.

Popular BI Tools Used for Reporting:
• IBM Cognos – Advanced reporting and analytics.
• Microsoft BI – Works well with Excel and other Microsoft tools.
• SAP BW – Great for businesses using SAP, and integrates with other data sources easily with existing systems.

1.7 Data Warehouse Models
A data warehouse is a system that stores and manages large amounts of data from many sources. To handle data efficiently, three main types of data warehouse models exist:
1. Enterprise Data Warehouse
2. Data Mart
3. Virtual Warehouse
Each model serves a different purpose depending on the scale, functionality, and usage within an organization. Let's explore them in detail.

1. Enterprise Data Warehouse (EDW)
An Enterprise Data Warehouse (EDW) is a large-scale system designed to store and manage data for an entire organization. It integrates data from multiple sources, making it useful for company-wide decision-making.
Key Characteristics:
• Centralized Repository: Stores all data across the organization, providing a single source of truth for different departments.
• Comprehensive Data Integration: Collects and processes detailed and summarized data from multiple operational systems and external sources (e.g., customer records, financial data, sales transactions).
• Large Storage Capacity: Can store data ranging from gigabytes to terabytes or even petabytes, depending on the organization's needs.
• Covers multiple subject areas, such as finance, marketing, sales, HR, and supply chain.

2. Data Mart
A data mart is a smaller, specialized version of a data warehouse, designed to serve the needs of a specific department or user group. It extracts a subset of the data from an enterprise warehouse or other sources and focuses on a particular business function.
1.8 Metadata Repository
Among other things, the metadata repository contains:
• Data related to system performance, which include indices and profiles that improve data access and retrieval performance, in addition to rules for the timing and scheduling of refresh, update, and replication cycles.
• Business metadata, which include business terms and definitions, data ownership information, and charging policies.

1.9 Schema Design
Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases. The entity-relationship data model is commonly used in the design of relational databases, where a database schema consists of a set of entities and the relationships between them. Such a data model is appropriate for on-line transaction processing. A data warehouse, however, requires a concise, subject-oriented schema that facilitates on-line data analysis. The most popular data model for a data warehouse is a multidimensional model. Such a model can exist in the form of a star schema, a snowflake schema, or a fact constellation schema. Let's look at each of these schema types. Star schema: The most common modeling paradigm is the star schema, in which the data warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.

1. Star schema
A star schema for All Electronics sales is shown in Figure. Sales are considered along four dimensions, namely, time, item, branch, and location. The schema contains a central fact table for sales that contains keys to each of the four dimensions, along with two measures: dollars sold and units sold. To minimize the size of the fact table, dimension identifiers (such as time key and item key) are system-generated identifiers. Notice that in the star schema, each dimension is represented by only one table, and each table contains a set of attributes. For example, the location dimension table contains the attribute set {location key, street, city, province or state, country}. This constraint may introduce some redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia. Entries for such cities in the location dimension table will create redundancy among the attributes province or state and country, that is, (..., Vancouver, British Columbia, Canada) and (..., Victoria, British Columbia, Canada). Moreover, the attributes within a dimension table may form either a hierarchy (total order) or a lattice (partial order).

2. Snowflake schema
A snowflake schema for All Electronics sales is given in Figure. Here, the sales fact table is identical to that of the star schema in Figure. The main difference between the two schemas is in the definition of dimension tables. The single dimension table for item in the star schema is normalized in the snowflake schema, resulting in new item and supplier tables. For example, the item dimension table now contains the attributes item key, item name, brand, type, and supplier key, where supplier key is linked to the supplier dimension table, containing supplier key and supplier type information. Similarly, the single dimension table for location in the star schema can be normalized into two new tables: location and city. The city key in the new location table links to the city dimension. Notice that further normalization can be performed on province or state and country in the snowflake schema.
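Assuming the AllElectronics star schema above is implemented in a relational database, the tables could be declared roughly as follows. This is only an illustrative sketch: the fact table keys and the two measures come from the text, while the non-key columns of the time and branch dimensions are assumptions, since the original figure is not reproduced in these notes.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Dimension tables: one table per dimension, keyed by a system-generated identifier.
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           province_or_state TEXT, country TEXT);

-- Central fact table: foreign keys to the four dimensions plus the two measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")
print("star schema created")
con.close()

In a snowflake schema, the item_dim table above would instead carry a supplier_key referencing a separate supplier table, as described in the text.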
2
Introduction to Data Mining
2.1 Fundamentals of Data Mining
Data mining refers to extracting or mining knowledge from large amounts of data. The term is actually a misnomer; data mining would more appropriately have been named knowledge mining, which emphasizes mining knowledge from large amounts of data.
It is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
The key properties of data mining are:
• Automatic discovery of patterns
• Prediction of likely outcomes
• Creation of actionable information
• Focus on large datasets and databases

2.2 The Scope of Data Mining
Data mining derives its name from the similarities between searching for valuable business information in a large database, for example, finding linked products in gigabytes of store scanner data — and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:

Automated prediction of trends and behaviours. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data — quickly. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.

Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.

2.3 Data Mining Functionalities
We have seen different types of databases and information sources where data mining can be applied. Now, let's explore the types of patterns that can be discovered through data mining. Data mining helps identify useful patterns in large datasets, which can be broadly classified into two categories:
• Descriptive data mining: focuses on understanding the general characteristics of data.
• Predictive data mining: analyzes current data to make future predictions.
Sometimes, users may not know what patterns are useful in their data. In such cases, they may need to search for multiple patterns at the same time. That's why data mining systems should be capable of finding different types of patterns to meet various needs. These systems should also allow users to refine their searches and explore patterns at different levels of detail. Since not all patterns apply to the entire dataset, data mining also includes a measure of how reliable or "trustworthy" each pattern is.
The evaluation of pattern interestingness is pushed as deep as possible into the mining process so as to confine the search to only the interesting patterns.
4. User interface
This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.

2.6 Classification of Data Mining Systems
Data mining is a field that brings together knowledge from various disciplines, such as databases, statistics, machine learning, visualization, and information science. Depending on the approach used, data mining may also involve techniques from other fields like neural networks, fuzzy logic, pattern recognition, image analysis, and high-performance computing. Additionally, data mining can be applied in various domains, including economics, business, bioinformatics, web technology, and psychology.
Because data mining includes so many different methods and applications, many different types of data mining systems exist. To help users choose the right system for their needs, data mining systems can be classified into different categories based on specific criteria.
1. Classification Based on the Type of Database Mined
Data mining systems can be classified based on the type of database they analyze. Since databases vary by structure and content, each type may require different data mining techniques.
• By data model: Relational, transactional, object-relational, or data warehouse mining systems.
• By data type: Spatial, time-series, text, multimedia, stream data, or web mining systems.
2. Classification Based on the Type of Knowledge Mined
Different data mining systems extract different kinds of knowledge, including:
• Basic functions: Characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier detection, and evolution analysis.
• Granularity levels: Systems can mine high-level generalized knowledge, low-level raw data insights, or multi-level knowledge combining both.
• Regular vs. irregular patterns: Some systems focus on common patterns, while others detect exceptions or outliers in the data.
3. Classification Based on the Techniques Used
Data mining systems can also be categorized by the techniques they use:
• User interaction level: Autonomous (fully automated), interactive (user-guided), or query-driven (based on specific user queries).
• Methods used: Database-oriented, machine learning, statistical analysis, neural networks, pattern recognition, and visualization.
• Hybrid systems: Many advanced systems combine multiple techniques for better results.
4. Classification Based on the Application Domain
Some data mining systems are designed for specific industries or fields, such as:
• Finance (e.g., stock market predictions, fraud detection)
• Telecommunications (e.g., call pattern analysis, network optimization)
• Healthcare and bioinformatics (e.g., DNA analysis, disease prediction)
• Retail and e-commerce (e.g., customer behavior analysis, recommendation systems)
• Cybersecurity (e.g., fraud detection, intrusion detection systems)
Since different applications may need customized approaches, a one-size-fits-all data mining system may not be effective. Instead, industry-specific data mining solutions are often preferred.
By classifying data mining systems based on these criteria, users can better understand which system suits their needs and how to apply data mining effectively in their respective fields.

2.7 Data Mining Process:
Data Mining is a process of discovering various models, summaries, and derived values from a given collection of data.
The general experimental procedure adapted to data-mining problems involves the following steps:
1. State the problem and formulate the hypothesis
Most data-based modeling studies are performed in a particular application domain. Hence, domain-specific knowledge and experience are usually necessary in order to come up with a meaningful problem statement. Unfortunately, many application studies tend to focus on the data-mining technique at the expense of a clear problem statement. In this step, a modeler usually specifies a set of variables for the unknown dependency and, if possible, a general form of this dependency as an initial hypothesis. There may be several hypotheses formulated for a single problem at this stage. The first step requires the combined expertise of an application domain and a data-mining model. In practice, it usually means a close interaction between the data-mining expert and the application expert. In successful data-mining applications, this cooperation does not stop in the initial phase; it continues during the entire data-mining process.
2. Collect the data
This step is concerned with how the data are generated and collected. In general, there are two distinct possibilities. The first is when the data-generation process is under the control of an expert (modeler): this approach is known as a designed experiment. The second possibility is when the expert cannot influence the data-generation process: this is known as the observational approach. An observational setting, namely, random data generation, is assumed in most data-mining applications. Typically, the sampling distribution is completely unknown after data are collected, or it is partially and implicitly given in the data-collection procedure. It is very important, however, to understand how data collection affects its theoretical distribution, since such a priori knowledge can be very useful for modeling and, later, for the final interpretation of results. Also, it is important to make sure that the data used for estimating a model and the data used later for testing and applying a model come from the same, unknown, sampling distribution. If this is not the case, the estimated model cannot be successfully used in a final application of the results.

2.8 Major Issues in Data Mining
Mining different kinds of knowledge in databases: The needs of different users are not the same, and different users may be interested in different kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction: The data mining process needs to be interactive because it allows users to focus the search for patterns, providing and refining data mining requests based on returned results.
Incorporation of background knowledge: Background knowledge can be used to guide the discovery process and to express the discovered patterns. It may be used to express the discovered patterns not only in concise terms but at multiple levels of abstraction.
Data mining query languages and ad hoc data mining: A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results: Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable by the users.
Handling noisy or incomplete data: Data cleaning methods are required that can handle noise and incomplete objects while mining the data regularities. If such cleaning methods are not available, the accuracy of the discovered patterns will be poor.
Pattern evaluation: This refers to the interestingness of the discovered patterns. The patterns discovered should be interesting; patterns that merely represent common knowledge or lack novelty are of little use.
Efficiency and scalability of data mining algorithms: In order to effectively extract the information from a huge amount of data in databases, data mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms: Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms incorporate database updates without having to mine the entire data again from scratch.

2.9 Data Integration
Data integration combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files.
A data integration system is formally defined as a triple <G, S, M>, where
G: the global schema,
S: the heterogeneous set of source schemas,
M: the mapping between queries over the source and global schemas.
One issue in data integration is the detection and resolution of data value conflicts: for the same real-world entity, attribute values from different sources may differ.

2.10 Data Transformation
In data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following:
Smoothing, which works to remove noise from the data. Such techniques include binning, regression, and clustering.
Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for analysis of the data at multiple granularities.
Generalization of the data, where low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country.
Normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0 (a short sketch of min-max normalization appears at the end of this section).
Attribute construction (or feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.
Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
Dimensionality reduction, where encoding mechanisms are used to reduce the dataset size.
Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms.
Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies. Discretization and concept hierarchy generation are powerful tools for data mining, in that they allow the mining of data at multiple levels of abstraction.
Features measured on very different scales will dominate a distance-based mining technique; they will also influence the final data-mining results differently. Therefore, it is recommended to scale them and bring both features to the same weight for further analysis. Also, application-specific encoding methods usually achieve dimensionality reduction by providing a smaller number of informative features for subsequent data modeling. These two classes of preprocessing tasks are only illustrative examples of a large spectrum of preprocessing activities in a data-mining process.
Data-preprocessing steps should not be considered completely independent from other data-mining phases. In every iteration of the data-mining process, all activities, together, could define new and improved data sets for subsequent iterations. Generally, a good preprocessing method provides an optimal representation for a data-mining technique by incorporating a priori knowledge in the form of application-specific scaling and encoding.
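The min-max rescaling mentioned in the Normalization item above can be written in a few lines. This is a generic illustration; the attribute name and values are made up for the example.

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Rescale values linearly so they fall within [new_min, new_max].
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:                       # all values identical: map everything to new_min
        return [new_min for _ in values]
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]

if __name__ == "__main__":
    incomes = [12000, 35000, 47000, 98000]         # hypothetical attribute values
    print(min_max_normalize(incomes))              # scaled to the range 0.0 to 1.0
    print(min_max_normalize(incomes, -1.0, 1.0))   # scaled to the range -1.0 to 1.0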
3
Introduction to Data Mart
3.1 Data Mart
A data mart is a simple form of a data warehouse that is focused on a single subject (or functional area), such as sales, finance or marketing. Data marts are often built and controlled by a single department within an organization.
Given their single-subject focus, data marts usually draw data from only a few sources. The sources could be internal operational systems, a central data warehouse, or external data.

3.2 Types of Data Marts
A data mart is designed to store and analyze data for a specific department or business function. There are two main types of data marts.
1. Dependent Data Marts
A dependent data mart gets its data from a central data warehouse that already exists. This means that the data is already collected, cleaned, and organized in the warehouse before it is sent to the data mart. The process of moving data into a dependent data mart is simpler because the data is already formatted and summarized. These data marts are typically used to improve performance, ensure data consistency, and reduce costs by making relevant data easily accessible for a specific department.
2. Independent Data Marts
An independent data mart is a standalone system that collects data directly from operational sources or external sources, without relying on a central data warehouse. Since there is no data warehouse, the data must be collected, cleaned, and formatted from scratch, making the process more complex. Independent data marts are often created when a quick solution is needed and there is no time or resources to build a full data warehouse first.

The Key Difference: ETL Process
The main difference between these two types of data marts is how the data is collected and prepared. This process is called ETL (Extraction, Transformation, and Loading):
• In a dependent data mart, the data is already prepared in the central warehouse, so the ETL process mostly involves selecting and copying the relevant data.
• In an independent data mart, all steps of ETL (extracting data, cleaning it, and formatting it) must be done separately, similar to how data is handled in a full data warehouse.

Why Choose One Over the Other?
• Dependent data marts are preferred when a business wants to make data more accessible, improve efficiency, and reduce costs.
• Independent data marts are used when a company needs a quick solution without building a full warehouse, even though it requires more effort to manage data.

3.3 Steps in Implementing a Data Mart
Simply stated, the major steps in implementing a data mart are to design the schema, construct the physical storage, populate the data mart with data from source systems, access it to make informed decisions, and manage it over time.
1. Designing
The design step is first in the data mart process. This step covers all of the tasks from initiating the request for a data mart through gathering information about requirements, and developing the logical and physical design of the data mart. The design step involves the following tasks (a small sketch of the data-selection task follows this list):
• Gathering the business and technical requirements
• Identifying data sources
• Selecting the appropriate subset of data
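As an illustration of the last task, a dependent data mart can be populated by selecting and copying only the relevant subset of a warehouse table. The table names, columns, and rows below are hypothetical.

import sqlite3

con = sqlite3.connect(":memory:")
# A tiny stand-in for a central warehouse table (hypothetical schema and rows).
con.execute("CREATE TABLE sales_fact (department TEXT, product TEXT, amount REAL)")
con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", [
    ("marketing", "ad campaign", 500.0),
    ("sales", "laptop", 1200.0),
    ("sales", "printer", 150.0),
])

# Dependent data mart: select and copy only the subset relevant to one department.
con.execute("CREATE TABLE sales_mart AS "
            "SELECT product, amount FROM sales_fact WHERE department = 'sales'")
print(con.execute("SELECT * FROM sales_mart").fetchall())
con.close()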
4. Accessing
The accessing step involves putting the data to use: querying the data, analyzing it, creating reports, charts, and graphs, and publishing these. Typically, the end user uses a graphical front-end tool to submit queries to the database and display the results of the queries. The accessing step requires that you perform the following tasks:
• Set up an intermediate layer for the front-end tool to use. This layer, the meta layer, translates database structures and object names into business terms, so that the end user can interact with the data mart using terms that relate to the business function.
• Maintain and manage these business interfaces.
• Set up and manage database structures, like summarized tables, that help queries submitted through the front-end tool execute quickly and efficiently.

Advantages of Data Marts
• Focused on the needs of a particular business unit or department.
• Faster Query Performance – Since data marts contain smaller datasets than a full data warehouse, they can process queries more quickly.
• User-Friendly – Built with end-users in mind, making it easier for non-technical users to access and analyze relevant data.
• Quick Implementation – Data marts can be set up faster compared to large-scale data warehouses.
• Better Data Quality – By focusing on a specific subject area, data marts help maintain high-quality, well-organized data.
• Independent Operation – Each data mart can function separately, giving departments more control over their data and analysis.

Disadvantages of Data Marts
• Data Silos – If not properly integrated, different data marts may lead to isolated data that is not easily shared across the organization.
• Inconsistency Issues – Independently developed data marts may use different data definitions and metrics, causing discrepancies.
• Limited Scope – While focusing on a specific business area is useful, it can also limit access to broader organizational insights.
• Data Redundancy – Storing similar data in multiple data marts can increase storage costs and inefficiencies.
• Integration Complexity – Connecting multiple data marts or linking them to a central data warehouse can be challenging.
• Duplicate Data – Some data marts may store information that already exists in the central data warehouse, leading to unnecessary duplication.

Data marts are commonly used in areas such as the following:
3. Human Resources
• Manage benefits and compensation plans effectively.
4. Supply Chain and Operations
• Optimize inventory management.
• Improve supply chain efficiency and reduce operational costs.
• Monitor production and distribution processes.
5. Customer Relationship Management (CRM)
• Segment and profile customers for targeted marketing.
• Track customer interactions and feedback.
• Analyze customer lifetime value and churn rates.
6. Healthcare Analytics
• Analyze patient outcomes and conduct medical research.
• Optimize healthcare costs and resource allocation.
• Monitor and improve patient care quality.
7. Retail Analytics
4
Association Rule Mining
4.1 Association Rule Mining
Association rule mining is a technique used in data analysis to find relationships between items in large databases. It helps businesses discover patterns, such as which products are often bought together. It helps in making better decisions, such as:
• Recommending products to customers (e.g., Amazon, Flipkart).
• Detecting fraud in financial transactions.

How Does it Work?
1. Understanding Transactions and Items:
Example: Supermarket Transactions
Transaction ID   Milk   Bread   Butter   Beer
1                1      1       0        0
Rule: {Bread, Butter} → {Milk}
Meaning: If a customer buys bread and butter, they are likely to buy milk.

When analyzing relationships between items in a dataset, we use certain key measures to determine how strong and useful these relationships are. Here are the important concepts in association rule mining:
Support measures how often an item or itemset appears in the dataset. It is calculated as:
Support = (Number of transactions containing the itemset) / (Total number of transactions)
Confidence measures how often the rule X → Y is correct. It is calculated as:
Confidence(X → Y) = Support(X and Y) / Support(X)
Lift shows how much stronger the relationship between X and Y is compared to what we would expect by chance. It is calculated as:
Lift(X → Y) = Confidence(X → Y) / Support(Y)
Example: If the lift is greater than 1, it means X and Y are positively correlated (they occur together more often than random chance). If the lift is equal to 1, there is no correlation between X and Y. If the lift is less than 1, it means X and Y are negatively correlated (they rarely appear together).

Association rules can be classified in several ways:
1. Based on the completeness of patterns to be mined
We can mine the complete set of frequent itemsets, the closed frequent itemsets, and the maximal frequent itemsets, given a minimum support threshold. We can also mine constrained frequent itemsets, approximate frequent itemsets, near-match frequent itemsets, top-k frequent itemsets and so on.
2. Based on levels of abstraction involved in the rule set
buys(X, "computer") => buys(X, "HP printer")           (1)
buys(X, "laptop computer") => buys(X, "HP printer")    (2)
In rules (1) and (2), the items bought are referenced at different levels of abstraction (e.g., "computer" is a higher-level abstraction of "laptop computer").
3. Based on the number of data dimensions involved in the rule
If the items or attributes in an association rule reference only one dimension, then it is a single-dimensional association rule, for example:
buys(X, "computer") => buys(X, "antivirus software")
5. Based on the kinds of rules to be mined
Frequent pattern analysis can generate various kinds of rules and other interesting relationships. For example, with sequential pattern mining, we can study the order in which items are frequently purchased. For instance, customers may tend to first buy a PC, followed by a digital camera, and then a memory card. Association rule mining can generate a large number of rules, many of which are redundant or do not indicate a correlation relationship among itemsets. The discovered associations can be further analyzed to uncover statistical correlations, leading to correlation rules.
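Here is a small sketch of how the support, confidence, and lift measures described above can be computed for the rule {Bread, Butter} → {Milk}. The transaction list is made up for illustration, since the full example table is not reproduced in these notes.

transactions = [
    {"Milk", "Bread"},
    {"Milk", "Bread", "Butter"},
    {"Bread", "Butter"},
    {"Milk", "Bread", "Butter"},
    {"Milk", "Beer"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

X, Y = {"Bread", "Butter"}, {"Milk"}
supp_xy = support(X | Y)
confidence = supp_xy / support(X)   # Support(X and Y) / Support(X)
lift = confidence / support(Y)      # Confidence(X -> Y) / Support(Y)
print(f"support={supp_xy:.2f} confidence={confidence:.2f} lift={lift:.2f}")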
The Apriori Algorithm is a data mining technique used to discover frequent itemsets and generate association rules from large databases. It helps businesses and researchers analyze patterns in transactional data, such as market basket analysis.

Why is Apriori Important?
• Helps in market basket analysis (e.g., finding which products are often bought together).
• Used in recommendation systems (e.g., Amazon's "Customers who bought this also bought that").
• Helps businesses make decisions about promotions and inventory management.

Key Concepts
1. Itemset: A set of items in a transaction (e.g., {Milk, Bread, Butter}).
2. Frequent Itemset: An itemset that appears in at least a minimum number of transactions (minimum support).
3. Support: The fraction of transactions that contain a given itemset.
4. Confidence: The probability of item B being purchased given that item A is purchased.
Confidence(A ⇒ B) = Support(A ∪ B) / Support(A)

How Apriori Works
1. Find frequent 1-itemsets (L1)
• Scan the database and count the occurrences of each item.
• Remove items that do not meet the minimum support threshold.
2. Find frequent 2-itemsets (L2)
• Combine frequent 1-itemsets to form candidate 2-itemsets.
• Count the occurrences of each candidate itemset.
• Remove those that do not meet the minimum support.
3. Find frequent 3-itemsets (L3), and so on...
• Use frequent (k-1)-itemsets to generate candidate k-itemsets.
• Count occurrences and apply the minimum support filter.
4. Stop when no more frequent itemsets can be found.
5. Generate association rules
o For each frequent itemset, generate rules that satisfy the minimum confidence.

Algorithm (Step-by-Step)
1. Find frequent 1-itemsets:
o Scan the database.
o Count occurrences of each item.
o Remove items below Min_Support.
2. Generate candidate k-itemsets (Ck) using frequent (k-1)-itemsets (Lk-1).
5. Filter out itemsets that do not meet Min_Support.
6. Repeat steps 2-5 until no new frequent itemsets are found.

Example
Step 1: Find Frequent 1-Itemsets
All items meet the Min_Support (assume Min_Support = 50%).
Step 2: Find Frequent 2-Itemsets
Itemset            Count   Support (%)
{Milk, Bread}      3       60%
Step 3: Find Frequent 3-Itemsets
Itemset                 Count   Support (%)
{Milk, Bread, Butter}   3       60%
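Below is a compact sketch of the Apriori candidate-generation loop described above. The five transactions are hypothetical, chosen so that {Milk, Bread} and {Milk, Bread, Butter} each appear in 3 of 5 transactions, matching the 60% support figures in the example tables.

from itertools import chain

transactions = [                       # hypothetical dataset (5 baskets)
    {"Milk", "Bread", "Butter"},
    {"Milk", "Bread", "Butter"},
    {"Milk", "Bread", "Butter", "Beer"},
    {"Milk", "Beer"},
    {"Bread", "Beer"},
]
min_support = 0.5                      # 50%, as assumed in the example

def frequent_itemsets(transactions, min_support):
    n = len(transactions)
    # L1: frequent 1-itemsets.
    items = set(chain.from_iterable(transactions))
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support]
    result, k = {}, 2
    while current:
        for itemset in current:
            result[itemset] = sum(itemset <= t for t in transactions) / n
        # Generate candidate k-itemsets from frequent (k-1)-itemsets, then filter.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support]
        k += 1
    return result

for itemset, support in sorted(frequent_itemsets(transactions, min_support).items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), f"support={support:.0%}")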
Real-Life Applications of Apriori
• Retail & E-commerce: Identifying products often bought together.
• Healthcare: Analyzing patient symptoms and disease correlations.
• Banking: Detecting fraudulent transactions.
• Online Recommendations: Amazon, Netflix, and YouTube use association rules to recommend products and content.

The FP-Growth Algorithm
Discussing the Drawbacks of the Apriori Algorithm
The two primary drawbacks of the Apriori Algorithm are:
• It requires multiple database scans, making it slow for large datasets.
• It can generate many candidate itemsets, leading to high memory usage.
FP-Growth addresses these inefficiencies of the Apriori algorithm by using a more efficient approach to mine frequent itemsets, eliminating the need for multiple database scans and speeding up the overall process. It works in three main steps:
1. Data Compression: First, FP-Growth compresses the dataset into a smaller structure called the Frequent Pattern Tree (FP-Tree). This tree stores information about itemsets (collections of items) and their frequencies, without needing to generate candidate sets like Apriori does.
2. Mining the Tree: The algorithm then examines this tree to identify patterns that appear frequently, based on a minimum support threshold. It does this by breaking the tree down into smaller "conditional" trees for each item, making the process more efficient.
3. Generating Patterns: Once the tree is built and mined, the frequent patterns are read off from the conditional trees.

Let's understand this with the help of a real-life analogy:
Imagine you're organizing a large family reunion, and you want to know which food items are most popular among the guests. Instead of asking everyone individually and writing down their answers one by one, you decide to use a more efficient method.
Step 3: Look for Hidden Patterns
Next, instead of going back to every person to ask again about their preferences, you simply look at your list of items and patterns. You notice that people who brought pizza also often brought pasta, and those who brought cake also brought pasta. These hidden relationships (e.g., pizza + pasta, cake + pasta) are like the "frequent patterns" you find in FP-Growth.
Step 4: Simplify the Process
With FP-Growth, instead of scanning the entire party list multiple times to look for combinations of items, you've condensed all the information into a smaller, more manageable tree structure. You can now quickly see the most common combinations, like "pizza and pasta" or "cake and pasta," without the need to revisit every single detail.

Working of FP-Growth Algorithm
The frequency of each individual item in the example transactions is computed first:
Item   Frequency
A      1
C      2
D      1
E      4
I      1
K      5
M      3
N      2
O      4
U      1
Y      3
Let the minimum support be 3. A Frequent Pattern set is built which will contain all the elements whose frequency is greater than or equal to the minimum support. These elements are stored in descending order of their respective frequencies. After insertion of the relevant items, the set L looks like this:
L = {K : 5, E : 4, M : 3, O : 4, Y : 3}
Now, for each transaction, the respective Ordered-Item set is built. It is done by iterating the Frequent Pattern set and checking if the current item is contained in the transaction in question. If the current item is contained, the item is inserted in the Ordered-Item set for the current transaction. The following table is built for all the transactions:
Transaction ID   Items             Ordered-Item-Set
T1               {E,K,M,N,O,Y}     {K,E,M,O,Y}
T2               {D,E,K,N,O,Y}     {K,E,O,Y}
T3               {A,E,K,M}         {K,E,M}
T4               {C,K,M,U,Y}       {K,M,Y}
T5               {C,E,I,K,O,O}     {K,E,O}
Now, all the Ordered-Item sets are inserted into a Trie Data Structure.
a) Inserting the set {K, E, M, O, Y}:
All the items are simply linked one after the other in the order of occurrence in the set, and the support count of each new node is initialized to 1.
b) Inserting the set {K, E, O, Y}:
Till the insertion of the elements K and E, simply the support count is increased by 1. On inserting O we can see that there is no direct link between E and O, therefore a new node for the item O is initialized with the support count as 1 and item E is linked to this new node. On inserting Y, we first initialize a new node for the item Y
with support count as 1 and link the new node of O with the new node of Y.
d) Inserting the set {K, M, Y}:
Similar to step b), first the support count of K is increased, then new nodes for M and Y are initialized and linked accordingly.
e) Inserting the set {K, E, O}:
Here simply the support counts of the respective elements are increased. Note that the support count of the new node of item O is increased.
For each item, a Conditional Pattern Base and a Conditional Frequent Pattern Tree are then built; the support count of a conditional tree is obtained by summing the support counts of all the paths in the Conditional Pattern Base. For each row, two types of association rules can be inferred: for example, for the first row, which contains the elements K and Y, the rules K -> Y and Y -> K can be inferred. To determine the valid rule, the confidence of both rules is calculated.

Multilevel Association Rules
Association rules generated from mining data at different levels of abstraction are called multiple-level or multilevel association rules. Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework. Rules at a high concept level may add to common sense, while rules at a low concept level may not always be useful. Common approaches for setting the minimum support across levels include:
1. Uniform Support (using the same minimum support threshold at all levels)
2. Reduced Support (using a smaller minimum support threshold at lower levels)
3. Group-based Support (using item or group based support)
Let's discuss them one by one. The uniform support approach is simple in that users are required to specify only a single minimum support threshold. An optimization technique can be adopted, based on the knowledge that an ancestor is a superset of its descendants: the search avoids examining itemsets containing any item that does not have minimum support.
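Below is a short sketch reproducing the first part of the worked FP-Growth example above: counting item frequencies, keeping items whose count is at least the minimum support of 3, and building the ordered-item set for each transaction. Only this preprocessing is shown; the FP-tree construction itself is omitted.

from collections import Counter

transactions = {
    "T1": ["E", "K", "M", "N", "O", "Y"],
    "T2": ["D", "E", "K", "N", "O", "Y"],
    "T3": ["A", "E", "K", "M"],
    "T4": ["C", "K", "M", "U", "Y"],
    "T5": ["C", "E", "I", "K", "O", "O"],
}
min_support = 3

# Frequency of every item occurrence across all transactions (O is counted twice in T5).
freq = Counter(item for items in transactions.values() for item in items)

# Keep only items whose frequency meets the minimum support (A, C, D, I, N, U are dropped).
frequent = {item: count for item, count in freq.items() if count >= min_support}
print("Frequent items:", frequent)

# The notes insert the frequent items into L in the order K, E, M, O, Y,
# and the ordered-item sets follow that same order.
L = ["K", "E", "M", "O", "Y"]
for tid, items in transactions.items():
    ordered = [item for item in L if item in items]
    print(tid, ordered)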
5
Clustering
5.1 Clustering
A cluster is a group of similar objects. When we analyze data, we divide it into different groups based on similarities. Each group is called a cluster.
Clustering is a technique used to group similar data together without any predefined labels. It is called unsupervised learning because we don't have labeled categories in advance. Instead, we find patterns based on similarities in the data.
A good clustering method aims for:
• High similarity within a cluster – Objects in the same group should be very similar.
• Low similarity between clusters – Different groups should have distinct data.

5.2 Cluster Analysis
Cluster analysis, or clustering, is the process of dividing a dataset into meaningful groups. Each group contains similar objects but is different from other groups.
Example: Imagine we have a dataset with different types of vehicles (cars, buses, bicycles, etc.). Since clustering is unsupervised learning, we don't assign labels like "Car" or "Bike." Instead, the algorithm groups similar vehicles based on shared features.
Clustering is also used in data segmentation (dividing large datasets into smaller groups). It can also help in outlier detection, which finds unusual data points, such as fraudulent credit card transactions.

5.3 Applications of Clustering:
• Fraud detection (identifying unusual credit card transactions).
• Customer segmentation (grouping customers based on purchasing behavior).
• Medical diagnosis (grouping patients with similar symptoms).
• Image analysis (grouping similar images).

5.4 Requirements for Clustering Algorithms
1. Scalability:
The algorithm should work well for large datasets (millions of data points). If we use only a small sample, results may be biased.
2. Handling Different Data Types:
Some clustering algorithms work best with numbers, but data can also be text, categories, or images.
3. Identifying Different Shapes of Clusters:
Some clusters are circular, but real-world clusters can be irregular (like the spread of a wildfire).
4. Noisy Data Handling:
Data often has errors or missing values. Good clustering algorithms should be robust to such issues.
5. Handling New Data Incrementally:
If new data arrives, the algorithm should update clusters without starting from scratch.
6. High-Dimensional Data Support:
Some datasets have thousands of attributes (like text documents with thousands of keywords). Clustering should work efficiently even in such cases.
7. Flexibility with Constraints:
Sometimes, clustering must follow specific rules (e.g., placing ATMs in locations with high customer demand).
8. Easy to Understand and Use:
The results should be clear and useful for decision-making.

5.5 Working of Clustering
How Clustering Works:
• Some algorithms divide data into equal-sized groups (partitioning).
• Some allow overlapping clusters (objects can belong to multiple groups).
• Clustering uses different distance measures (like Euclidean distance) to find similar data points.
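As a tiny illustration of the distance measures mentioned above, the Euclidean distance between two data points can be computed as follows (the two example points are made up).

import math

def euclidean_distance(p, q):
    # Straight-line distance between two points with the same number of attributes.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((1.0, 2.0), (4.0, 6.0)))   # 5.0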
5.6 Clustering Methods
Clustering is a technique used to group similar data points together. There are different clustering methods, each with its own advantages and use cases. Below are the five major types of clustering methods explained in detail.
1. Partitioning Methods
Partitioning methods divide a dataset into k clusters, where k is a predefined number chosen by the user. Each data point belongs to exactly one cluster, and the method aims to optimize the grouping so that similar points stay together while different ones remain apart. The algorithm first creates an initial random partition and then iteratively refines it by shifting data points between clusters to improve overall grouping.
i. K-Means (A centroid-based technique):
The most commonly used partitioning method is K-Means, which minimizes the distance between points within the same cluster.
The K-means algorithm takes the input parameter K from the user and partitions the dataset containing N objects into K clusters so that the resulting similarity among the data objects inside the group (intracluster) is high but the similarity of data objects with the data objects from outside the cluster is low (intercluster). The similarity of the cluster is determined with respect to the mean value of the cluster. It is a type of square-error algorithm.
At the start, k objects are randomly chosen from the dataset, and each of them represents a cluster mean (centre). The rest of the data objects are assigned to the nearest cluster based on their distance from the cluster mean. The new mean of each cluster is then calculated with the added data objects.
Algorithm:
Input:
K: the number of clusters into which the dataset has to be divided
D: a dataset containing N objects
Output:
A dataset of K clusters
Method:
1. Randomly assign K objects from the dataset (D) as cluster centres (C).
2. (Re)assign each object to the cluster to whose mean it is most similar.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated values.
4. Repeat Step 2 until no change occurs.
Flowchart: (see figure)
Example: Suppose we want to group the visitors to a website using just their age, as follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66
Initial Cluster:
K = 2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
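Here is a minimal sketch of the 1-D K-means iterations on the ages above, starting from the same randomly chosen centroids 16 and 22. It prints the cluster members and means for each iteration, matching the values shown in the example (up to rounding).

ages = [16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
centroids = [16.0, 22.0]          # initial cluster means chosen from the dataset

for iteration in range(1, 10):
    # Assignment step: attach every age to the nearest centroid.
    clusters = [[], []]
    for age in ages:
        nearest = min(range(len(centroids)), key=lambda i: abs(age - centroids[i]))
        clusters[nearest].append(age)
    # Update step: recalculate each cluster mean.
    new_centroids = [sum(c) / len(c) for c in clusters]
    print(f"Iteration-{iteration}:")
    for i, (mean, members) in enumerate(zip(new_centroids, clusters), start=1):
        print(f"  C{i} = {mean:.2f} {members}")
    if new_centroids == centroids:   # stop when the means no longer change
        break
    centroids = new_centroids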