Unit - II
Data Mining Vs Data Warehousing
Data warehousing refers to the process of compiling and organizing data into one common database, whereas data
mining refers to the process of extracting useful data from that database. The data mining process depends on the data
compiled in the data warehousing phase to recognize meaningful patterns. A data warehouse is created to support
management decision-making systems.
Data Warehouse:
A data warehouse refers to a place where data can be stored for useful mining. It is like a fast computer system with
exceptionally large data storage capacity. Data from an organization's various systems is copied to the warehouse,
where it can be cleansed and conformed to remove errors. Complex analytical queries can then be run against the
warehoused data.
A data warehouse combines data from numerous sources, which ensures data quality, accuracy, and consistency. It also
boosts system performance by separating analytics processing from transactional databases. Data flows into a
data warehouse from different databases. A data warehouse works by organizing data into a schema that describes the
format and types of data; query tools then examine the data tables using this schema.
Data warehouses and databases are both relational data systems, but they are built to serve different purposes. A data
warehouse is built to store large amounts of historical data and to support fast queries across all of it, typically
using Online Analytical Processing (OLAP). A database is built to store current transactions and to allow quick access
to specific transactions for ongoing business processes, commonly known as Online Transaction Processing (OLTP).
Data Mining:
Data mining refers to the analysis of data. It is the computer-supported process of analyzing huge sets of data that have
either been compiled by computer systems or downloaded into the computer. In the data mining process, the
computer analyzes the data and extracts useful information from it. It looks for hidden patterns within the data set and
tries to predict future behavior. Data mining is primarily used to discover and indicate relationships among the data sets.
Data Warehouse and Data Mining
Data mining aims to enable business organizations to view business behaviors, trends, and relationships so that the
business can make data-driven decisions. It is also known as Knowledge Discovery in Databases (KDD). Data mining tools
utilize AI, statistics, databases, and machine learning systems to discover relationships within the data. Data mining
tools can answer business questions that were traditionally too time-consuming to resolve.
Important features of Data Mining:
The important features of Data Mining are given below:
It utilizes the automated discovery of patterns.
It predicts the expected results.
It focuses on large data sets and databases.
It creates actionable information.
Classification of Data Mining Systems:
Classification based on the mined databases
Classification based on the type of mined knowledge
Classification based on statistics
Classification based on machine learning
Classification based on visualization
Classification based on information science
Classification based on utilized techniques
Classification based on adapted applications
Types of Data Mining
Each of the following data mining techniques serves different business problems and provides a different insight
into each of them. Understanding the type of business problem you need to solve will help in knowing which
technique to use and which will yield the best results. Data mining can be divided into two
basic types:
1. Predictive Data Mining Analysis
2. Descriptive Data Mining Analysis
Data Source:
The actual sources of data are databases, data warehouses, the World Wide Web (WWW), text files, and other documents.
You need a large amount of historical data for data mining to be successful. Organizations typically store data in
databases or data warehouses. Data warehouses may comprise one or more databases, text files, spreadsheets, or other
repositories of data. Sometimes even plain text files or spreadsheets may contain information. Another primary source
of data is the World Wide Web, or the internet.
Different processes:
Before being passed to the database or data warehouse server, the data must be cleaned, integrated, and selected. Because
the information comes from various sources and in different formats, it cannot be used directly for the data mining
procedure: the data may be incomplete or inaccurate. So the data first needs to be cleaned and unified. Since more
information than needed is collected from the various data sources, only the data of interest is selected and passed
to the server. These procedures are not as easy as they sound; several methods may be performed on the data
as part of selection, integration, and cleaning.
Database or Data Warehouse Server:
The database or data warehouse server contains the actual data, ready to be processed. The server is
responsible for retrieving the relevant data for data mining as per the user's request.
Data Mining Engine:
The data mining engine is a major component of any data mining system. It contains several modules for performing data
mining tasks, including association, characterization, classification, clustering, prediction, and time-series analysis.
In other words, the data mining engine is the core of the data mining architecture. It comprises the instruments and software
used to obtain insights and knowledge from the data collected from various data sources and stored within the data
warehouse.
Pattern Evaluation Module:
The pattern evaluation module is primarily responsible for measuring the interestingness of a pattern by using a
threshold value. It collaborates with the data mining engine to focus the search on interesting patterns.
This segment commonly employs interestingness measures that cooperate with the data mining modules to focus the search
towards interesting patterns, and it may use an interestingness threshold to filter out discovered patterns. Alternatively,
the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data
mining techniques used. For efficient data mining, it is highly recommended to push the evaluation of pattern
interestingness as deep as possible into the mining procedure, to confine the search to only the interesting patterns.
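The role of the pattern evaluation module can be sketched in a few lines of Python. The pattern list and the threshold values below are illustrative assumptions, not taken from any particular system: discovered association patterns are filtered by interestingness (support and confidence) thresholds.

```python
# Hypothetical sketch of a pattern evaluation module: discovered association
# patterns are filtered by interestingness (support/confidence) thresholds.
# The pattern list and threshold values are illustrative, not from the text.

patterns = [
    {"rule": "bread -> butter", "support": 0.40, "confidence": 0.80},
    {"rule": "milk -> bread",   "support": 0.05, "confidence": 0.90},
    {"rule": "tea -> sugar",    "support": 0.30, "confidence": 0.60},
]

MIN_SUPPORT = 0.10      # interestingness thresholds (assumed values)
MIN_CONFIDENCE = 0.70

def is_interesting(p):
    """Keep only patterns that clear both thresholds."""
    return p["support"] >= MIN_SUPPORT and p["confidence"] >= MIN_CONFIDENCE

interesting = [p["rule"] for p in patterns if is_interesting(p)]
print(interesting)  # only 'bread -> butter' clears both thresholds
```

In a real system this filter would run inside the mining loop itself, as the text suggests, so that uninteresting candidates are pruned early rather than generated and then discarded.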
Graphical User Interface:
The graphical user interface (GUI) module communicates between the data mining system and the user. It helps the user
use the system easily and efficiently, without needing to know the complexity of the process. The module cooperates
with the data mining system when the user specifies a query or a task, and it displays the results.
Knowledge Base:
The knowledge base is helpful throughout the data mining process. It can guide the search or evaluate the
interestingness of the resulting patterns. The knowledge base may even contain user views and data from user experiences
that can be helpful in the data mining process. The data mining engine may receive inputs from the knowledge base to
make the results more accurate and reliable, and the pattern evaluation module regularly interacts with the knowledge
base to get inputs and to update it.
Challenges of Data Mining
Data mining, the process of extracting knowledge from data, has become increasingly important as the amount of data
generated by individuals, organizations, and machines has grown exponentially. However, data mining is not without
its challenges. In this article, we will explore some of the main challenges of data mining.
1] Data Quality
The quality of data used in data mining is one of the most significant challenges. The accuracy, completeness, and
consistency of the data affect the accuracy of the results obtained. The data may contain errors, omissions, duplications,
or inconsistencies, which may lead to inaccurate results. Moreover, the data may be incomplete, meaning that some
attributes or values are missing, making it challenging to obtain a complete understanding of the data.
Data quality issues can arise due to a variety of reasons, including data entry errors, data storage issues, data integration
problems, and data transmission errors. To address these challenges, data mining practitioners must apply data cleaning
and data preprocessing techniques to improve the quality of the data. Data cleaning involves detecting and correcting
errors, while data preprocessing involves transforming the data to make it suitable for data mining.
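A minimal data-cleaning sketch in Python (the records and rules below are assumed for illustration): inconsistent casing is normalized, exact duplicates are removed, and a missing numeric attribute is filled with the mean of the known values.

```python
# A minimal data-cleaning sketch (assumed records): normalize inconsistent
# text casing, drop exact duplicates, and fill a missing numeric attribute
# with the column mean.

records = [
    {"name": "Alice", "age": 30},
    {"name": "alice", "age": 30},     # duplicate after case normalization
    {"name": "Bob",   "age": None},   # missing value
    {"name": "Carol", "age": 40},
]

# 1) Normalize casing so duplicates can be detected.
for r in records:
    r["name"] = r["name"].strip().title()

# 2) Remove duplicates while preserving order.
seen, cleaned = set(), []
for r in records:
    key = (r["name"], r["age"])
    if key not in seen:
        seen.add(key)
        cleaned.append(r)

# 3) Fill missing ages with the mean of the known ages.
known = [r["age"] for r in cleaned if r["age"] is not None]
mean_age = sum(known) / len(known)
for r in cleaned:
    if r["age"] is None:
        r["age"] = mean_age

print(cleaned)
```

Real preprocessing pipelines apply many more such rules, but each follows this same detect-and-correct pattern.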
2] Data Complexity
Data complexity refers to the vast amounts of data generated by various sources, such as sensors, social media, and the
internet of things (IoT). The complexity of the data may make it challenging to process, analyze, and understand. In
addition, the data may be in different formats, making it challenging to integrate into a single dataset.
To address this challenge, data mining practitioners use advanced techniques such as clustering, classification, and
association rule mining. These techniques help to identify patterns and relationships in the data, which can then be used
to gain insights and make predictions.
4] Scalability
Data mining algorithms must be scalable to handle large datasets efficiently. As the size of the dataset increases, the
time and computational resources required to perform data mining operations also increase. Moreover, the algorithms
must be able to handle streaming data, which is generated continuously and must be processed in real-time.
To address this challenge, data mining practitioners use distributed computing frameworks such as Hadoop and Spark.
These frameworks distribute the data and processing across multiple nodes, making it possible to process large datasets
quickly and efficiently.
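The partition-and-merge idea behind frameworks such as Hadoop and Spark can be illustrated with a toy Python sketch; in a real framework each partition would be processed on a separate node rather than in a local loop.

```python
# Toy illustration of the partition-and-merge idea behind distributed
# frameworks: split the dataset into partitions, aggregate each partition
# independently (on separate nodes in a real cluster), then merge the
# partial results.

from collections import Counter

data = ["ok", "error", "ok", "ok", "error", "ok", "ok", "warn"]

def split(seq, n):
    """Split seq into n roughly equal partitions."""
    size = (len(seq) + n - 1) // n
    return [seq[i:i + size] for i in range(0, len(seq), size)]

partitions = split(data, 3)
partials = [Counter(p) for p in partitions]   # "map": aggregate each partition
merged = sum(partials, Counter())             # "reduce": merge partial counts
print(merged)
```

Because each partial count depends only on its own partition, the "map" step parallelizes trivially; only the cheap merge step needs the combined results.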
5] Interpretability
Data mining algorithms can produce complex models that are difficult to interpret. This is because the algorithms use a
combination of statistical and mathematical techniques to identify patterns and relationships in the data. Moreover, the
models may not be intuitive, making it challenging to understand how the model arrived at a particular conclusion.
To address this challenge, data mining practitioners use visualization techniques to represent the data and the models
visually. Visualization makes it easier to understand the patterns and relationships in the data and to identify the most
important variables.
6] Ethics
Data mining raises ethical concerns related to the collection, use, and dissemination of data. The data may be used to
discriminate against certain groups, violate privacy rights, or perpetuate existing biases. Moreover, data mining
algorithms may not be transparent, making it challenging to detect biases or discrimination.
Major Issues in Data Mining:
Interactive mining of knowledge at multiple levels of abstraction. - The data mining process needs to be
interactive because it allows users to focus the search for patterns, providing and refining data mining requests
based on the returned results.
Incorporation of background knowledge. - Background knowledge can be used to guide the discovery process and
to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.
Data mining query languages and ad hoc data mining. - A data mining query language that allows the user
to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for
efficient and flexible data mining.
Presentation and visualization of data mining results. - Once the patterns are discovered, they need to be
expressed in high-level languages and visual representations. These representations should
be easily understandable by the users.
Handling noisy or incomplete data. - Data cleaning methods are required that can handle noise and
incomplete objects while mining the data regularities. Without such methods, the accuracy
of the discovered patterns will be poor.
Pattern evaluation. - This refers to the interestingness of the discovered patterns. A pattern may be
uninteresting because it represents common knowledge or lacks novelty, so interestingness measures are needed
to guide and evaluate the discovery.
Efficiency and scalability of data mining algorithms. - In order to effectively extract information from the
huge amount of data in databases, data mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms. - Factors such as the huge size of databases, the wide
distribution of data, and the complexity of data mining methods motivate the development of parallel and
distributed data mining algorithms. These algorithms divide the data into partitions which are processed in
parallel; the results from the partitions are then merged. Incremental algorithms update the mining results as
the database is updated, without having to mine the entire data again from scratch.
On-Line Transaction Processing (OLTP)
An On-Line Transaction Processing (OLTP) system is a type of computer system that manages transaction-related
tasks. These systems are made to handle transactions and queries (insert, delete, and update) quickly, often over
the internet. Almost every industry nowadays uses OLTP systems to keep track of its transactional data. OLTP systems
mainly focus on entering, storing, and retrieving data, covering daily operations like purchasing, manufacturing,
payroll, and accounting. Many users perform short transactions on these systems. They support simple database
queries, which makes it easy for users to get quick responses.
Type of queries that an OLTP system can Process
Insert queries
OLTP systems can process insert queries that add new data to the database, such as when a customer purchases a product.
Update queries
OLTP systems can process update queries that modify existing data in the database, such as when a customer changes
their address.
Delete queries
OLTP systems can process delete queries that remove data from the database, such as when a customer cancels an order.
Simple select queries
OLTP systems can process simple select queries that retrieve data from the database, such as when a customer searches
for a product.
Join queries
OLTP systems can process join queries that retrieve data from multiple tables in the database, such as when a customer
wants to see all their orders and the corresponding product details.
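The query types above can be demonstrated with a small, self-contained Python sketch using an in-memory SQLite database; the tables and sample rows are assumed for illustration.

```python
# Runnable sketch of the OLTP query types above, using an in-memory SQLite
# database. Table names and sample rows are assumed for illustration.

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, product TEXT)")

# Insert query: a customer purchases a product.
cur.execute("INSERT INTO customers VALUES (1, 'Alice', 'Pune')")
cur.execute("INSERT INTO orders VALUES (10, 1, 'Mobile')")

# Update query: the customer changes their address.
cur.execute("UPDATE customers SET city = 'Mumbai' WHERE id = 1")

# Simple select query: retrieve the customer's current city.
row = cur.execute("SELECT city FROM customers WHERE id = 1").fetchone()
print(row)  # ('Mumbai',)

# Join query: the customer's orders with the corresponding product details.
orders = cur.execute(
    "SELECT c.name, o.product FROM customers c "
    "JOIN orders o ON o.customer_id = c.id"
).fetchall()
print(orders)  # [('Alice', 'Mobile')]

# Delete query: the customer cancels the order.
cur.execute("DELETE FROM orders WHERE id = 10")
con.commit()
```

Each statement is short and touches only a few rows, which is exactly the access pattern OLTP systems are tuned for.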
Page 9 of 20
Data Warehouse and Data Mining
Who uses OLAP and Why?
OLAP applications are used across a variety of functions of an organization.
Finance and accounting:
o Budgeting
o Activity-based costing
o Financial performance analysis
o Financial modeling
Sales and Marketing
o Sales analysis and forecasting
o Market research analysis
o Promotion analysis
o Customer analysis
o Market and customer segmentation
Production
o Production planning
o Defect analysis
Characteristics of OLAP
OLAP methods are often summarized by the FASMI characteristics, a term derived from the first letters of:
Fast
Analysis
Shared
Multidimensional
Information
Consider the OLAP operations to be performed on multidimensional data. The figure shows a data cube for the
sales of a shop. The cube contains the dimensions location, time, and item, where location is aggregated with
respect to city values, time is aggregated with respect to quarters, and item is aggregated with respect to item types.
Roll-Up
The roll-up operation (also known as the drill-up or aggregation operation) performs aggregation on a data cube, either
by climbing up a concept hierarchy or by dimension reduction. Roll-up is like zooming out on the data cube. The figure
shows the result of a roll-up operation performed on the dimension location. The concept hierarchy for location is
defined as the order street < city < province or state < country. The roll-up operation aggregates the data by ascending
the location hierarchy from the level of city to the level of country.
When roll-up is performed by dimension reduction, one or more dimensions are removed from the cube. For example,
consider a sales data cube having the two dimensions location and time. Roll-up may be performed by removing the time
dimension, resulting in an aggregation of the total sales by location, rather than by location and by time.
Example
Consider the following cube, recording how many days in each week a given temperature was observed:

Temperature  64 65 68 69 70 71 72 75 80 81 83 85
Week1         1  0  1  0  1  0  0  0  0  0  1  0
Week2         0  0  0  1  0  0  1  2  0  1  0  0

Suppose we want to set up levels (hot (80-85), mild (70-75), cool (64-69)) for temperature in the above cube.
To do this, we group the columns and add up the values according to the concept hierarchy. This operation is known
as a roll-up. By doing this, we obtain the following cube:

Temperature  cool mild hot
Week1           2    1   1
Week2           1    3   1
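The same roll-up can be recomputed in plain Python from the weekly temperature counts, grouping the columns by the cool/mild/hot concept hierarchy and summing each group.

```python
# Recomputing the roll-up in plain Python: the per-temperature counts are
# grouped into the concept-hierarchy levels cool/mild/hot and summed.

temps = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
cube = {
    "Week1": [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0],
    "Week2": [0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 0, 0],
}

# Concept hierarchy: each level covers a range of raw temperatures.
levels = {"cool": range(64, 70), "mild": range(70, 76), "hot": range(80, 86)}

rolled = {
    week: {
        level: sum(v for t, v in zip(temps, values) if t in rng)
        for level, rng in levels.items()
    }
    for week, values in cube.items()
}
print(rolled)
```

Computing the aggregates directly from the raw counts like this is a useful sanity check on any hand-built roll-up table.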
Drill-Down
The drill-down operation (also called roll-down) is the reverse of roll-up. Drill-down is like zooming in on
the data cube. It navigates from less detailed data to more detailed data. Drill-down can be performed by
either stepping down a concept hierarchy for a dimension or adding an additional dimension.
The figure shows a drill-down operation performed on the dimension time by stepping down a concept hierarchy
defined as day < month < quarter < year. Drill-down descends the time hierarchy from the level of quarter to the
more detailed level of month.
Because drill-down adds more detail to the given data, it can also be performed by adding a new dimension to a cube.
For example, a drill-down on the central cube of the figure can occur by introducing an additional dimension, such as
customer group.
Example
Drill-down adds more detail to the given data; here the weekly cube is expanded to daily values:

Temperature  cool mild hot
Day 1           0    0   0
Day 2           0    0   0
Day 3           0    0   1
Day 4           0    1   0
Day 5           1    0   0
Day 6           0    0   0
Day 7           1    0   0
Day 8           0    0   0
Day 9           1    0   0
Day 10          0    1   0
Day 11          0    1   0
Day 12          0    1   0
Day 13          0    0   1
Day 14          0    0   0
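A small Python sketch of drill-down on this data: given the daily (finer-granularity) cool-temperature counts behind the weekly cube, we can navigate from a weekly aggregate back down to the days that produced it.

```python
# Drill-down sketch: given the daily records behind the weekly cube,
# navigate from a weekly aggregate back down to the per-day detail.
# Daily cool-column values follow the drill-down table above.

daily_cool = {
    1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 0, 7: 1,        # Week1
    8: 0, 9: 1, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0,   # Week2
}

def drill_down(week):
    """Return the per-day detail behind one week's aggregated count."""
    days = range(1, 8) if week == "Week1" else range(8, 15)
    return {d: daily_cool[d] for d in days}

detail = drill_down("Week1")
print(sum(detail.values()))  # the weekly roll-up value for cool: 2
```

Note that drill-down is only possible because the finer-granularity data still exists; the cube merely changes which level of the hierarchy is displayed.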
Slice
A slice is a subset of the cube corresponding to a single value for one or more members of a dimension. For example,
a slice operation is executed when the user wants a selection on one dimension of a three-dimensional cube,
resulting in a two-dimensional subcube. So the slice operation performs a selection on one dimension of the given cube,
thus resulting in a subcube.
For example, if we make the selection temperature = cool, we obtain the following cube:

Temperature cool
Day 1 0
Day 2 0
Day 3 0
Day 4 0
Day 5 1
Day 6 0
Day 7 1
Day 8 0
Day 9 1
Day 10 0
Day 11 0
Day 12 0
Day 13 0
Day 14 0
In the figure, slice is performed on the dimension time using the criterion time = "Q1".
It forms a new subcube by selecting one or more dimensions.
Dice
The dice operation defines a subcube by performing a selection on two or more dimensions.
For example, applying the selection (time = Day 3 OR time = Day 4) AND (temperature = cool OR temperature = hot)
to the original cube, we get the following subcube (still two-dimensional):

Temperature  cool hot
Day 3           0   1
Day 4           0   0
In the figure, the dice operation on the cube is based on the following selection criteria, involving three dimensions:
o (location = "Toronto" or "Vancouver")
o (time = "Q1" or "Q2")
o (item =" Mobile" or "Modem")
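Slice and dice can be sketched on a small dict-based cube in Python; the cube below holds (day, temperature-level) cells, with values taken from the daily example above.

```python
# Slice and dice sketched on a small dict-based cube keyed by
# (day, temperature-level). Cell values follow the daily example above.

cube = {
    ("Day 3", "cool"): 0, ("Day 3", "mild"): 0, ("Day 3", "hot"): 1,
    ("Day 4", "cool"): 0, ("Day 4", "mild"): 1, ("Day 4", "hot"): 0,
    ("Day 5", "cool"): 1, ("Day 5", "mild"): 0, ("Day 5", "hot"): 0,
}

# Slice: fix ONE dimension (temperature = "cool") -> a lower-dimensional view.
slice_cool = {day: v for (day, lvl), v in cube.items() if lvl == "cool"}

# Dice: select on TWO dimensions -> a subcube.
dice = {
    (day, lvl): v for (day, lvl), v in cube.items()
    if day in ("Day 3", "Day 4") and lvl in ("cool", "hot")
}

print(slice_cool)  # one value per day for the cool level
print(dice)        # the 2x2 subcube for days 3-4, levels cool/hot
```

The difference is simply how many dimensions the selection constrains: one for slice, two or more for dice.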
Pivot
The pivot operation is also called rotation. Pivot is a visualization operation that rotates the data axes in order to
provide an alternative presentation of the data. It may involve swapping the rows and columns, or moving one of the
row dimensions into the column dimensions.
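Pivoting a small two-dimensional view can be sketched in Python as a transpose of a nested dict; the weekly cool/mild/hot counts are used here as an assumed input.

```python
# Pivot (rotation) sketch: swap the row and column dimensions of a small
# 2-D view, implemented as a plain transpose of a nested dict.

view = {
    "Week1": {"cool": 2, "mild": 1, "hot": 1},
    "Week2": {"cool": 1, "mild": 3, "hot": 1},
}

# Rotate: temperature levels become rows, weeks become columns.
pivoted = {}
for week, row in view.items():
    for level, value in row.items():
        pivoted.setdefault(level, {})[week] = value

print(pivoted["cool"])  # {'Week1': 2, 'Week2': 1}
```

No values change during a pivot; only the presentation of the axes is rearranged.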
Types of OLAP
There are three main types of OLAP servers: Relational OLAP (ROLAP), Multidimensional OLAP (MOLAP), and
Hybrid OLAP (HOLAP).
Multidimensional OLAP (MOLAP) Server
The MOLAP structure primarily reads precompiled data. It has limited capabilities to dynamically create
aggregations or to evaluate results that have not been pre-calculated and stored.
Hybrid OLAP (HOLAP) Server
HOLAP incorporates the best features of MOLAP and ROLAP into a single architecture. HOLAP systems keep the
larger quantities of detailed data in relational tables, while the aggregations are stored in pre-calculated
cubes. HOLAP can also drill through from the cube down to the relational tables for detailed data. Microsoft
SQL Server 2000 provides a hybrid OLAP server.
1) Users: OLTP systems are designed for office workers, while OLAP systems are designed for decision-makers.
Therefore, while an OLTP system may be accessed by hundreds or even thousands of clients in a large enterprise, an
OLAP system is likely to be accessed only by a select class of managers, perhaps only dozens of users.
2) Functions: OLTP systems are mission-critical. They support the day-to-day operations of an enterprise and are largely
performance- and availability-driven. These operations are simple and repetitive. OLAP systems are
management-critical: they support an enterprise's decision-making tasks through detailed analysis.
3) Nature: Although SQL queries return a set of records, OLTP systems are designed to process one record at a time, for
example, a record related to a customer who may be on the phone or in the store. OLAP systems are not designed to deal
with individual customer records; instead, they handle queries that touch many records at a time and provide summary or
aggregate information to a manager. OLAP applications use data stored in a data warehouse that has been
extracted from many tables, and possibly from more than one enterprise database.
4) Design: OLTP database operations are designed to be application-oriented, while OLAP operations are designed to
be subject-oriented. OLTP systems view the enterprise data as a collection of tables (possibly based on an entity-
relationship model); OLAP operations view enterprise information as multidimensional.
5) Data: OLTP systems usually deal only with the current status of data. For example, a record about an employee who
left three years ago may no longer be available in the Human Resources system; the old data may have been archived on
some type of stable storage medium and may not be accessible online. OLAP systems, on the other hand, need historical
data over several years, since trends are often essential in decision making.
6) Kind of use: OLTP systems are used for read and write operations, while OLAP systems normally do not update
the data.
7) View: An OLTP system focuses primarily on the current data within an enterprise or department, without referring
to historical data or data in other organizations. In contrast, an OLAP system spans multiple versions of a database
schema, due to the evolutionary process of an organization. OLAP systems also deal with information that originates
from different organizations, integrating information from many data stores. Because of their huge volume, these data
are stored on multiple storage media.
8) Access Patterns: The access pattern of an OLTP system consists mainly of short, atomic transactions. Such a system
requires concurrency control and recovery techniques. Access to OLAP systems, however, is mostly read-only,
because these data warehouses store historical information.
The biggest difference between an OLTP and an OLAP system is the amount of data analyzed in a single transaction.
Whereas an OLTP system handles many concurrent users and queries touching only a single record or a limited collection
of records at a time, an OLAP system must be able to operate on millions of records to answer a single query.