
15CS651 DATA MINING & DATA WAREHOUSING
Dr. Suresh Y

B.E. in Computer Science & Engineering


Sixth Semester
Professional Elective: Data Mining & Data Warehousing
SYLLABUS

Module – 1:
Data Warehousing & Modeling: Basic Concepts: Data Warehousing: A Multitier Architecture; Data Warehouse Models: Enterprise Warehouse, Data Mart and Virtual Warehouse; Extraction, Transformation and Loading; Data Cube: A Multidimensional Data Model; Stars, Snowflakes and Fact Constellations: Schemas for Multidimensional Data Models; Dimensions: The Role of Concept Hierarchies; Measures: Their Categorization and Computation; Typical OLAP Operations.

Module – 2:
Data Warehouse Implementation & Data Mining: Efficient Data Cube Computation: An Overview, Indexing OLAP Data: Bitmap Index and Join Index, Efficient Processing of OLAP Queries, OLAP Server Architecture: ROLAP versus MOLAP versus HOLAP. Introduction: What is Data Mining, Challenges, Data Mining Tasks, Data: Types of Data, Data Quality, Data Pre-processing, Measures of Similarity and Dissimilarity.

Module – 3:
Association Analysis: Association Analysis: Problem Definition, Frequent Item set
Generation, Rule generation. Alternative Methods for Generating Frequent Item sets, FP-
Growth Algorithm, Evaluation of Association Patterns.

Module – 4:
Classification: Decision Trees Induction, Method for Comparing Classifiers, Rule Based
Classifiers, Nearest Neighbor Classifiers, Bayesian Classifiers.

Module – 5:
Clustering Analysis: Overview, K-Means, Agglomerative Hierarchical Clustering,
DBSCAN, Cluster Evaluation, Density-Based Clustering, Graph-Based Clustering, Scalable
Clustering Algorithms.

Text Books:
1. Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Data Mining, Pearson, first impression, 2014.
2. Jiawei Han, Micheline Kamber, Jian Pei: Data Mining - Concepts and Techniques, 3rd Edition, Morgan Kaufmann, 2012.

Reference Books:
1. Sam Anahory, Dennis Murray: Data Warehousing in the Real World, Pearson, tenth impression, 2012.
2. Michael J. Berry, Gordon S. Linoff: Mastering Data Mining, Wiley, 2nd Edition, 2012.


Module – 1: Data Warehousing & Modeling


Basic Concepts:

Data mining is an essential step in the process of knowledge discovery in databases (KDD).


Knowledge discovery as a process is depicted in Figure-1 and consists of an iterative sequence of
the following steps:
1. Data Cleaning: to remove noise or irrelevant data.
2. Data Integration: where multiple data sources may be combined.
3. Data Selection: where data relevant to the analysis task are retrieved from the database.
4. Data Transformation: where data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
5. Data Mining: an essential process where intelligent methods are applied in order to
extract data patterns.
6. Pattern Evaluation: to identify the truly interesting patterns representing knowledge
based on some interestingness measures.
7. Knowledge Presentation: where visualization and knowledge representation techniques
are used to present the mined knowledge to the user.

Figure 1: Data mining as a step in the process of knowledge discovery


Figure 2 shows the typical framework for the construction and use of a data warehouse for AllElectronics.

ETL

An ODS or a data warehouse is based on a single global schema that integrates and
consolidates enterprise information from many sources. Building such a system requires data
acquisition from OLTP and legacy systems. The ETL process involves extracting,
transforming and loading data from source systems. The process may sound very simple, since it only involves reading information from source databases, transforming it to fit the ODS database model, and loading it into the ODS; in practice, however, it is considerably more complex.

As different data sources tend to have different conventions for coding information and
different standards for the quality of information, building an ODS requires data filtering, data
cleaning, and integration.

The following examples show the importance of data cleaning:

If an enterprise wishes to contact its customers or its suppliers, it is essential that a complete,
accurate and up-to-date list of contact addresses, email addresses and telephone numbers be
available. Correspondence sent to a wrong address that is then redirected does not create a
very good impression about the enterprise.

If a customer or supplier calls, the staff responding should be able to quickly find the person in the enterprise database, but this requires that the caller's name or his/her company name is accurately listed in the database.

If a customer appears in the databases with two or more slightly different names or different account numbers, it becomes difficult to update the customer's information.


It has been suggested that data cleaning should be based on the following five steps:

1. Parsing: Parsing identifies various components of the source data files and then
establishes relationships between those and the fields in the target files. The classical
example of parsing is identifying the various components of a person's name and address.

2. Correcting: Correcting the identified components is usually based on a variety of sophisticated techniques, including mathematical algorithms. Correcting may involve use of other related information that may be available in the enterprise.

3. Standardizing: Business rules of the enterprise may now be used to transform the data to
standard form. For example, in some companies there might be rules on how name and
address are to be represented.

4. Matching: Much of the data extracted from a number of source systems is likely to be
related. Such data needs to be matched.

5. Consolidating: All corrected, standardized and matched data can now be consolidated to
build a single version of the enterprise data.
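
The five cleaning steps above can be illustrated with a small, deliberately simplified Python sketch. The record layout, field names, and matching rule are invented for illustration; real parsing and matching tools use far more sophisticated techniques.

    # Minimal sketch of parse -> correct/standardize -> match -> consolidate
    # for customer records (field names and rules are hypothetical).
    def parse(record):
        # Parsing: split a raw "name, street, city" string into components.
        name, street, city = [part.strip() for part in record.split(",")]
        return {"name": name, "street": street, "city": city}

    def standardize(fields):
        # Standardizing: apply simple business rules (title case, expand abbreviations).
        fields["name"] = fields["name"].title()
        fields["street"] = fields["street"].title().replace("Rd.", "Road")
        fields["city"] = fields["city"].title()
        return fields

    def match(a, b):
        # Matching: treat two records as the same customer if name and city agree.
        return a["name"] == b["name"] and a["city"] == b["city"]

    source_a = standardize(parse("j. smith, 12 main rd., ballari"))
    source_b = standardize(parse("J. Smith , 12 Main Road, Ballari"))

    if match(source_a, source_b):
        # Consolidating: keep a single version of the customer record.
        consolidated = {**source_b, **source_a}
        print(consolidated)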

GUIDELINES FOR DATA WAREHOUSE: Implementation steps


1. Requirements analysis and capacity planning: As in other projects, the first step in data warehousing involves defining enterprise needs, defining the architecture, carrying out capacity planning, and selecting the hardware and software tools. This step will involve consulting senior management as well as the various stakeholders.

2. Hardware integration: Once the hardware and software have been selected, they need to be
put together by integrating the servers, the storage devices and the client software tools.

3. Modelling: Modelling is a major step that involves designing the warehouse schema and
views. This may involve using a modelling tool if the data warehouse is complex.

4. Physical modelling: For the data warehouse to perform efficiently, physical modelling is
required. This involves designing the physical data warehouse organization, data placement,
data partitioning, deciding on access methods and indexing.

5. Sources: The data for the data warehouse is likely to come from a number of data sources.
This step involves identifying and connecting the sources using gateways, ODBC drivers, or other wrappers.

6. ETL: The data from the source systems will need to go through an ETL process. The step of
designing and implementing the ETL process may involve identifying a suitable ETL tool
vendor and purchasing and implementing the tool. This may include customizing the tool to
suit the needs of the enterprise.


7. Populate the data warehouse: Once the ETL tools have been agreed upon, testing the tools
will be required, perhaps using a staging area. Once everything is working satisfactorily, the
ETL tools may be used in populating the warehouse given the schema and view definitions.

8. User applications: For the data warehouse to be useful there must be end-user applications.
This step involves designing and implementing applications required by the end users.

9. Roll-out the warehouse and applications: Once the data warehouse has been populated and
the end-user applications tested, the warehouse system and the applications may be rolled out
for the user community to use.

Implementation Guidelines

1. Build incrementally: Data warehouses must be built incrementally. Generally, it is recommended that a data mart first be built with one particular project in mind; once it is implemented, a number of other sections of the enterprise may also wish to implement similar systems. An enterprise data warehouse can then be implemented in an iterative manner, allowing all data marts to extract information from the data warehouse. Data warehouse modeling is itself an iterative methodology, as users become familiar with the technology and are then able to understand and express their requirements more clearly.

2. Need a champion: A data warehouse project must have a champion who is willing to carry
out considerable research into expected costs and benefits of the project. Data warehousing
projects require inputs from many units in an enterprise and therefore need to be driven by
someone who is capable of interaction with people in the enterprise and can actively persuade
colleagues. Without the cooperation of other units, the data model for the warehouse and the
data required to populate the warehouse may be more complicated than they need to be.
Studies have shown that having a champion can help adoption and success of data
warehousing projects.

3. Senior management support: A data warehouse project must be fully supported by the senior
management. Given the resource intensive nature of such projects and the time they can take
to implement, a warehouse project calls for a sustained commitment from senior
management. This can sometimes be difficult since it may be hard to quantify the benefits of
data warehouse technology and the managers may consider it a cost without any explicit
return on investment. Data warehousing project studies show that top management support is
essential for the success of a data warehousing project.

4. Ensure quality: Only data that has been cleaned and is of a quality that is understood by the
organization should be loaded in the data warehouse. The data quality in the source systems is
not always high and often little effort is made to improve data quality in the source systems.
Improved data quality, when recognized by senior managers and stakeholders, is likely to lead
to improved support for a data warehouse project.

5. Corporate strategy: A data warehouse project must fit with corporate strategy and business
objectives. The objectives of the project must be clearly defined before the start of the project.
Given the importance of senior management support for a data warehousing project, the
fitness of the project with the corporate strategy is essential.


6. Business plan: The financial costs (hardware, software, people-ware), expected benefits and a
project plan (including an ETL plan) for a data warehouse project must be clearly outlined
and understood by all stakeholders. Without such understanding, rumours about expenditure
and benefits can become the only source of information, undermining the project.

7. Training: A data warehouse project must not overlook data warehouse training requirements.
For a data warehouse project to be successful, the users must be trained to use the warehouse
and to understand its capabilities. Training of users and professional development of the
project team may also be required since data warehousing is a complex task and the skills of
the project team are critical to the success of the project.
8. Adaptability: The project should build in adaptability so that changes may be made to the data
warehouse if and when required. Like any system, a data warehouse will need to change, as
needs of an enterprise change. Furthermore, once the data warehouse is operational, new
applications using the data warehouse are almost certain to be proposed. The system should
be able to support such new applications.

9. Joint management: The project must be managed by both IT and business professionals in
the enterprise. To ensure good communication with the stakeholders and that the project is
focused on assisting the enterprise's business, business professionals must be involved in the
project along with technical professionals.

Which Technologies Are Used?


As a highly application-driven domain, data mining has incorporated many techniques from
other domains such as statistics, machine learning, pattern recognition, database and data
warehouse systems, information retrieval, visualization, algorithms, high-performance
computing, and many application domains.
The interdisciplinary nature of data mining research and development contributes
significantly to the success of data mining and its extensive applications. In this section, we
give examples of several disciplines that strongly influence the development of data mining
methods.

1. Statistics

Statistics studies the collection, analysis, interpretation or explanation, and presentation of data. Data mining has an inherent connection with statistics. A statistical model is a set of
mathematical functions that describe the behavior of the objects in a target class in terms of
random variables and their associated probability distributions. Statistical models are widely
used to model data and data classes.

For example, in data mining tasks like data characterization and classification,
statistical models of target classes can be built. In other words, such statistical models can be
the outcome of a data mining task. Alternatively, data mining tasks can be built on top of
statistical models. For example, we can use statistics to model noise and missing data
values. Then, when mining patterns in a large data set, the data mining process can use the
model to help identify and handle noisy or missing values in the data.


Figure 3: Domains contributing to the data mining process.

Statistical methods can also be used to verify data mining results. For example, after a
classification or prediction model is mined, the model should be verified by statistical
hypothesis testing. A statistical hypothesis test (sometimes called confirmatory data analysis)
makes statistical decisions using experimental data. A result is called statistically significant
if it is unlikely to have occurred by chance. If the classification or prediction model holds
true, then the descriptive statistics of the model increases the soundness of the model.
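
As a rough illustration of such a significance check, the sketch below tests whether a hypothetical classifier that gets 124 of 200 test records right is doing better than chance, using a one-sided normal approximation to the binomial test. The counts are invented and are not tied to any particular model from the text.

    import math

    n_test = 200        # number of labelled test records (assumed)
    n_correct = 124     # records the mined model classified correctly (assumed)
    p_chance = 0.5      # accuracy expected from random guessing on two classes

    mean = n_test * p_chance
    std = math.sqrt(n_test * p_chance * (1 - p_chance))
    z = (n_correct - mean) / std
    p_value = 0.5 * math.erfc(z / math.sqrt(2))   # P(X >= n_correct) under chance

    print(f"accuracy = {n_correct / n_test:.2f}, z = {z:.2f}, p = {p_value:.4f}")
    if p_value < 0.05:
        print("result is statistically significant at the 5% level")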

2. Machine Learning
Machine learning investigates how computers can learn (or improve their performance) based
on data. A main research area is for computer programs to automatically learn to recognize
complex patterns and make intelligent decisions based on data. For example, a typical
machine learning problem is to program a computer so that it can automatically recognize
handwritten postal codes on mail after learning from a set of examples. Machine learning is a
fast-growing discipline.

2a. Supervised learning is basically a synonym for classification. The supervision in the
learning comes from the labeled examples in the training data set. For example, in the postal
code recognition problem, a set of handwritten postal code images and their corresponding
machine-readable translations are used as the training examples, which supervise the learning
of the classification model.
2b. Unsupervised learning is essentially a synonym for clustering. The learning process is
unsupervised since the input examples are not class labeled. Typically, we may use clustering
to discover classes within the data. For example, an unsupervised learning method can take,
as input, a set of images of handwritten digits. Suppose that it finds 10 clusters of data. These
clusters may correspond to the 10 distinct digits of 0 to 9, respectively. However, since the
training data are not labeled, the learned model cannot tell us the semantic meaning of the
clusters found.


2c. Semi-supervised learning is a class of machine learning techniques that make use of
both labeled and unlabeled examples when learning a model. In one approach, labelled
examples are used to learn class models and unlabeled examples are used to refine the
boundaries between classes. For a two-class problem, we can think of the set of examples
belonging to one class as the positive examples and those belonging to the other class as the
negative examples.

2d. Active learning is a machine learning approach that lets users play an active role in the
learning process. An active learning approach can ask a user (e.g., a domain expert) to label
an example, which may be from a set of unlabeled examples or synthesized by the learning
program. The goal is to optimize the model quality by actively acquiring knowledge from
human users, given a constraint on how many examples they can be asked to label.

3. Database Systems and Data Warehouses


Database systems research focuses on the creation, maintenance, and use of databases for
organizations and end-users. Particularly, database systems researchers have established
highly recognized principles in data models, query languages, query processing and
optimization methods, data storage, and indexing and accessing methods. Database systems
are often well known for their high scalability in processing very large, relatively structured
data sets.

Many data mining tasks need to handle large data sets or even real-time, fast streaming data.
Therefore, data mining can make good use of scalable database technologies to achieve high
efficiency and scalability on large data sets. Moreover, data mining tasks can be used to
extend the capability of existing database systems to satisfy advanced users’ sophisticated
data analysis requirements.

Recent database systems have built systematic data analysis capabilities on database data
using data warehousing and data mining facilities. A data warehouse integrates data
originating from multiple sources and various timeframes. It consolidates data in
multidimensional space to form partially materialized data cubes. The data cube model not
only facilitates OLAP in multidimensional databases but also promotes multidimensional
data mining.

The following section describes data mining functionalities and the kinds of patterns they can discover: characterization, discrimination, association and correlation analysis, classification, prediction, clustering, and evolution analysis (with examples of each data mining functionality, using a real-life database).
1. Concept/class description: characterization and discrimination
Data can be associated with classes or concepts. A concept/class description describes a given set of data in a concise and summarized manner, presenting interesting general properties of the data. These descriptions can be derived via
1. data characterization, by summarizing the data of the class under study (often called
the target class)
2. data discrimination, by comparison of the target class with one or a set of comparative
classes
3. both data characterization and discrimination

a. Data characterization
It is a summarization of the general characteristics or features of a target class of data.
Example: A data mining system should be able to produce a description summarizing the
characteristics of a student who has obtained more than 75% in every semester; the result
could be a general profile of the student.

b. Data Discrimination
It is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes.

The general features of students with high GPAs may be compared with the general features of students with low GPAs. The resulting description could be a general comparative profile of the students, such as: 75% of the students with high GPAs are fourth-year computing science students, while 65% of the students with low GPAs are not.
The output of data characterization can be presented in various forms. Examples include pie
charts, bar charts, curves, multidimensional data cubes, and multidimensional tables,
including crosstabs. The resulting descriptions can also be presented as generalized relations,
or in rule form called characteristic rules.

2. Mining Frequent Patterns, Association and Correlations


It is the discovery of association rules showing attribute-value conditions that occur
frequently together in a given set of data. For example, a data mining system may find
association rules like
major(X, "computing science") ⇒ owns(X, "personal computer")
[support = 12%, confidence = 98%]
where X is a variable representing a student. The rule indicates that of the students under
study, 12% (support) major in computing science and own a personal computer. There is a
98% probability (confidence, or certainty) that a student in this group owns a personal
computer.
Example:
A grocery store retailer wants to decide whether to put bread on sale. To help determine the impact of this decision, the retailer generates association rules that show what other products are frequently purchased with bread. He finds that 60% of the time that bread is sold, pretzels are also sold, and that 70% of the time jelly is also sold. Based on these facts, he tries to capitalize on the
association between bread, pretzels, and jelly by placing some pretzels and jelly at the end of
the aisle where the bread is placed. In addition, he decides not to place either of these items
on sale at the same time.
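
The support and confidence figures used in such rules are simple counts over the transaction data. The sketch below, with made-up baskets, shows how support and confidence for a rule such as {bread} => {jelly} would be computed; it only illustrates the two measures, not a rule-mining algorithm itself.

    baskets = [
        {"bread", "jelly", "pretzels"},
        {"bread", "jelly"},
        {"bread", "pretzels"},
        {"milk", "jelly"},
        {"bread", "jelly", "milk"},
    ]

    antecedent, consequent = {"bread"}, {"jelly"}

    n_total = len(baskets)
    n_antecedent = sum(1 for b in baskets if antecedent <= b)
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)

    support = n_both / n_total          # fraction of all baskets containing bread AND jelly
    confidence = n_both / n_antecedent  # fraction of bread baskets that also contain jelly

    print(f"support = {support:.0%}, confidence = {confidence:.0%}")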
3. Classification and prediction
Classification:
- It predicts categorical class labels
- It classifies data (constructs a model) based on the training set and the values (class
labels) in a classifying attribute and uses it in classifying new data

Typical Applications
- credit approval
- target marketing
- medical diagnosis
- treatment effectiveness analysis


Classification can be defined as the process of finding a model (or function) that describes
and distinguishes data classes or concepts, for the purpose of being able to use the model to
predict the class of objects whose class label is unknown. The derived model is based on the
analysis of a set of training data (i.e., data objects whose class label is known).
Example:
An airport security screening station is used to determine if passengers are potential terrorists or criminals. To do this, the face of each passenger is scanned and its basic pattern (distance between eyes, size and shape of mouth, head, etc.) is identified. This pattern is compared to entries in a database to see if it matches any patterns that are associated with known offenders.
A classification model can be represented in various forms, such as
1) IF-THEN rules,
student (class, "undergraduate") AND concentration (level, "high") ==> class A
student (class, "undergraduate") AND concentration (level, "low") ==> class B
student (class, "postgraduate") ==> class C
2) Decision tree
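
The IF-THEN rules above translate almost directly into code. A minimal sketch, with assumed attribute names, of applying those rules to classify a new student record:

    def classify(student):
        # Each IF-THEN rule is checked in turn; the first rule that fires gives the class.
        if student["class"] == "undergraduate" and student["concentration"] == "high":
            return "A"
        if student["class"] == "undergraduate" and student["concentration"] == "low":
            return "B"
        if student["class"] == "postgraduate":
            return "C"
        return "unknown"    # no rule covers this record

    print(classify({"class": "undergraduate", "concentration": "high"}))   # -> A
    print(classify({"class": "postgraduate", "concentration": "low"}))     # -> C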

Prediction:
Finding some missing or unavailable data values rather than class labels is referred to as prediction. Although prediction may refer to both data value prediction and class label
prediction, it is usually confined to data value prediction and thus is distinct from
classification. Prediction also encompasses the identification of distribution trends based on
the available data.
Example:
Predicting flooding is a difficult problem. One approach uses monitors placed at various points in the river. These monitors collect data relevant to flood prediction: water level, rainfall, time, humidity, etc. The water level at a potential flooding point in the river can then be predicted based on the data collected by the sensors upriver from this point. The prediction
must be made with respect to the time the data were collected.

Classification vs. Prediction


Classification differs from prediction in that the former is to construct a set of models (or
functions) that describe and distinguish data class or concepts, whereas the latter is to predict
some missing or unavailable, and often numerical, data values. Their similarity is that they
are both tools for prediction: Classification is used for predicting the class label of data
objects and prediction is typically used for predicting missing numerical data values.
4. Clustering analysis
Clustering analyzes data objects without consulting a known class label.
The objects are clustered or grouped based on the principle of maximizing the intraclass
similarity and minimizing the interclass similarity.

- Each cluster that is formed can be viewed as a class of objects.


Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that group similar events together.

Example:
A certain national department store chain creates special catalogs targeted to various
demographic groups based on attributes such as income, location and physical characteristics
of potential customers (age, height, weight, etc). To determine the target mailings of the
various catalogs and to assist in the creation of new, more specific catalogs, the company
performs a clustering of potential customers based on the determined attribute values. The results of the clustering exercise are then used by management to create special catalogs and distribute them to the correct target population based on the cluster for that catalog.
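
A minimal sketch of this idea, using scikit-learn's K-means on a handful of invented customer attributes (income and age only); the library, the values, and the number of clusters are all assumptions for illustration.

    from sklearn.cluster import KMeans
    import numpy as np

    # columns: annual income (thousands), age  -- synthetic values
    customers = np.array([
        [30, 22], [35, 25], [32, 24],    # younger, lower-income customers
        [90, 45], [95, 50], [88, 48],    # older, higher-income customers
    ])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
    print(kmeans.labels_)            # cluster id assigned to each customer
    print(kmeans.cluster_centers_)   # one prototype ("profile") per cluster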

Classification vs. Clustering


- In general, in classification you have a set of predefined classes and want to know
which class a new object belongs to.
- Clustering tries to group a set of objects and find whether there is some
relationship between the objects.
- In the context of machine learning, classification is supervised learning and
clustering is unsupervised learning.
5. Outlier Analysis
A database may contain data objects that do not comply with the general model of the data. These data objects are outliers. In other words, data objects which do not fall within any cluster are called outlier data objects. Noisy data or exceptional data are also called outlier data. The analysis of outlier data is referred to as outlier mining.
Example
Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of
extremely large amounts for a given account number in comparison to regular charges
incurred by the same account. Outlier values may also be detected with respect to the location
and type of purchase, or the purchase frequency.
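
One very simple way to flag such outliers is a z-score style rule: mark any charge that lies more than a few standard deviations from the account's mean. The sketch below uses invented amounts and a threshold of two standard deviations purely for illustration; real fraud detection uses far richer models.

    import statistics

    charges = [42.0, 55.5, 38.0, 61.0, 47.5, 52.0, 1250.0]   # last charge is suspicious

    mean = statistics.mean(charges)
    std = statistics.stdev(charges)

    outliers = [x for x in charges if abs(x - mean) > 2 * std]
    print(outliers)   # -> [1250.0]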
Challenges of Data Mining: There are several important challenges in applying
data mining techniques to large data sets:
 Scalability
Scalable techniques are needed to handle the massive size of some of the datasets that are
now being created. As an example, such datasets typically require the use of efficient
methods for storing, indexing, and retrieving data from secondary or even tertiary storage
systems. Furthermore, parallel or distributed computing approaches are often necessary if
the desired data mining task is to be performed in a timely manner. While such techniques
can dramatically increase the size of the datasets that can be handled, they often require
the design of new algorithms and data structures.


 Dimensionality
In some application domains, the number of dimensions (or attributes of a record) can be very large, which makes the data difficult to analyze because of the 'curse of dimensionality'.
For example, in bioinformatics, the development of advanced microarray technologies
allows us to analyze gene expression data with thousands of attributes. The
dimensionality of a data mining problem may also increase substantially due to the
temporal, spatial, and sequential nature of the data.

 Complex Data
Traditional statistical methods often deal with simple data types such as continuous and
categorical attributes. However, in recent years, more complicated types of structured and
semi-structured data have become more important. One example of such data is graph-based data representing the linkages of web pages, social networks, or chemical
structures. Another example is the free-form text that is found on most web pages.
Traditional data analysis techniques often need to be modified to handle the complex
nature of such data.
 Data Quality
Many data sets have one or more problems with data quality, e.g., some values may be erroneous or inexact, or there may be missing values. As a result, even if a 'perfect' data mining algorithm is used to analyze the data, the information discovered may still be
incorrect. Hence, there is a need for data mining techniques that can perform well when
the data quality is less than perfect.

 Data Ownership and Distribution


For a variety of reasons, e.g., privacy and ownership, some collections of data are
distributed across a number of sites. In many such cases, the data cannot be centralized,
and thus, the choice is either distributed data mining or no data mining. Challenges
involved in developing distributed data mining solutions include the need for efficient
algorithms to cope with the distributed and possibly, heterogeneous data sets, the need to
minimize the cost of communication, and the need to accommodate data security and data
ownership policies.

Data warehousing:
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile
collection of data in support of management’s decision making process

The four keywords—subject-oriented, integrated, time-variant, and non-volatile—distinguish data warehouses from other data repository systems, such as relational database systems,
transaction processing systems, and file systems.
 Subject-oriented:
 A data warehouse is organized around major subjects such as customer, supplier,
product, and sales.
 Rather than concentrating on the day-to-day operations and transaction processing of
an organization, a data warehouse focuses on the modeling and analysis of data for
decision makers.
 Hence, data warehouses typically provide a simple and concise view of particular
subject issues by excluding data that are not useful in the decision support process.

 Integrated:
 A data warehouse is usually constructed by integrating multiple heterogeneous
sources, such as relational databases, flat files, and online transaction records.
 Data cleaning and data integration techniques are applied to ensure consistency in
naming conventions, encoding structures, attribute measures, and so on.
 Time-variant:
 Data are stored to provide information from a historical perspective (e.g., the past 5–10 years).
 Every key structure in the data warehouse contains, either implicitly or explicitly, a
time element.
 Non-volatile:
 A data warehouse is always a physically separate store of data transformed from the
application data found in the operational environment.
 A data warehouse does not require transaction processing, recovery, and concurrency
control mechanisms.
 It usually requires only two operations in data accessing: initial loading of data and
access of data.

Organizations use the information from data warehouses for:

1. Increasing customer focus, which includes the analysis of customer buying patterns
(such as buying preference, buying time, budget cycles, and appetites for spending);
2. Repositioning products and managing product portfolios by comparing the
performance of sales by quarter, by year, and by geographic regions in order to fine-
tune production strategies;
3. Analyzing operations and looking for sources of profit; and
4. Managing customer relationships, making environmental corrections, and managing
the cost of corporate assets.

Data Warehouse: A Multi-tier Architecture


Data warehouses often adopt a three-tier architecture, as presented in Figure 4.

Tier 1: Bottom Layer


 The bottom tier is a warehouse database server that is almost always a relational
database system. Back-end tools and utilities are used to feed data into the bottom-tier
from operational databases or other external sources (e.g., customer profile information
provided by external consultants).
These tools and utilities perform data extraction, cleaning, and transformation (e.g., to
merge similar data from different sources into a unified format), as well as load and
refresh functions to update the data warehouse.
 The data are extracted using application program interfaces known as gateways. A
gateway is supported by the underlying DBMS and allows client programs to generate
SQL code to be executed at a server.
o Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding Database) by Microsoft, and JDBC (Java Database Connectivity).
 This tier also contains a metadata repository, which stores information about the data
warehouse and its contents.


Figure 4: Three-tier data warehouse architecture

Tier 2: Middle Layer

The middle tier is an OLAP server that is typically implemented using either
i. Relational OLAP(ROLAP) model (i.e., an extended relational DBMS that maps
operations on multidimensional data to standard relational operations); or
ii. A multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that
directly implements multidimensional data and operations).

Tier 3: Top Layer

The top tier is a front-end client layer, which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).


Metadata Repository
 Metadata are data about data.
 Metadata are the data that define warehouse objects.
 Metadata are created for the data names and definitions of the given warehouse.
 Additional metadata are created and captured for timestamping any extracted data, the
source of the extracted data, and missing fields that have been added by data cleaning or
integration processes.
 A metadata repository should contain the following:
i. Description of the data warehouse structure, which includes the warehouse
schema, view, dimensions, hierarchies, and derived data definitions, as well as data
mart locations and contents.
ii. Operational metadata, which include data lineage (history of migrated data & the
sequence of transformations applied to it), currency of data (active, archived, or
purged), and monitoring information (warehouse usage statistics, error reports, &
audit trails).
iii. Algorithms used for summarization, which include measure and dimension
definition algorithms, data on granularity, partitions, subject areas, aggregation,
summarization, and predefined queries and reports.
iv. Mapping from the operational environment to the data warehouse, which includes
source databases and their contents, gateway descriptions, data partitions, data
extraction, cleaning, transformation rules and defaults, data refresh and purging rules,
and security (user authorization and access control).
v. Data related to system performance, which include indices and profiles that
improve data access and retrieval performance, in addition to rules for the timing &
scheduling of refresh, update, and replication cycles.
vi. Business metadata, which include business terms and definitions, data ownership
information, and charging policies.

Data Warehouse Models:


From the architecture point of view, there are three data warehouse models: the enterprise
warehouse, the data mart, and the virtual warehouse.
1. Enterprise warehouse:
i. An enterprise warehouse collects all of the information about subjects spanning the
entire organization.
ii. It provides corporate-wide data integration, usually from one or more operational
systems or external information providers, and is cross-functional in scope.
iii. It typically contains detailed data as well as summarized data, and can range in size
from a few gigabytes to hundreds of gigabytes, terabytes, or beyond. An enterprise
data warehouse may be implemented on traditional mainframes, computer super-
servers, or parallel architecture platforms.
iv. It requires extensive business modeling and may take years to design and build.
2. Data mart:
i. A data mart contains a subset of corporate-wide data that is of value to a specific
group of users. The scope is confined to specific selected subjects.
a. For example, a marketing data mart may confine its subjects to customer,
item, and sales.


ii. The data contained in data marts tend to be summarized. Data marts are usually
implemented on low-cost departmental servers that are Unix / Linux or Windows
based.
iii. The implementation cycle of a data mart is more likely to be measured in weeks
rather than months or years. However, it may involve complex integration in the long
run if its design and planning were not enterprise-wide.
iv. Depending on the source of data, data marts can be categorized as independent or
dependent.
a. Independent data marts are sourced from data captured from one or more
operational systems or external information providers, or from data generated
locally within a particular department or geographic area.
b. Dependent data marts are sourced directly from enterprise data warehouses.
3. Virtual warehouse:
i. A virtual warehouse is a set of views over operational databases.
ii. For efficient query processing, only some of the possible summary views may be
materialized.
iii. A virtual warehouse is easy to build but requires excess capacity on operational
database servers.

Top-down development of an enterprise warehouse serves as a systematic solution and minimizes integration problems. However, it is expensive, takes a long time to develop, and
lacks flexibility due to the difficulty in achieving consistency and consensus for a common
data model for the entire organization.
The bottom-up approach to the design, development, and deployment of independent data marts provides flexibility, low cost, and rapid return on investment. It, however, can lead to
problems when integrating various disparate data marts into a consistent enterprise data
warehouse.
A recommended method for the development of data warehouse systems is to implement the
warehouse in an incremental and evolutionary manner, as shown in figure-5.

Figure 5: Recommended approach for data warehouse development.


First, a high-level corporate data model is defined within a reasonably short period (such
as one or two months) that provides a corporate-wide, consistent, integrated view of data
among different subjects and potential usages. This high-level model, although it will need to
be refined in the further development of enterprise data warehouses and departmental data
marts, will greatly reduce future integration problems.
Second, independent data marts can be implemented in parallel with the enterprise
warehouse based on the same corporate data model set noted before.
Third, distributed data marts can be constructed to integrate different data marts via hub
servers.
Finally, a multitier data warehouse is constructed where the enterprise warehouse is the
sole custodian of all warehouse data, which is then distributed to the various dependent data
marts.
Extraction, Transformation, and Loading (ETL)
• The ETL process involves extracting, transforming and loading data from multiple source
systems.
• In practice, the process is much more complex and tedious, and may require significant resources to implement.
• Different data-sources tend to have
→ different conventions for coding information &
→ different standards for the quality of information
• Building an ODS requires data filtering, data cleaning and integration.
• Data-errors at least partly arise because of unmotivated data-entry staff.

Successful implementation of an ETL system involves resolving the following issues:
1) What are the source systems? These systems may include relational database systems,
legacy systems.
2) To what extent are the source systems and the target system interoperable? The more
different the sources and target, the more complex the ETL process.
3) What ETL technology will be used?
4) How big is the ODS likely to be at the beginning and in the long term? Database systems
tend to grow with time. Consideration may have to be given to whether some of the data
→ from the ODS will be archived regularly as the data becomes old and
→ is no longer needed in the ODS
5) How frequently will the ODS be refreshed or reloaded?
6) How will the quality and integrity of the data be monitored? Data cleaning will often be required to deal with issues like missing values, data formats, code values, primary keys, and referential integrity.
7) How will a log be maintained? A dispute may arise about the origin of some data. It is
therefore necessary to be able to not only log which information came from where but also
when the information was last updated.
8) How will recovery take place?
9) Would the extraction process only copy data from the source systems and not delete the
original data?
10) How will the transformation be carried out? Transformation could be done in either
source OLTP system, ODS or staging area.
11) How will data be copied from non-relational legacy systems that are still operational?


ETL FUNCTIONS
• The ETL process consists of
→ data extraction from source systems
→ data transformation which includes data cleaning and
→ data loading in the ODS or the data warehouse
• Data cleaning deals with detecting & removing errors/inconsistencies from the data, in
particular the data that is sourced from a variety of computer systems.
• Building an integrated database from a number of source-systems may involve solving
some of the following problems:

1. Instance Identity Problem


 The same customer may be represented slightly differently in different source-
systems.
 There is a possibility of mismatching between the different systems that needs to
be identified & corrected.
 Checking for homonyms & synonyms is necessary.
 Achieving very high consistency in names & addresses requires a huge amount of
resources.

2. Data Errors
• Following are different types of data errors
→ data may have some missing attribute values
→ there may be duplicate records
→ there may be wrong aggregations
→ there may be inconsistent use of nulls, spaces and empty values
→ some attribute values may be inconsistent(i.e. outside their domain)
→ there may be non-unique identifiers
3. Record Linkage Problem
 This deals with problem of linking information from different databases that
relates to the same customer.
 Record linkage can involve a large number of record comparisons to ensure
linkages that have a high level of accuracy.

Figure 6: The ETL process.


4. Semantic Integration Problem


This deals with integration of information found in heterogeneous OLTP & legacy
sources.
→ Some of the sources may be relational.
→ Some may be even in text documents
→ Some data may be character strings while others may be integers.
5. Data Integrity Problem
This deals with issues like → referential integrity → null values → domain of values

Data warehouse systems use back-end tools and utilities to populate and refresh their data.
These tools and utilities include the following functions:
1. Data extraction, which typically gathers data from multiple, heterogeneous, and
external sources.
2. Data cleaning, which detects errors in the data and rectifies them when possible.
3. Data transformation, which converts data from legacy or host format to
warehouse format.
4. Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions.
5. Refresh, which propagates the updates from the data sources to the warehouse.
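
A minimal pandas sketch of these functions, with invented rows and column names (a real system would extract from OLTP databases and legacy files rather than in-memory tables, and would load into the warehouse DBMS):

    import pandas as pd

    # Extraction: gather data from two heterogeneous sources (in-memory stand-ins here).
    legacy = pd.DataFrame({"cust": ["C1", "C1", "C2"],
                           "amount": [120.0, 120.0, 80.0],          # note the duplicate row
                           "date": ["2024-01-05", "2024-01-05", "2024-02-10"]})
    web = pd.DataFrame({"cust": ["C2", None],
                        "amount": [60.0, 45.0],
                        "date": ["2024-02-15", "2024-02-20"]})
    raw = pd.concat([legacy, web], ignore_index=True)

    # Cleaning: remove duplicate records and rows with a missing customer key.
    clean = raw.drop_duplicates().dropna(subset=["cust"]).copy()

    # Transformation: convert to warehouse format and summarize (monthly sales per customer).
    clean["date"] = pd.to_datetime(clean["date"])
    monthly = (clean.assign(month=clean["date"].dt.to_period("M"))
                    .groupby(["cust", "month"], as_index=False)["amount"].sum())

    # Load / refresh: in practice this table would be written into the warehouse;
    # printing stands in for the load step in this sketch.
    print(monthly)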

Comparison of OLTP and OLAP systems:

1. User & system orientation: OLTP is customer oriented and is used for query and transaction processing by clerks; OLAP is market oriented and is used for data analysis by knowledge workers (managers, analysts).
2. Database content: OLTP systems hold current data; OLAP systems hold large amounts of historic data.
3. Database design: OLTP adopts the ER model; OLAP uses star and snowflake schemas.
4. Access patterns: OLTP uses short, atomic transactions and requires concurrency control and recovery mechanisms; OLAP mostly involves read-only operations.
5. Size: OLTP databases range from gigabytes upward; OLAP warehouses are typically a terabyte or more.
6. Metric: OLTP is measured by transaction throughput; OLAP by query throughput and response time.
7. Update of records: continuous in OLTP; in batch mode in OLAP.
8. Queries and views: simple (detailed, flat relational) in OLTP; complex (summarized, multidimensional) in OLAP.
9. Focus: OLTP is data-in; OLAP is information-out.
10. Number of users: thousands for OLTP; hundreds for OLAP.
11. Number of records accessed: tens for OLTP; millions for OLAP.
12. Priority: high performance and high availability for OLTP; high flexibility and end-user autonomy for OLAP.
13. Operations: index/hash on primary key for OLTP; lots of scans for OLAP.


Data Cube: A Multidimensional Data Model


A data cube allows data to be modelled and viewed in multiple dimensions. It is defined by
dimensions and facts.

Dimensions are the perspectives or entities with respect to which an organization wants to
keep records.
For example, AllElectronics may create a sales data warehouse in order to keep
records of the store’s sales with respect to the dimensions time, item, branch, and location.
Each dimension may have a table associated with it, called a dimension table, which further
describes the dimension.
For example, a dimension table for item may contain the attributes item name, brand, and
type. Dimension tables can be specified by users or experts, or automatically generated and
adjusted based on data distributions.
A multidimensional data model is typically organized around a central theme, such as
sales. This theme is represented by a fact table. Facts are numeric measures. The fact table
contains the names of the facts, or measures, as well as keys to each of the related dimension
tables.
Although we usually think of cubes as 3-D geometric structures, in data warehousing
the data cube is n-dimensional. To gain a better understanding of data cubes and the
multidimensional data model, let’s start by looking at a simple 2-D data cube that is, in fact, a
table or spreadsheet for sales data from AllElectronics. In particular, we will look at the
AllElectronics sales data for items sold per quarter in the city of Vancouver. These data are
shown in Table 1. In this 2-D representation, the sales for Vancouver are shown with respect
to the time dimension (organized in quarters) and the item dimension (organized according to
the types of items sold). The fact or measure displayed is dollars sold (in thousands).

Table 1: 2-D view of sales data for AllElectronics according to time and item.

Now, suppose that we would like to view the sales data with a third dimension. For instance,
suppose we would like to view the data according to time and item, as well as location, for
the cities Chicago, New York, Toronto, and Vancouver. These 3-D data are shown in Table
2. The 3-D data in the table are represented as a series of 2-D tables. Conceptually, we may
also represent the same data in the form of a 3-D data cube, as in Figure 7.


Table 2: 3-D view of sales data for AllElectronics according to time, item, and location.

Tables 1 and 2 show the data at different degrees of summarization. In the data
warehousing research literature, a data cube like those shown in Figures 7 and 8 is
often referred to as a cuboid.

Figure 7: 3-D data cube representation of the data in Table 2, according to time, item, and location.

Figure 8: 4-D data cube representation of sales data, according to time, item, location, and
supplier.
The cuboid that holds the lowest level of summarization is called the base cuboid. For
example, the 4-D cuboid in Figure-8 is the base cuboid for the given time, item, location, and
supplier dimensions. Figure-7 is a 3-D (non-base) cuboid for time, item, and location,
summarized for all suppliers. The 0-D cuboid, which holds the highest level of
summarization, is called the apex cuboid. In our example, this is the total sales, or dollars
sold, summarized over all four dimensions. The apex cuboid is typically denoted by all.
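
The relationship between the base cuboid, higher-level cuboids, and the apex cuboid can be sketched with simple group-by aggregations. The rows below are invented and only three dimensions are used, but the idea carries over to any number of dimensions:

    import pandas as pd

    sales = pd.DataFrame({
        "time":     ["Q1", "Q1", "Q2", "Q2"],
        "item":     ["home ent.", "computer", "home ent.", "computer"],
        "location": ["Vancouver", "Vancouver", "Toronto", "Toronto"],
        "dollars_sold": [605, 825, 680, 952],
    })

    base  = sales.groupby(["time", "item", "location"])["dollars_sold"].sum()  # base cuboid (3-D)
    by_ti = sales.groupby(["time", "item"])["dollars_sold"].sum()              # 2-D cuboid, all locations
    apex  = sales["dollars_sold"].sum()                                        # apex cuboid: total sales

    print(by_ti)
    print("apex (all):", apex)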


Figure-9: Lattice of cuboids, making up a 4-D data cube for time, item, location, and supplier.
Each cuboid represents a different degree of summarization.

OLAP
• OLAP stands for Online Analytical Processing.
• This is primarily a software-technology concerned with fast analysis of enterprise
information.
• In other words, OLAP is the dynamic enterprise analysis required to create, manipulate,
animate & synthesize information from exegetical, contemplative and formulaic data
analysis models.
• Business-Intelligence(BI) is used to mean both data-warehousing and OLAP.
• In other words, BI is defined as a user-centered process of
→ exploring data, data-relationships and trends
→ thereby helping to improve overall decision-making.

MOTIVATIONS FOR USING OLAP


1) Understanding and Improving Sales
• For an enterprise that has many products and many channels for selling the products,
OLAP can assist in finding the most popular products & the most popular channels.
• In some cases, it may be possible to find the most profitable customers.
• Analysis of business-data can assist in improving the enterprise-business.
2) Understanding and Reducing Costs of doing Business
• OLAP can assist in
→ analyzing the costs associated with sales &
→ controlling the costs as much as possible without affecting sales
• In some cases, it may also be possible to identify expenditures that produce a high ROI
(return on investment).


FASMI CHARACTERISTICS OF OLAP SYSTEMS


Fast
• Most queries should be answered very quickly, perhaps within seconds.
• The performance of the system has to be like that of a search-engine.
• The data-structures must be efficient.
• The hardware must be powerful enough for
→ amount of data &
→ number of users
• One approach can be
→ pre-compute the most commonly queried aggregates and
→ compute the remaining aggregates on-the-fly

Analytic
• The system must provide rich analytic-functionality.
• Most queries should be answered without any programming.
• The system should be able to cope with any relevant queries for application & user.

Shared
• The system is
→ likely to be accessed only by a few business-analysts and
→ may be used by thousands of users
• Being a shared system, the OLAP software should provide adequate security for
confidentiality & integrity.
• Concurrency-control is obviously required if users are writing or updating data in
the database.

Multidimensional

• This is the basic requirement.


• OLAP software must provide a multidimensional conceptual view of the data.
• A dimension often has hierarchies that show parent/child relationships between the
members of dimensions. The multidimensional structure should allow such
hierarchies.

Information

• The system should be able to handle a large amount of input-data.


• The capacity of system to handle information and its integration with the data
warehouse may be critical.

Schemas for Multidimensional Data Models: Stars, Snowflakes, and Fact Constellations:
The entity-relationship data model is commonly used in the design of relational databases,
where a database schema consists of a set of entities and the relationships between them.
Such a data model is appropriate for online transaction processing.


A data warehouse, however, requires a concise, subject-oriented schema that facilitates online
data analysis. The most popular data model for a data warehouse is a multidimensional
model, which can exist in the form of a star schema, a snowflake schema, or a fact
constellation schema.
1. Star schema:
The most common modeling paradigm is the star schema, in which the data warehouse
contains
i. A large central table (fact table) containing the bulk of the data, with no
redundancy, and
ii. A set of smaller attendant tables (dimension tables), one for each dimension.
The schema graph resembles a starburst, with the dimension tables displayed in a radial
pattern around the central fact table.

Example: A star schema for AllElectronics sales is shown in Figure 10. Sales are considered
along four dimensions: time, item, branch, and location. The schema contains a central fact
table for sales that contains keys to each of the four dimensions, along with two measures:
dollars sold and units sold. To minimize the size of the fact table, dimension identifiers (e.g.,
time key and item key) are system-generated identifiers.

Notice that in the star schema, each dimension is represented by only one table, and each
table contains a set of attributes. For example, the location dimension table contains the
attribute set {location_key, street, city, province or state, country}. This constraint may
introduce some redundancy.

Figure-10: Star Schema for Sales Data Warehouse


Advantages of Star Schema:
i. Facts and Dimensions are clearly depicted.
ii. Dimension tables are relatively static.
iii. Easy to comprehend.
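
To make the structure concrete, the sketch below joins a toy fact table to two of its dimension tables on the system-generated keys and aggregates a measure, which is essentially what an OLAP-style query over a star schema does. Keys, values, and column names are invented; only two dimensions are shown for brevity.

    import pandas as pd

    item     = pd.DataFrame({"item_key": [1, 2], "type": ["phone", "security"]})
    location = pd.DataFrame({"location_key": [10, 20], "city": ["Vancouver", "Chicago"]})

    # Fact table: foreign keys to the dimensions plus the numeric measure.
    sales = pd.DataFrame({"item_key":     [1, 1, 2, 2],
                          "location_key": [10, 20, 10, 20],
                          "dollars_sold": [500, 300, 250, 400]})

    joined = sales.merge(item, on="item_key").merge(location, on="location_key")
    print(joined.groupby(["city", "type"])["dollars_sold"].sum())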


2. Snowflake Schema:
The snowflake schema is a variant of the star schema model, where some dimension tables
are normalized, thereby further splitting the data into additional tables. The resulting schema
graph forms a shape similar to a snowflake.

i. The major difference between the snowflake and star schema models is that the
dimension tables of the snowflake model may be kept in normalized form to
reduce redundancies.
ii. Such a table is easy to maintain and saves storage space. However, this space
savings is negligible in comparison to the typical magnitude of the fact table.
iii. Furthermore, the snowflake structure can reduce the effectiveness of browsing,
since more joins will be needed to execute a query. Consequently, the system
performance may be adversely impacted. Hence, although the snowflake schema
reduces redundancy, it is not as popular as the star schema in data warehouse
design.

Figure 11: Snowflake Schema for Sales Data Warehouse

Advantages of Snowflake Schema:

i. Represents dimension hierarchies directly by normalizing the tables.
ii. Easy to maintain and saves storage space.

Fact constellation:
 Sophisticated applications may require multiple fact tables to share dimension tables.
 This kind of schema can be viewed as a collection of stars, and hence is called a
galaxy schema or a fact constellation.

Example: Fact constellation. A fact constellation schema is shown in Figure 12. This schema
specifies two fact tables, sales and shipping. The sales table definition is identical to that of
the star schema (Figure-10). The shipping table has five dimensions, or keys—item key, time
key, shipper key, from location, and to location—and two measures—dollars cost and units
shipped. A fact constellation schema allows dimension tables to be shared between fact
tables. For example, the dimension tables for time, item, and location are shared between the
sales and shipping fact tables.
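A minimal sketch of the shared-dimension idea behind a fact constellation, with made-up sample rows: two fact tables (sales and shipping) both reference the same item dimension table.

```python
# A minimal sketch of a fact constellation in pandas: two fact tables
# sharing one dimension table (sample data made up).
import pandas as pd

item_dim = pd.DataFrame({"item_key": [100, 101],
                         "item_name": ["laptop", "printer"]})

sales_fact = pd.DataFrame({"item_key": [100, 101],
                           "dollars_sold": [1200.0, 300.0]})

shipping_fact = pd.DataFrame({"item_key": [100, 100],
                              "dollars_cost": [40.0, 55.0],
                              "units_shipped": [3, 4]})

# The shared dimension lets both facts be analysed by the same attributes.
print(sales_fact.merge(item_dim, on="item_key"))
print(shipping_fact.merge(item_dim, on="item_key"))
```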

Figure 12: Fact constellation schema of the sales and shipping data warehouse.


Star Schema vs Snowflake Schema

1. Ease of maintenance/change: A star schema has redundant data and hence is less easy to
   maintain and change; a snowflake schema has no redundancy, so it is easier to maintain
   and change.
2. Ease of use: Star schema queries have lower complexity and are easy to understand;
   snowflake queries are more complex and hence less easy to understand.
3. Query performance: A star schema has fewer foreign keys and hence shorter query
   execution time (faster); a snowflake schema has more foreign keys and hence longer
   query execution time (slower).
4. Type of data warehouse: A star schema is good for data marts with simple relationships
   (1:1 or 1:many); a snowflake schema is good for the data warehouse core, to simplify
   complex (many:many) relationships.
5. Joins: A star schema needs fewer joins; a snowflake schema needs a higher number of joins.
6. Dimension tables: A star schema contains only a single dimension table for each dimension;
   a snowflake schema may have more than one dimension table for each dimension.
7. When to use: When a dimension table contains a small number of rows, choose the star
   schema; when a dimension table is relatively big, snowflaking is better as it reduces space.
8. Normalization/de-normalization: In a star schema, both dimension and fact tables are in
   de-normalized form; in a snowflake schema, the dimension tables are normalized while the
   fact table remains de-normalized.
9. Data model: The star schema follows a top-down approach; the snowflake schema follows
   a bottom-up approach.

Dimensions: The Role of Concept Hierarchies


A concept hierarchy defines a sequence of mappings from a set of low-level concepts to
higher-level, more general concepts.
Consider a concept hierarchy for the dimension location. City values for location
include Vancouver, Toronto, New York, and Chicago. Each city, however, can be mapped to
the province or state to which it belongs.
For example, Vancouver can be mapped to British Columbia, and Chicago to Illinois.
The provinces and states can in turn be mapped to the country (e.g., Canada or the United
States) to which they belong. These mappings form a concept hierarchy for the dimension
location, mapping a set of low-level concepts (i.e., cities) to higher-level, more general
concepts (i.e., countries). This concept hierarchy is illustrated in Figure 13.
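A minimal sketch of this location concept hierarchy as plain Python dictionaries; the Vancouver and Chicago mappings come from the text, while the Toronto and New York mappings are added here for completeness.

```python
# A minimal sketch of the location concept hierarchy: city -> province/state -> country.
city_to_province = {
    "Vancouver": "British Columbia",
    "Toronto":   "Ontario",
    "New York":  "New York",
    "Chicago":   "Illinois",
}
province_to_country = {
    "British Columbia": "Canada",
    "Ontario":          "Canada",
    "New York":         "USA",
    "Illinois":         "USA",
}

def generalize(city: str) -> str:
    """Climb the hierarchy from the low-level concept (city) to the general one (country)."""
    return province_to_country[city_to_province[city]]

print(generalize("Chicago"))   # Chicago -> Illinois -> USA
```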
Many concept hierarchies are implicit within the database schema. For example,
suppose that the dimension location is described by the attributes number, street, city,
province or state, zip code, and country. These attributes are related by a total order, forming
a concept hierarchy such as “street < city < province or state < country.” This hierarchy is
shown in Figure 13(a). Alternatively, the attributes of a dimension may be organized in a
partial order, forming a lattice. An example of a partial order for the time dimension based on
the attributes day, week, month, quarter, and year is “day < {month < quarter; week} < year.”
This lattice structure is shown in Figure 13(b).
 A concept hierarchy that is a total or partial order among attributes in a database schema
is called a schema hierarchy.
 Concept hierarchies that are common to many applications (e.g., for time) may be
predefined in the data mining system. Data mining systems should provide users with the
flexibility to tailor predefined hierarchies according to their particular needs.
 Concept hierarchies may also be defined by discretizing or grouping values for a given
dimension or attribute, resulting in a set-grouping hierarchy. A total or partial order can
be defined among groups of values.
 There may be more than one concept hierarchy for a given attribute or dimension,
based on different user viewpoints.
 Concept hierarchies may be provided manually by system users, domain experts, or
knowledge engineers, or may be automatically generated based on statistical analysis
of the data distribution.
 Concept hierarchies allow data to be handled at varying levels of abstraction


Figure 13: Hierarchical and lattice structures of attributes in warehouse dimensions: (a) a
concept hierarchy for the location dimension; (b) a lattice for the time dimension. Due to
space limitations, not all hierarchy nodes are shown (ellipses indicate omitted nodes).

Why concept hierarchies are useful in data mining:


 Concept hierarchies define a sequence of mappings from a set of lower-level concepts
to higher-level, more general concepts and can be represented as a set of nodes
organized in a tree, in the form of a lattice, or as a partial order.
 They are useful in data mining because they allow the discovery of knowledge at
multiple levels of abstraction and provide the structure on which data can be
generalized (rolled-up) or specialized (drilled-down).
 Together, these operations allow users to view the data from different perspectives,
gaining further insight into relationships hidden in the data.
 Generalizing has the advantage of compressing the data set, and mining on a
compressed data set will require fewer I/O operations. This will be more efficient than
mining on a large, uncompressed data set.


Measures: Their Categorization and Computation


To understand how measures are computed, we should first study how measures can be
categorized. Note that a multidimensional point in the data cube space can be defined by a
set of dimension–value pairs; for example, ⟨time = “Q1”, location = “Vancouver”,
item = “computer”⟩.
 A data cube measure is a numeric function that can be evaluated at each point in the
data cube space.
 A measure value is computed for a given point by aggregating the data corresponding
to the respective dimension–value pairs defining the given point.
Measures can be organized into three categories—
1. Distributive,
2. Algebraic, and
3. Holistic, based on the kind of aggregate functions used.
1. Distributive:
i. An aggregate function is distributive if it can be computed in a distributed manner as
follows.
a) Suppose the data are partitioned into n sets.
b) We apply the function to each partition, resulting in n aggregate values.
c) If the result derived by applying the function to the n aggregate values is the same as
that derived by applying the function to the entire data set (without partitioning), the
function can be computed in a distributed manner.
For example, sum() can be computed for a data cube by first partitioning the cube into a set
of sub-cubes, computing sum() for each sub-cube, and then summing the partial sums obtained
for each sub-cube. Hence, sum() is a distributive aggregate function.

For the same reason, count(), min(), and max() are distributive aggregate functions.
 A measure is distributive if it is obtained by applying a distributive aggregate function.
 Distributive measures can be computed efficiently because of the way the computation
can be partitioned.
2. Algebraic:
i. An aggregate function is algebraic if it can be computed by an algebraic function
with M arguments (where M is a bounded positive integer), each of which is obtained by
applying a distributive aggregate function.
 For example, avg() (average) can be computed by sum()/count(), where both sum() and
count() are distributive aggregate functions.
 Similarly, it can be shown that min_N() and max_N() (which find the N minimum and N
maximum values, respectively, in a given set) and standard_deviation() are algebraic
aggregate functions.
 A measure is algebraic if it is obtained by applying an algebraic aggregate function.
3. Holistic:
i. An aggregate function is holistic if there is no constant bound on the storage size
needed to describe a sub-aggregate. That is, there does not exist an algebraic
function with M arguments (where M is a constant) that characterizes the
computation.
 Common examples of holistic functions include median(), mode(), and rank().
 A measure is holistic if it is obtained by applying a holistic aggregate function.
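The three categories can be contrasted on a toy, partitioned data set (values made up): a distributive measure is assembled from per-partition results, an algebraic measure from a bounded number of distributive sub-aggregates, and a holistic measure in general needs all the values together.

```python
# A minimal sketch contrasting distributive, algebraic, and holistic measures.
import statistics

partitions = [[4, 8, 15], [16, 23], [42]]

# Distributive: sum() over the whole data set equals the sum of per-partition sums.
partial_sums = [sum(p) for p in partitions]
total_sum = sum(partial_sums)

# Algebraic: avg() is computable from a bounded number of distributive
# sub-aggregates (here M = 2: the partial sums and the partial counts).
partial_counts = [len(p) for p in partitions]
average = total_sum / sum(partial_counts)

# Holistic: median() cannot be characterized by a fixed number of sub-aggregates;
# in general, all the values must be brought together.
median = statistics.median(v for p in partitions for v in p)

print(total_sum, average, median)   # 108 18.0 15.5
```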


Typical OLAP Operations

In the multidimensional model, data are organized into multiple dimensions, and each
dimension contains multiple levels of abstraction defined by concept hierarchies. This
organization provides users with the flexibility to view data from different perspectives. A
number of OLAP data cube operations exist to materialize these different views, allowing
interactive querying and analysis of the data at hand. Hence, OLAP provides a user-friendly
environment for interactive data analysis.

OLAP operations: Each of the following operations is illustrated in Figure 14. At
the center of the figure is a data cube for AllElectronics sales. The cube contains the
dimensions location, time, and item, where location is aggregated with respect to city values,
time is aggregated with respect to quarters, and item is aggregated with respect to item types.
For better understanding, we refer to this cube as the central cube. The measure displayed is
dollars sold (in thousands). (For improved readability, only some of the cubes’ cell values are
shown.) The data examined are for the cities Chicago, New York, Toronto, and Vancouver.

1. Roll-up:

i. The roll-up operation (also called the drill-up operation by some vendors) performs
aggregation on a data cube, either by climbing up a concept hierarchy for a dimension
or by dimension reduction.
ii. Figure 14 shows the result of a roll-up operation performed on the central cube by
climbing up the concept hierarchy for location given in Figure 13.
This hierarchy was defined as the total order “street < city < province or state <
country.”
iii. The roll-up operation shown aggregates the data by ascending the location hierarchy
from the level of city to the level of country.
a. In other words, rather than grouping the data by city, the resulting cube groups
the data by country.
iv. When roll-up is performed by dimension reduction, one or more dimensions are
removed from the given cube.

2. Drill-down:

i. Drill-down is the reverse of roll-up. It navigates from less detailed data to more
detailed data.
ii. Drill-down can be realized by either stepping down a concept hierarchy for a
dimension or introducing additional dimensions.
iii. In Figure 14, drill-down occurs by descending the time hierarchy from the level of
quarter to the more detailed level of month.
iv. The resulting data cube details the total sales per month rather than summarizing them
by quarter.
v. Because a drill-down adds more detail to the given data, it can also be performed by
adding new dimensions to a cube.
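A minimal sketch of roll-up and drill-down using pandas groupby over a small fact table that has already been joined to its dimension attributes (sample data and column names are illustrative, not taken from Figure 14).

```python
# Roll-up and drill-down expressed as group-by operations at different
# levels of the concept hierarchies (toy data, made up).
import pandas as pd

sales = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA"],
    "city":    ["Vancouver", "Toronto", "Chicago", "New York"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "month":   ["Jan", "Feb", "Mar", "Apr"],
    "dollars_sold": [605.0, 825.0, 440.0, 680.0],
})

# Roll-up on location: climb the hierarchy from city up to country.
by_country = sales.groupby(["country", "quarter"])["dollars_sold"].sum()

# Drill-down on time: step from quarter down to the more detailed month level
# (possible only because the base data are kept at month granularity).
by_month = sales.groupby(["city", "month"])["dollars_sold"].sum()

print(by_country)
print(by_month)
```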


Figure 14: Examples of Typical OLAP operations on multi-dimensional data.


3. Slice and dice:
i. The slice operation performs a selection on one dimension of the given cube, resulting
in a sub-cube.
ii. Figure 14 shows a slice operation where the sales data are selected from the central
cube for the dimension time using the criterion time = “Q1.”
 The dice operation defines a sub-cube by performing a selection on two or more
dimensions.
o Figure 14 shows a dice operation on the central cube based on the following
selection criteria that involve three dimensions: (location = “Toronto” or
“Vancouver”) and (time = “Q1” or “Q2”) and (item = “home entertainment”
or “computer”).
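A minimal sketch of slice and dice as row selections in pandas; the selection criteria mirror the ones quoted above, while the sample table and its values are made up.

```python
# Slice and dice as selections on the dimensions of a small cube-like table.
import pandas as pd

sales = pd.DataFrame({
    "city":    ["Vancouver", "Toronto", "Chicago", "Vancouver"],
    "quarter": ["Q1", "Q2", "Q1", "Q3"],
    "item":    ["computer", "home entertainment", "phone", "computer"],
    "dollars_sold": [605.0, 825.0, 440.0, 680.0],
})

# Slice: selection on a single dimension (time = "Q1").
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: selection on two or more dimensions at once.
dice = sales[sales["city"].isin(["Toronto", "Vancouver"])
             & sales["quarter"].isin(["Q1", "Q2"])
             & sales["item"].isin(["home entertainment", "computer"])]

print(slice_q1)
print(dice)
```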


4. Pivot (rotate):

Pivot (also called rotate) is a visualization operation that rotates the data axes in view to
provide an alternative data presentation. Figure 14 shows a pivot operation where the item
and location axes in a 2-D slice are rotated. Other examples include rotating the axes in a 3-D
cube, or transforming a 3-D cube into a series of 2-D planes.
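A minimal sketch of pivot as a presentation change in pandas: the same 2-D slice is shown with item on the rows and location on the columns, and then with the two axes exchanged (sample values made up).

```python
# Pivot (rotate): swap the axes of a 2-D slice without changing the data.
import pandas as pd

slice_2d = pd.DataFrame({
    "item": ["computer", "computer", "phone", "phone"],
    "city": ["Toronto", "Vancouver", "Toronto", "Vancouver"],
    "dollars_sold": [605.0, 825.0, 440.0, 680.0],
})

view = slice_2d.pivot(index="item", columns="city", values="dollars_sold")
rotated = view.T   # rotate: exchange the item and location axes

print(view)
print(rotated)
```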

Five primitives for specifying a data mining task

1. Task-relevant data: This primitive specifies the data upon which mining is to be
performed. It involves specifying the database and tables or data warehouse containing
the relevant data, conditions for selecting the relevant data, the relevant attributes or
dimensions for exploration, and instructions regarding the ordering or grouping of the
data retrieved.

2. Knowledge type to be mined: This primitive specifies the specific data mining function
to be performed, such as characterization, discrimination, association, classification,
clustering, or evolution analysis. As well, the user can be more specific and provide
pattern templates that all discovered patterns must match. These templates or meta
patterns (also called meta rules or meta queries), can be used to guide the discovery
process.

3. Background knowledge: This primitive allows users to specify knowledge they have
about the domain to be mined. Such knowledge can be used to guide the knowledge
discovery process and evaluate the patterns that are found. Of the several kinds of
background knowledge, this chapter focuses on concept hierarchies.

4. Pattern interestingness measure: This primitive allows users to specify functions that
are used to separate uninteresting patterns from knowledge and may be used to guide the
mining process, as well as to evaluate the discovered patterns. This allows the user to
confine the number of uninteresting patterns returned by the process, as a data mining
process may generate a large number of patterns. Interestingness measures can be
specified for such pattern characteristics as simplicity, certainty, utility and novelty.

5. Visualization of discovered patterns: This primitive refers to the form in which discovered
patterns are to be displayed. In order for data mining to be effective in
conveying knowledge to users, data mining systems should be able to display the
discovered patterns in multiple forms such as rules, tables, cross tabs (cross-tabulations),
pie or bar charts, decision trees, cubes or other visual representations.

