
UNIT II Data Warehouse and OLAP Technology for Data Mining: Data Warehouse,
Multidimensional Data Model, Data Warehouse Architecture, Data Marts, Data Warehouse
Implementation, Further Development of Data Cube Technology, From Data Warehousing to
Data Mining, Data Cube Computation and Data Generalization, Attribute-Oriented Induction
Q) Write about Multi-Dimensional Modeling?
• Dimensional Modeling (DM) is a data structure technique optimized for data storage in a data warehouse.
• The purpose of dimensional modeling is to optimize the database for faster retrieval of data. The concept of Dimensional Modeling was developed by Ralph Kimball and consists of "fact" and "dimension" tables.
A dimensional model is designed to read, summarize, and analyze numeric information (values, balances, counts, weights, etc.) in a data warehouse.
A multidimensional model is a data structure technique used by data warehousing tools. It displays data in the form of a data cube, which allows the data to be modeled and viewed in multiple dimensions. The cube is defined through dimensions and facts.


Elements of Dimensional Data Model


Fact
Facts are the measurements/metrics from your business process. For a sales business process, a measurement would be the quarterly sales number.
Dimension
Dimensions provide the context surrounding a business process event. In simple terms, they give the who, what, and where of a fact:
Who – customer names
Where – location
What – product name
In other words, a dimension is a window to view information in the facts.
Attributes
Attributes are the various characteristics of a dimension in dimensional data modeling.
In the Location dimension, the attributes can be:
State
Country
Zip code, etc.
Attributes are used to search, filter, or classify facts. Dimension tables contain attributes.
Fact Table
A fact table is the primary table in dimensional modeling.
A fact table contains measurements/facts and foreign keys to the dimension tables.
Working on a Multidimensional Data Model
A multidimensional data model is built through a series of pre-defined stages. Every project should follow these stages when building a multidimensional data model:

Stage 1: Assembling data from the client: In the first stage, the correct data is collected from the client.
Stage 2: Grouping different segments of the system: In the second stage, all the data is recognized and classified into the respective sections it belongs to.
Stage 3: Noticing the different proportions: In this stage, the main factors are identified according to the user's point of view. These factors are also known as "dimensions".
Stage 4: Preparing the actual-time factors and their respective qualities: In the fourth stage, the factors recognized in the previous step are used to identify their related qualities. These qualities are also known as "attributes" in the database.
Stage 5: Finding the actuality of the factors listed previously and their qualities: In the fifth stage, the model separates and differentiates the actuality (the facts) from the factors it has collected.
Stage 6: Building the schema to place the data, with respect to the information collected from the steps above: In the sixth stage, a schema is built on the basis of the data collected previously.
Let us take the example of a factory that sells products each quarter in Bangalore. The data is represented in the table given below:

In the presentation given above, the factory's sales for Bangalore are shown along the time dimension (organized into quarters) and the item dimension.
If we wish to view the sales data along a third dimension, say location (like Kolkata, Delhi, and Mumbai), then each location's sales are represented as a two-dimensional table of item versus time. Here is the table:

This data can be represented conceptually in the form of three dimensions, as shown in the image below:
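To make the idea concrete, here is a minimal sketch in Python of holding such a cube as a 3-D array; the dimension values and sales numbers are illustrative assumptions, not data from the tables above:

    import numpy as np

    # A tiny location x quarter x item sales cube with made-up numbers.
    locations = ["Kolkata", "Delhi", "Mumbai"]
    quarters = ["Q1", "Q2", "Q3", "Q4"]
    items = ["keyboard", "mouse", "monitor"]
    cube = np.random.randint(100, 1000, size=(len(locations), len(quarters), len(items)))

    # Fixing one dimension gives back a 2-D table, e.g. the Q1 location x item table.
    q1_table = cube[:, quarters.index("Q1"), :]

    # Aggregating a dimension away gives higher-level totals, e.g. quarter x item.
    totals = cube.sum(axis=0)

Each 2-D table discussed above is simply one slice of this 3-D array with one coordinate fixed.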

Q) What are Schemas used in Data Warehouse Modeling?


Schema
Schema means the logical description of the entire database. It gives us a
brief idea about the link between different database tables through keys and
values.
In data warehouse modeling, we use the Star, Snowflake, and Galaxy schemas.
Key Concepts of Schemas:
1. Primary Key – An attribute in a relational database having unique values; there are no duplicate values.
2. Foreign Key – An attribute in a relational database that links one table to another by referring to the primary key of the other table.
3. Dimensions – Dimensions are the column names in a dimension table, and their attributes are sub-divided within the table. We use dimensions as a structured way of describing and labeling the information.
4. Measures – Quantitative attributes in the fact table.
5. Fact Table – A fact table contains dimension keys from the dimension tables and measures; the measures are used to perform calculations for analysis.
The dimension keys and measures describe the facts of the business processes. A fact table consists of measurements of our interest.
Example: Product_id, Date_id, No. of products.
Schema Definition
Multidimensional schemas can be defined using the Data Mining Query Language (DMQL). Using a multidimensional schema, we model data warehouse systems.
Star Schema
• Star Schema is the simplest schema. It has a fact table at its centre linked to dimension tables holding their attributes. It is also called the Star-Join Schema.
• It has primary key/foreign key relationships between the dimension tables and the fact table. It is de-normalized, meaning that normalization is not applied as it is for relational databases.
• Its characteristic is that each dimension is represented by only one dimension table.
DEFINING A STAR SCHEMA IN DMQL
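A representative definition, based on the well-known AllElectronics sales example; the cube, dimension, and attribute names here are illustrative assumptions rather than necessarily those of the original diagram:

    define cube sales_star [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)

Each "define dimension" statement declares one dimension table, and the "define cube" statement declares the fact table with its measures.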


ADVANTAGES:
1. Most suitable for query processing: view-only reporting applications show enhanced performance.
2. Simple queries.
3. Simplest and easiest to design.
DISADVANTAGES:
1. It does not support many-to-many relationships between business entities.
2. More data redundancy.
Snowflake Schema
• It is an extended version of the star schema in which the dimension tables are sub-divided further.
• This means there are multiple levels of dimension tables, because the dimensions are normalized.
• Normalization is a process that splits up data to avoid data redundancy. This process sub-divides the tables, so the number of tables increases.
• The Snowflake schema is nothing but a normalized Star schema.


ADVANTAGES:
1. Easy to maintain, due to the reduced data redundancy.
2. Saves storage space, and dimension tables are easier to update.
DISADVANTAGES:
1. Complex schema: source query joins are complex.
2. Query performance is not so good, because of the complex queries.
Galaxy Schema
• It consists of more than one fact table linked to the dimension tables holding their attributes. It is also called a fact constellation schema.
• Conformed dimensions are the dimension tables shared between the fact tables. We can normalize the dimensions in this schema further, but that will lead to a more complex design.
The following diagram shows Placement and Workshop as the two fact tables, with the Student and TPO dimension tables as the conformed dimensions.


ADVANTAGES:
1. Flexible schema.
2. Effective analysis and reporting.
DISADVANTAGES:
1. Huge dimension tables, which are difficult to manage.
2. Hard to maintain, because of the complex design and the many fact tables.
Q) What is data warehouse architecture?
Bill Inmon is widely recognized as “the father of data warehousing”.
He defines a data warehouse as:
"A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process."
Subject-oriented
The data in the warehouse is organized around subjects or topics rather than
the applications or source systems that generate the data.
Integrated: The data from each source system (e.g. CRM, ERP, behavioral data, or e-commerce platforms) is brought together and made consistent in the data warehouse.


Time-variant
Data in the warehouse is maintained over time, allowing for trend analysis,
forecasting, AI/ML, and historical reporting.
Non-volatile
Data written into the warehouse is not overwritten or deleted, ensuring the stability and reliability of the data, which is crucial for trustworthy analysis.
Data Warehouse Architecture
A data warehouse architecture is complex, as it contains historical and cumulative data from multiple sources.
There are three approaches for constructing the data warehouse layers:
single tier, two tier, and three tier.
Single-tier architecture
The objective of a single layer is to minimize the amount of data stored; the goal is to remove data redundancy.
Two-tier architecture
The two-layer architecture physically separates the available sources from the data warehouse.
Three-Tier Data Warehouse Architecture
This is the most widely used Architecture of Data Warehouse. It consists of
the Top, Middle and Bottom Tier.
Bottom Tier: The database of the data warehouse serves as the bottom tier. It is usually a relational database system.
Middle Tier: The middle tier is an OLAP server, implemented using either the ROLAP or the MOLAP model.
Top Tier: The top tier is a front-end client layer: the tools and APIs through which you connect to the data warehouse and get data out of it.


Data warehouse Components

There are mainly five Data Warehouse Components:


1. Data Warehouse Database
2. Sourcing, Acquisition, Clean-up and Transformation Tools (ETL)
3. Metadata
4. Query Tools
5. Data warehouse Bus Architecture
Data Warehouse Database
The central database is the foundation of the data warehousing
environment. This database is implemented on the RDBMS technology.
Approaches to the database in a data warehouse are listed below:
• In a data warehouse, relational databases are deployed in parallel to allow for scalability.
• New index structures are used to bypass relational table scans and improve speed.
• Multidimensional databases (MDDBs) are used to overcome limitations imposed by the relational data warehouse model.


Sourcing, Acquisition, Clean-up and Transformation Tools (ETL)


The data sourcing, transformation, and migration tools are used for
performing all the conversions, summarizations, and all the changes
needed to transform data into a unified format in the data warehouse.
They are also called Extract, Transform and Load (ETL) Tools.
Their functionality includes the following (a minimal sketch of such a pipeline follows the list):
• Preventing unwanted data in operational databases from being loaded into the data warehouse.
• Searching for and replacing common names and definitions for data arriving from different sources.
• Calculating summaries and derived data.
• Populating missing data with defaults.
• De-duplicating repeated data arriving from multiple data sources.
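As an illustration only, here is a minimal, hedged sketch of such a pipeline in Python; the source record layouts, field names, and default values are assumptions made for the example:

    # Hypothetical raw records extracted from two operational sources.
    source_a = [{"cust": "Ann", "sales": "120", "region": "south"}]
    source_b = [{"customer_name": "Ann", "amount": 120, "region": None}]

    def transform(record):
        """Map source-specific fields onto one unified format."""
        name = record.get("cust") or record.get("customer_name")
        amount = float(record.get("sales") or record.get("amount") or 0)
        region = record.get("region") or "unknown"  # populate missing data with a default
        return {"customer": name, "amount": amount, "region": region}

    def load(records):
        """De-duplicate repeated records before loading into the warehouse table."""
        unique = {}
        for r in map(transform, records):
            unique[(r["customer"], r["amount"])] = r  # later duplicates replace earlier ones
        return list(unique.values())

    warehouse_rows = load(source_a + source_b)  # one unified, de-duplicated row
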
Metadata
Metadata is data about data which defines the data warehouse. It is used for
building, maintaining and managing the data warehouse.
Metadata can be classified into following categories:
Technical Metadata: This kind of metadata contains information about the warehouse that is used by data warehouse designers and administrators.
Business Metadata: This kind of metadata contains detail that gives end users an easy way to understand the information stored in the data warehouse.
Query Tools
One of the primary objectives of data warehousing is to provide information to businesses so they can make strategic decisions. Query tools allow users to interact with the data warehouse system.
These tools fall into four different categories:
• Query and reporting tools
• Application development tools
• Data mining tools
• OLAP tools
1. Query and reporting tools
Query and reporting tools can be further divided into
• Reporting tools
• Managed query tools
2. Application development tools
3. Data mining tools
Data mining is a process of discovering meaningful new correlations, patterns, and trends by mining large amounts of data. Data mining tools are used to automate this process.
4. OLAP tools
These tools are based on the concepts of a multidimensional database. They allow users to analyze the data using elaborate and complex multidimensional views.
Data Marts
• A data mart is a simple form of a data warehouse that is focused on a single subject or line of business, such as sales, finance, or marketing.
• Given their focus, data marts draw data from fewer sources than data warehouses.
• For example, a company might store data from various sources, such as supplier information, orders, sensor data, employee information, and financial records, in its data warehouse. However, it stores information relevant to, for instance, the marketing department, such as social media reviews and customer records, in a data mart.
Advantages of Data mart:
• More trustworthy data.
• Easier access to data.
• Faster insights and decisions.


• Lower cost: data marts typically cost less to set up than a full data warehouse.
• Easier implementation and maintenance.
• Better support for short-term projects.
• Better data access control.
How does a data mart work?
• A data mart turns raw information into structured, meaningful content for a specific business department.
• To do this, data engineers set up a data mart to receive information either from a data warehouse or directly from external data sources.
• When it is connected to a data warehouse, the data mart retrieves a selection of information that is relevant to a business unit.
• Often, the information contains summarized data and excludes unnecessary or detailed data.
ETL
• Extract, transform, and load (ETL) is a process for integrating and transferring information from various data sources into a single physical database.
• Data marts use ETL to retrieve information from external sources when it does not come from a data warehouse.
• The process involves the following steps:
Extract: collect raw information from various sources.
Transform: structure the information into a common format.
Load: transfer the processed data to the database.
Analytics
Business analysts use software tools to retrieve, analyze, and represent data
from the data mart. For example, they use the information stored in data
marts for business intelligence analytics, reporting dashboards, and cloud
applications.


Types of Data Mart:


There are three main types of data mart:
Dependent data mart
A dependent data mart populates its storage with a subset of information
from a centralized data warehouse. The data warehouse gathers all the
information from data sources. Then, the data mart queries and retrieves
subject-specific information from the data warehouse.

Independent Data Mart
An independent data mart is created without the use of a central data warehouse. This kind of data mart is an ideal option for smaller groups within an organization.
An independent data mart has neither a relationship with the enterprise data warehouse nor with any other data mart. Its data is input separately, and its analyses are also performed autonomously.


Hybrid Data Mart


A hybrid data mart combines input from a data warehouse with input from other sources. This can be helpful when you need ad-hoc integration, such as after a new group or product is added to the organization.

Q) Difference between Data mart and Data warehouse

FACTOR: Type of Data
Data mart: Summarized, historical data (traditionally).
Data warehouse: Summarized, historical data (in traditional data warehouses).

FACTOR: Data Sources
Data mart: Fewer source systems, which are operationally focused.
Data warehouse: Wide variety of source systems from all across the enterprise.

FACTOR: Use Case / Scope
Data mart: Analyzing smaller data sets (typically <100 GB) focused on a particular subject to support analytics and business intelligence (BI).
Data warehouse: Analyzing large (typically 100+ GB), complex, enterprise-wide datasets to support data mining, BI, artificial intelligence, and machine learning.

FACTOR: Data Governance
Data mart: Easier, because data is already partitioned.
Data warehouse: Requires strict governance rules and systems to access data.


Q) Explain how to implement Data warehouse.


Data Warehouse Implementation is a series of operations required to build a
data warehouse that is completely functional. These operations follow the
steps of classifying, evaluating, and creating the data warehouse in line with
the requirements of the customer.
The process of putting a data warehouse into operation consists of a number
of distinct processes, the most important of which are planning, data
collection, data analysis, and business action. Important components of a
Data Warehouse, such as Data Marts, OLTP/OLAP, ETL, Metadata, and so on,
have to be defined at every stage of the system's implementation design
process.

The Steps Involved in Data Warehouse Implementation


Before we proceed with the implementation of the data warehouse, we need a plan for implementing it.
To create an effective data warehouse implementation plan, businesses need
to define clear objectives such as improving decision-making capabilities or
reducing operational costs.
They should then select the right technology stack based on these
objectives while considering factors like scalability and flexibility for future
growth needs.


In addition, identifying all relevant data sources is key to ensuring consistency across different departments or teams working with the same information.
The development of ETL processes involves extracting raw data from
multiple sources into one central location where it can be transformed into
usable insights through analytics tools such as machine learning algorithms
or predictive modeling techniques.
Once the data warehouse implementation plan is in place, we can move on with the data warehouse implementation. The steps involved are as follows:
1. Requirements analysis and capacity planning
2. Integration
3. Designing Schema
4. Modeling
5. Connecting Drivers
6. ETL Phase
1. Requirements Analysis and Capacity Planning
The first step in the process of data warehousing implementation is called
requirements analysis and capacity planning. This step involves determining
what the requirements of the business are, developing appropriate
architectures, performing any necessary capacity planning, and selecting the
appropriate hardware and software tools.
2. Integration
After the hardware and software have been selected, the second step is to
integrate the user software tools, server storage systems, and server
software systems.
3. Designing Schema
In the third stage, you design the schema and views for the warehouse, which is a significant modeling task.


4. Modeling
Physical modeling has the potential to significantly boost the performance of
data warehouses. Data partitioning, data location, access method selection,
indexing, and other similar aspects are all components that go into the
architecture of a physical data warehouse.
5. Connecting Drivers
It is most likely that the data warehouse will compile information from a wide variety of different sources. Finding the sources and connecting them requires the use of gateways, ODBC drivers, or other wrappers.
6. ETL Phase
This step requires the information to be extracted from the source system,
transformed, and then loaded.
The ETL tools are tested, which may require a staging environment. This is the start of the process of filling the data warehouse.
For data warehouses to be useful, end-user applications are required. The following phase entails designing and deploying user-facing software.
Steps to Data warehouse implementation
Data warehousing is considered one of the most important processes involved in collecting usable information for making business choices.
For a successful installation of the data warehouse system, there is a precise order in which certain steps need to be carried out:
1. The Preparatory Work
It is helpful because it outlines the steps that we need to take in order to
achieve the objectives and goals that we have stated. Getting buy-in from
within an organization is one of the most important factors in the success of
any endeavor.


2. Acquiring the Necessary Information
The process of compiling information from a variety of resources for the purposes of analysis and reporting is referred to as "data collection".
3. A Look at the Numbers
Following the gathering of data, the next step is data analysis: the process of developing actionable insights from a compiled set of data.
4. Business Steps
Insights and facts can be obtained from the analysis of data, which can then
be incorporated into the decision-making process.
The Components of Data Warehouse Implementation
1. Your data warehouse cannot function without a database as its core
component. In most cases, these are relational databases that are hosted
either locally or in the cloud
2. In the process of data integration, data is first extracted from its original source systems and then modified so that it can be quickly consumed for analytical purposes. This is carried out with the assistance of programs and services such as ETL (extract, transform, and load), ELT (extract, load, and transform), real-time data replication, bulk-load processing, data transformation, and data quality and enrichment services.
3. Metadata is information about information. It details the origin, purpose,
values, and other characteristics of your data warehouse's collections.
4. Users will be able to interact with the information housed within your data
warehouse by using access tools. Access tools can take many forms,
including query and reporting tools, application development tools, data
mining tools, and online analytical processing (OLAP) tools, to name a few
examples.


Q) Explain how to implement a Data Warehouse efficiently (cube computation, indexing, and query processing)?


Data warehouses contain huge volumes of data. OLAP servers demand that
decision support queries be answered in the order of seconds.
Therefore, it is crucial for data warehouse systems to support highly efficient
cube computation techniques, access methods, and query processing
techniques.
Efficient Data Cube Computation:
At the core of multidimensional data analysis is the efficient computation of aggregations across many sets of dimensions. In SQL terms, these aggregations are referred to as group-by's. Each group-by can be represented by a cuboid, and the set of group-by's forms a lattice of cuboids defining a data cube.
The compute cube Operator and the Curse of Dimensionality:
The compute cube operator computes aggregates over all subsets of the
dimensions specified in the operation. This can require excessive storage
space, especially for large numbers of dimensions.
A data cube is a lattice of cuboids. Suppose that you want to create a data
cube for “AllElectronics” sales that contains the following: city, item, year,
and sales in dollars. We want to analyze the data, with queries such as the
following:
“Compute the sum of sales, grouping by city and item.”
“Compute the sum of sales, grouping by city.”
“Compute the sum of sales, grouping by item.”
For the above example, the total number of cuboids is 2^3 = 8. The possible group-by's are the following: {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()}, where () means that the group-by is empty (i.e., the dimensions are not grouped). These group-by's form a lattice of cuboids for the data cube, as shown in the figure:


• The base cuboid contains all three dimensions: city, item, and year. It can return the total sales for any combination of the three dimensions.
• Pre-computation leads to fast response times, but it requires storage space. If all the cuboids in a data cube are pre-computed, an enormous amount of space is occupied. This problem is referred to as the "curse of dimensionality".
• If there were no hierarchies associated with each dimension, then the total number of cuboids for an n-dimensional data cube, as we have seen above, would be 2^n.
• However, in practice, many dimensions have hierarchies. For example, the dimension "time" has conceptual levels, such as in the hierarchy "day < month < quarter < year". For an n-dimensional data cube, the total number of cuboids that can be generated is:

$$T = \prod_{i=1}^{n} (L_i + 1)$$

where $L_i$ is the number of levels associated with dimension $i$. One is added to $L_i$ to account for the virtual top level, all (i.e., removing the dimension altogether).


Example
If the cube has 10 dimensions and each dimension has 4 levels, what will be the number of cuboids generated?
Solution
Here n = 10 and L_i = 4 for i = 1, 2, ..., 10.
Thus, the total number of cuboids = (4 + 1)^10 = 5^10 = 9,765,625 ≈ 9.8 × 10^6.
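This count is easy to verify programmatically; a minimal Python check of the formula above:

    from math import prod

    def total_cuboids(levels):
        # Each dimension i contributes (L_i + 1) choices: its L_i levels plus "all".
        return prod(l + 1 for l in levels)

    print(total_cuboids([4] * 10))  # 5**10 = 9765625, roughly 9.8 x 10^6
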
Computation of Selected Cuboids
There are three choices for data cube materialization given a base cuboid:
1. No materialization: Do not pre-compute any of the "non-base" cuboids.
2. Full materialization: Pre-compute all of the cuboids.
3. Partial materialization: Selectively compute a proper subset of the whole set of possible cuboids.
Indexing OLAP Data: Bitmap Index and Join Index:
Most data warehouse systems support index structures. Two index structures commonly used in OLAP are "bitmap indexing" and "join indexing".
Bitmap Index
A Bitmap Index is a type of indexing technique that uses bitmaps to
represent the presence or absence of values in a column.
It is particularly useful for low cardinality columns, where the number of
distinct values is relatively small.
Here's how Bitmap Index works:
• For each distinct value in the column, a bitmap is created.
• Each bit in the bitmap represents a row in the table.
• If a bit is set to 1, the corresponding row contains the value represented by the bitmap.
• If a bit is set to 0, the corresponding row does not contain the value.
Example:
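A minimal sketch of the idea in Python, using an assumed low-cardinality city column with illustrative values:

    # Build bitmaps for a low-cardinality column: one bit per row, per distinct value.
    rows = ["Delhi", "Mumbai", "Delhi", "Kolkata", "Mumbai"]  # illustrative data

    bitmaps = {}
    for i, city in enumerate(rows):
        bitmaps.setdefault(city, 0)
        bitmaps[city] |= 1 << i          # set bit i: row i contains this value

    # The equality query "city = Delhi" is a single bitmap lookup: bits 0 and 2 are set.
    print(bin(bitmaps["Delhi"]))         # 0b101

    # Bitwise AND/OR of bitmaps answers compound predicates without scanning the table.
    delhi_or_mumbai = bitmaps["Delhi"] | bitmaps["Mumbai"]
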


Advantages of Bitmap Indexing:
• Efficient for low-cardinality columns.
• Fast query performance for operations such as equality and range comparisons.
Join Index
A Join Index is a type of indexing technique used to optimize join operations
between multiple tables in OLAP systems.
Here's how Join Index works:
• It identifies frequently executed join operations and creates an index on the join columns.
• The index stores the pre-computed results of the join operation.
• When a query involves a join operation, the join index is used to retrieve the pre-computed results instead of performing the join operation again.
Advantages of Join Indexing:
1. Improved query performance for join operations.
2. Reduces the need for expensive join operations during query execution.


For example, join index tables can be built on the linkages between the sales fact table and the location and item dimension tables.
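A hedged sketch of that idea in Python: pre-compute which fact rows join to each dimension value (the table layout, key names, and data are assumptions for illustration):

    # Hypothetical fact rows (row-id -> record) referencing dimension keys.
    sales_fact = {
        0: {"item_key": "TV", "loc_key": "Delhi",  "amount": 300},
        1: {"item_key": "PC", "loc_key": "Mumbai", "amount": 900},
        2: {"item_key": "TV", "loc_key": "Mumbai", "amount": 250},
    }

    # Join index: dimension value -> row-ids of matching fact rows, built once
    # so the join need not be re-executed for every query.
    join_index_location = {}
    for rid, row in sales_fact.items():
        join_index_location.setdefault(row["loc_key"], []).append(rid)

    # "Join" sales with location = Mumbai via a single index lookup.
    mumbai_rows = [sales_fact[rid] for rid in join_index_location["Mumbai"]]
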
Efficient Processing of OLAP Queries
The purpose of materializing cuboids and constructing OLAP index structures
is to speed up query processing in data cubes. Given materialized views,
query processing should proceed as follows:
1. Determine which operations should be performed on the available cuboids: This involves transforming any selection, projection, roll-up (group-by), and drill-down operations specified in the query into corresponding SQL and/or OLAP operations.
2. Determine to which materialized cuboid(s) the relevant operations
should be applied: This involves identifying all of the materialized
cuboids that may potentially be used to answer the query, pruning the
set using knowledge of “dominance” relationships among the cuboids,
estimating the costs of using the remaining materialized cuboids, and
selecting the cuboid with the least cost.
Q) What is OLAP? Explain the different operations of OLAP.
Online Analytical Processing (OLAP) is software that allows users to analyze
information from multiple database systems at the same time.
It is a technology that enables analysts to extract and view business data
from different points of view.
Analysts frequently need to group, aggregate, and join data. With OLAP, data can be pre-calculated and pre-aggregated, making analysis faster.
OLAP databases are divided into one or more cubes. The cubes are designed in such a way that creating and viewing reports becomes easy.
Types of Operations on OLAP
There are four types of OLAP operations. These are listed below:
1. Roll up
2. Drill down
3. Slice and dice
4. Pivot
1) Roll-up:
Roll-up is also known as "consolidation" or "aggregation". The roll-up operation can be performed in two ways:
1. Reducing dimensions
2. Climbing up a concept hierarchy. A concept hierarchy is a system of grouping things based on their order or level.
Consider the following diagram:

• In this example, the cities New Jersey and Los Angeles are rolled up into the country USA.
• The sales figures of New Jersey and Los Angeles are 440 and 1560 respectively; they become 2000 after the roll-up.
• In this aggregation process, the data in the location hierarchy moves up from the city to the country.
• In the roll-up process, one or more dimensions may be removed. In this example, the Cities dimension is removed.


2) Drill-down
In drill-down, data is fragmented into smaller parts. It is the opposite of the roll-up process. It can be done via:
• Moving down the concept hierarchy
• Increasing (adding) a dimension

Consider the diagram above:
• Quarter Q1 is drilled down to the months January, February, and March. The corresponding sales are also registered.
• In this example, the month dimension is added.
3) Slice:
Here, one dimension is selected, and a new sub-cube is created.
The following diagram explains how the slice operation is performed:


• The Time dimension is sliced with Q1 as the filter.
• A new cube is created altogether.
Dice:
This operation is similar to slice. The difference is that in dice you select two or more dimensions, which results in the creation of a sub-cube.


4) Pivot
In pivot, you rotate the data axes to provide an alternative presentation of the data.
In the following example, the pivot is based on item types.
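All four operations can be sketched with pandas on a toy fact table; the column names and sales numbers below are illustrative assumptions:

    import pandas as pd

    # Illustrative fact data: sales by quarter, city, and item (numbers made up).
    df = pd.DataFrame({
        "quarter": ["Q1", "Q1", "Q2", "Q2"],
        "city":    ["Delhi", "Mumbai", "Delhi", "Mumbai"],
        "item":    ["TV", "TV", "PC", "PC"],
        "sales":   [200, 150, 300, 250],
    })

    # Roll-up: aggregate the city dimension away (city -> all).
    rollup = df.groupby(["quarter", "item"], as_index=False)["sales"].sum()

    # Drill-down would go the other way, e.g. from quarters to months (needs month data).

    # Slice: fix one dimension (quarter = Q1) to get a sub-cube.
    slice_q1 = df[df["quarter"] == "Q1"]

    # Dice: fix two or more dimensions to get a smaller sub-cube.
    dice = df[(df["quarter"] == "Q1") & (df["city"].isin(["Delhi", "Mumbai"]))]

    # Pivot: rotate the axes for an alternative presentation of the same data.
    pivoted = df.pivot_table(index="city", columns="quarter", values="sales", aggfunc="sum")
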

Q) Explain the different types of Data Cube Computation Methods.
Data cube computation is an essential task in data warehouse
implementation. The pre computation of all or part of a data cube can
greatly reduce the response time and enhance the performance of online
analytical processing.


The following are the data cube computation methods:

1. Multiway Array Aggregation for Full Cube Computation
The Multiway array aggregation (or simply MultiWay) method computes a
full data cube by using a multidimensional array as its basic data structure.
It is a typical MOLAP approach that uses direct array addressing, where
dimension values are accessed via the position or index of their
corresponding array locations. A different approach is developed for the
array-based cube construction, as follows:
1. Partition the array into chunks. A chunk is a sub-cube that is small enough to fit into the memory available for cube computation. Chunking is a method for dividing an n-dimensional array into small n-dimensional chunks, where each chunk is stored as an object on disk.

The chunks are compressed so as to remove wasted space resulting from


empty array cells. A cell is empty if it does not contain any valid data (i.e.,
its cell count is 0).
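A minimal sketch, assuming a NumPy array, of partitioning a small 3-D array into chunks and aggregating it chunk by chunk (the sizes and data are illustrative):

    import numpy as np

    # Illustrative 3-D array (dimensions A, B, C) to be cubed.
    array = np.arange(8 * 8 * 8).reshape(8, 8, 8)
    CHUNK = 4  # chunk edge size chosen so one chunk fits in memory

    # Partition the n-dimensional array into small n-dimensional chunks.
    chunks = {}
    for a in range(0, 8, CHUNK):
        for b in range(0, 8, CHUNK):
            for c in range(0, 8, CHUNK):
                chunks[(a, b, c)] = array[a:a+CHUNK, b:b+CHUNK, c:c+CHUNK]

    # Aggregate dimension C away chunk by chunk to obtain the AB cuboid,
    # so the full array never has to be memory-resident at once.
    ab_cuboid = np.zeros((8, 8))
    for (a, b, c), ch in chunks.items():
        ab_cuboid[a:a+CHUNK, b:b+CHUNK] += ch.sum(axis=2)
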


2. Computing Iceberg Cubes from the Apex Cuboid Downward (BUC algorithm: Bottom-Up Construction)
• An iceberg cube contains only those cells of the data cube that meet an aggregate condition. It is called an iceberg cube because it contains only some of the cells of the full cube.
• The aggregate condition could be, for example, minimum support or a lower bound on average, min, or max. The purpose of the iceberg cube is to identify and compute only those values that will most likely be required for decision support queries.
• The aggregate condition specifies which cube values are meaningful enough to be stored. This is one solution to the problem of computing versus storing data cubes.
• The figure shows a lattice of cuboids making up a 3-D data cube with the dimensions A, B, and C.

• The apex (0-D) cuboid, representing the concept "all", is at the top of the lattice. The 3-D base cuboid, ABC, is at the bottom of the lattice; it is the least aggregated (most detailed or specialized) level.


• The lattice consolidates the notions of drill-down and roll-up (where we can move from detailed, low-level cells to higher-level, more aggregated cells).
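In SQL terms, the aggregate condition plays the role of a HAVING clause; here is a minimal Python sketch of applying an iceberg condition to one cuboid (the cells and threshold are illustrative assumptions):

    from collections import Counter

    # Illustrative (A, B) cell coordinates taken from fact tuples.
    cells = [("a1", "b1"), ("a1", "b1"), ("a1", "b2"), ("a2", "b1")]
    MIN_SUP = 2  # iceberg condition: HAVING count(*) >= 2

    counts = Counter(cells)
    iceberg_ab = {cell: n for cell, n in counts.items() if n >= MIN_SUP}
    print(iceberg_ab)  # {('a1', 'b1'): 2} -- only cells meeting the threshold are kept
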
3. Star-Cubing: Computing Iceberg Cubes Using a Dynamic Star-Tree Structure
Star-Cubing operates on a data structure called a star-tree, which performs lossless data compression, thereby reducing the computation time and memory requirements.
The Star-Cubing algorithm explores both the bottom-up and top-down computation models as follows: for the global computation order, it uses the bottom-up model. However, it has a sub-layer underneath based on the top-down model, which exploits the notion of shared dimensions, as we shall see in the following.

For example, the notation AC/C means that cuboid AC has shared dimension C. Star-Cubing allows for shared computations; e.g., cuboid C is computed simultaneously with AC. Star-Cubing thus aggregates bottom-up globally, with the top-down sub-layer of shared dimensions underneath, which enables Apriori-style pruning.


Q) What are the Basic approaches for Data Generalization?

There are two basic approaches to data generalization:
1. Data cube approach: also known as the OLAP approach.
2. Attribute-oriented induction approach: used in data mining for data characterization.
The Attribute-Oriented Induction (AOI) approach to data generalization and summarization-based characterization was first proposed in 1989.
The data cube approach can be considered a data warehouse-based, precomputation-oriented, materialized approach. It performs offline aggregation before an OLAP or data mining query is submitted for processing.
On the other hand, the attribute-oriented induction approach, at least in its initial proposal, is a relational database query-oriented, generalization-based, online data analysis technique.
Basic Principles of Attribute-Oriented Induction:
Data focusing:
Analyze the task-relevant data, including dimensions; the result is called the initial relation.
Attribute removal:
Remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes.
Attribute generalization:
If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.
Attribute-threshold control:
Typically 2-8; either specified by the user or a default value.


Example
• Let us say there is a University database that is to be characterized. Its corresponding DMQL query will be:
"use University_DB
mine characteristics as "Science_Students"
in relevance to name, gender, major, birth_place, birth_date, residence, phone_no, GPA
from student"
• Its corresponding SQL statement can be:
"Select name, gender, major, birth_place, birth_date, residence, phone_no, GPA
from student
where status in {"Msc", "MBA", "Ph.D."}"
• Initially, the data is stored in the initial working relation, from which we can remove the attributes that have no meaning with respect to the task-relevant data.
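A hedged sketch of these principles in Python on toy student tuples; the concept hierarchy, grading rule, and threshold are illustrative assumptions:

    # Toy tuples from a hypothetical initial working relation.
    students = [
        {"name": "Ann", "birth_place": "Vancouver", "gpa": 3.8},
        {"name": "Bob", "birth_place": "Toronto",   "gpa": 3.6},
        {"name": "Raj", "birth_place": "Mumbai",    "gpa": 3.9},
    ]

    city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Mumbai": "India"}
    THRESHOLD = 2  # attribute-threshold control: max distinct values allowed

    # Attribute removal: 'name' has many distinct values and no generalization
    # operator, so it is dropped from the working relation.
    working = [{k: v for k, v in r.items() if k != "name"} for r in students]

    # Attribute generalization: 'birth_place' exceeds the threshold and a
    # city -> country hierarchy exists, so climb the hierarchy one level.
    if len({r["birth_place"] for r in working}) > THRESHOLD:
        for r in working:
            r["birth_place"] = city_to_country[r["birth_place"]]

    # Generalize the continuous attribute via an assumed grading rule.
    for r in working:
        r["gpa"] = "excellent" if r["gpa"] >= 3.75 else "very good"

    print(working)  # the generalized (prime) relation, ready for characterization
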
