Unit-2 DM
DATA WAREHOUSING
Data Warehousing Components - Building a Data Warehouse - Multidimensional Data Model -
OLAP Operations in the Multidimensional Data Model - Three-Tier Data Warehouse Architecture -
Schemas for the Multidimensional Data Model - Online Analytical Processing (OLAP) - OLAP vs OLTP -
Integrated OLAM and OLAP Architecture
A data warehouse is an information system that contains historical and cumulative data from single
or multiple sources. It simplifies the reporting and analysis process of the organization. It also serves as a single
version of truth for any company for decision making and forecasting.
A data warehouse has the following characteristics:
Subject-Oriented
Integrated
Time-variant
Non-volatile
Subject-Oriented:
A data warehouse is subject-oriented because it offers information about a theme rather than a company's
ongoing operations. These subjects can be sales, marketing, distribution, etc.
A data warehouse never focuses on the ongoing operations. Instead, it puts emphasis on modelling and
analysis of data for decision making. It also provides a simple and concise view of the specific
subject by excluding data that is not helpful in supporting the decision process.
Integrated:
In a data warehouse, integration means establishing a common unit of measure for all similar
data from dissimilar databases. The data also needs to be stored in the data warehouse in a common
and universally acceptable manner.
A data warehouse is developed by integrating data from varied sources like a mainframe, relational
databases, flat files, etc. Moreover, it must keep consistent naming conventions, format, and coding.
This integration helps in effective analysis of data. Consistency in naming conventions, attribute
measures, encoding structure etc. has to be ensured.
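As a small illustration of integration, the sketch below brings records from two hypothetical source systems to one unit of measure and one encoding. The field names, the gender codes, and the choice of kilograms as the common unit are assumptions made for the example.

```python
# Hypothetical records from two source systems that describe the same kind
# of data with different units and encodings.
source_a = [{"cust": "C1", "gender": "M", "weight_lb": 176.0}]
source_b = [{"cust": "C2", "gender": "female", "weight_kg": 61.5}]

GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}
LB_TO_KG = 0.45359237  # definition of the international pound in kilograms


def integrate(record):
    """Return the record in the warehouse's common format (kg, one-letter gender)."""
    if "weight_kg" in record:
        weight_kg = record["weight_kg"]
    else:
        weight_kg = record["weight_lb"] * LB_TO_KG
    return {
        "customer_id": record["cust"],
        "gender": GENDER_CODES[record["gender"].lower()],
        "weight_kg": round(weight_kg, 2),
    }


# Every row now uses the same naming convention, unit, and encoding.
warehouse_rows = [integrate(r) for r in source_a + source_b]
```

After this step the rows can be compared and aggregated directly, which is the practical benefit of integration.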
Time-Variant:
The time horizon for data warehouse is quite extensive compared with operational systems. The data
collected in a data warehouse is recognized with a particular period and offers information from the
Non-volatile:
A data warehouse is also non-volatile, which means that previous data is not erased when new data is entered
into it. Data is read-only and periodically refreshed. This helps in analyzing historical data and
understanding what happened and when. It does not require transaction processing, recovery, or
concurrency control mechanisms.
Activities like delete, update, and insert, which are performed in an operational application
environment, are omitted in the data warehouse environment. Only two types of data operations
are performed in data warehousing:
1. Data loading
2. Data access
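The two operations can be pictured with a toy table that only supports loading and reading. This is a minimal sketch: the class name `WarehouseTable` and its methods are hypothetical and only illustrate the non-volatile idea.

```python
class WarehouseTable:
    """Toy non-volatile table: rows can be loaded and read, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, batch):
        # Data loading: each refresh appends a new batch; earlier rows are kept.
        self._rows.extend(dict(row) for row in batch)

    def access(self, predicate=lambda row: True):
        # Data access: read-only query that returns copies of the matching rows.
        return [dict(row) for row in self._rows if predicate(row)]


sales = WarehouseTable()
sales.load([{"month": "2023-01", "amount": 100}])
sales.load([{"month": "2023-02", "amount": 120}])  # January data is still present
january = sales.access(lambda row: row["month"] == "2023-01")
```

Because the table offers no update or delete method, history accumulates with every load and remains available for analysis.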
Single-tier architecture:
The objective of a single layer is to minimize the amount of data stored by removing data
redundancy. This architecture is not frequently used in practice.
Two-tier architecture:
Two-layer architecture physically separates the available sources from the data warehouse. This architecture
is not expandable and does not support a large number of end-users. It also has connectivity
problems because of network limitations.
Three-tier architecture:
1. Bottom Tier: The database of the data warehouse serves as the bottom tier. It is usually a
relational database system. Data is cleansed, transformed, and loaded into this layer using
back-end tools.
2. Middle Tier: The middle tier in Data warehouse is an OLAP server which is implemented using
either ROLAP or MOLAP model. For a user, this application tier presents an abstracted view of
the database. This layer also acts as a mediator between the end-user and the database.
3. Top-Tier: The top tier is a front-end client layer. It consists of the tools and APIs that you use to connect
to the data warehouse and get data out of it. These could be query tools, reporting tools, managed
query tools, analysis tools, and data mining tools.
The data warehouse is based on an RDBMS server, which is a central information repository
surrounded by some key components that make the entire environment functional, manageable, and
accessible.
The central database is the foundation of the data warehousing environment. This database is
implemented on RDBMS technology. However, this kind of implementation is constrained by the
fact that a traditional RDBMS is optimized for transactional database processing and not for data
warehousing. For instance, ad-hoc queries, multi-table joins, and aggregates are resource intensive and slow
down performance.
In a data warehouse, relational databases are deployed in parallel to allow for scalability.
Parallel relational databases also allow shared memory or shared nothing model on various
multiprocessor configurations or massively parallel processors.
New index structures are used to bypass relational table scan and improve speed.
Multidimensional databases (MDDBs) are used to overcome limitations imposed
by the relational data model. Example: Essbase from Oracle.
The data sourcing, transformation, and migration tools are used for performing all the conversions,
summarizations, and all the changes needed to transform data into a unified format in the data
warehouse. They are also called Extract, Transform and Load (ETL) Tools.
These Extract, Transform, and Load tools may generate cron jobs, background jobs, Cobol programs,
shell scripts, etc. that regularly update data in data warehouse. These tools are also helpful to
maintain the Metadata.
These ETL Tools have to deal with challenges of Database & Data heterogeneity.
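The extract, transform, and load steps can be sketched in a few lines. The example below is only an illustration under stated assumptions: the flat-file content, the `orders` table, and its columns are made up, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Extract: a hypothetical flat-file export from a source system.
raw = io.StringIO("order_id,amount,currency\n1, 10.50 ,usd\n2,7.25,USD\n")
rows = list(csv.DictReader(raw))

# Transform: trim whitespace, convert types, and use one currency code format.
clean = [
    (int(r["order_id"]), float(r["amount"].strip()), r["currency"].strip().upper())
    for r in rows
]

# Load: write the unified rows into a warehouse table (in-memory SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, currency TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean)
conn.commit()

total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

A real ETL tool adds scheduling, error handling, and metadata maintenance on top of these three steps.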
Metadata
The name metadata suggests some high-level technological concept. However, it is quite simple.
Metadata is data about data which defines the data warehouse. It is used for building, maintaining, and
managing the data warehouse.
In the Data Warehouse Architecture, meta-data plays an important role as it specifies the source,
usage, values, and features of data warehouse data. It also defines how data can be changed and
processed. It is closely connected to the data warehouse.
Metadata helps answer questions such as:
What tables, attributes, and keys does the data warehouse contain?
Where did the data come from?
How many times does the data get reloaded?
What transformations were applied with cleansing?
1. Technical Meta Data: This kind of Metadata contains information about warehouse which is
used by Data warehouse designers and administrators.
2. Business Meta Data: This kind of Metadata contains detail that gives end-users an easy way to
understand information stored in the data warehouse.
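One way to picture metadata is as a catalog entry per warehouse table. The sketch below is hypothetical: the `TableMetadata` class, its fields, and the `sales_fact` entry are assumptions chosen to mirror the technical and business information described above.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """Hypothetical catalog entry describing one warehouse table."""
    table: str
    columns: list          # technical: which attributes the table contains
    keys: list             # technical: which columns form the key
    source: str            # technical: where the data came from
    reload_schedule: str   # technical: how often the data is reloaded
    transformations: list = field(default_factory=list)  # technical: cleansing steps
    business_description: str = ""  # business: plain-language meaning for end-users


catalog = {
    "sales_fact": TableMetadata(
        table="sales_fact",
        columns=["date_key", "store_key", "amount"],
        keys=["date_key", "store_key"],
        source="point-of-sale system export",
        reload_schedule="daily",
        transformations=["trim whitespace", "convert amounts to one currency"],
        business_description="Sales amount per store per day.",
    )
}

entry = catalog["sales_fact"]
```

Designers and administrators would read the technical fields, while end-users would mostly rely on `business_description`.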
Query Tools
One of the primary objectives of data warehousing is to provide information to businesses to make
strategic decisions. Query tools allow users to interact with the data warehouse system.
Reporting tools
Managed query tools
Reporting tools: Reporting tools can be further divided into production reporting tools and desktop
report writers.
1. Report writers: These reporting tools are designed for end-users to carry out their own analysis.
2. Production reporting: These tools allow organizations to generate regular operational
reports. They also support high-volume batch jobs like printing and calculating. Some popular
reporting tools are Brio, Business Objects, Oracle, Power Soft, SAS Institute.
Managed query tools: This kind of access tool helps end users work around difficulties with SQL and the database structure
by inserting a meta-layer between users and the database.
Sometimes built-in graphical and analytical tools do not satisfy the analytical needs of an organization.
In such cases, custom reports are developed using Application development tools.
Data mining is the process of discovering meaningful new correlations, patterns, and trends by mining
large amounts of data. Data mining tools are used to automate this process.
4. OLAP tools:
These tools are based on the concepts of a multidimensional database. They allow users to analyse the data
using elaborate and complex multidimensional views.
Data warehouse Bus determines the flow of data in your warehouse. The data flow in a data
warehouse can be categorized as Inflow, Upflow, Downflow, Outflow and Meta flow.
While designing a data bus, one needs to consider the shared dimensions and facts across data marts.
A data mart is an access layer which is used to get data out to the users. It is presented as an alternative to
a large data warehouse, as it takes less time and money to build. However, there is no standard
definition of a data mart; it differs from person to person.
In simple terms, a data mart is a subsidiary of a data warehouse. The data mart is used to partition
data for a specific group of users.
Data marts could be created in the same database as the Data warehouse or a physically separate
Database.
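A data mart kept in the same database as the warehouse can be as simple as a view over a warehouse table. The following is a minimal sketch using an in-memory SQLite database; the table `warehouse_sales`, the view `marketing_mart`, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE warehouse_sales (region TEXT, department TEXT, amount INTEGER);
INSERT INTO warehouse_sales VALUES
    ('North', 'Marketing', 500),
    ('North', 'Finance', 300),
    ('South', 'Marketing', 200);
-- A data mart in the same database: only the rows the marketing group needs.
CREATE VIEW marketing_mart AS
    SELECT region, amount FROM warehouse_sales WHERE department = 'Marketing';
""")

mart_rows = conn.execute(
    "SELECT region, amount FROM marketing_mart ORDER BY region"
).fetchall()
```

A physically separate data mart would copy these rows into its own database instead of defining a view.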
To design a data warehouse architecture, follow the best practices given below:
Use a data model which is optimized for information retrieval. This can be the dimensional
model, a denormalized approach, or a hybrid approach.
Ensure that data is processed quickly and accurately. At the same time, you should take
an approach which consolidates data into a single version of the truth.
Carefully design the data acquisition and cleansing process for the data warehouse.
Design a metadata architecture which allows sharing of metadata between components of the data
warehouse.
Consider implementing an ODS model when the information retrieval need is near the bottom of
the data abstraction pyramid or when multiple operational sources need to be
accessed.
Make sure that the data model is integrated and not just consolidated. In that case,
you should consider a 3NF data model. It is also ideal for acquiring ETL and data cleansing tools.
Summary:
A data warehouse is an information system that contains historical and cumulative data from
single or multiple sources.
A data warehouse is subject-oriented as it offers information regarding a subject instead of the
organization's ongoing operations.
In a data warehouse, integration means the establishment of a common unit of measure for all
similar data from the different databases.
A data warehouse is also non-volatile, meaning that previous data is not erased when new data is
entered in it.
A data warehouse is time-variant as the data in a DW has a high shelf life.
There are 5 main components of a data warehouse: 1) Database 2) ETL Tools 3) Meta Data 4)
Query Tools 5) Data Marts.
There are four main categories of query tools: 1. Query and reporting tools 2. Application
development tools 3. Data mining tools 4. OLAP tools.
The data sourcing, transformation, and migration tools are used for performing all the
conversions and summarizations.
In the Data Warehouse Architecture, meta-data plays an important role as it specifies the
source, usage, values, and features of data warehouse data.
K.BABU/CSE/ASST PROFESSOR Page 6
Building a Data Warehouse:
In general, building any data warehouse consists of the following steps:
1. Extracting the transactional data from the data sources into a staging area
Fortunately for many small to mid-size companies, Microsoft has come up with an excellent tool for
data extraction. Data Transformation Services (DTS), which is part of Microsoft SQL Server 7.0 and
2000, allows you to import and export data from any OLE DB or ODBC-compliant database as long as
you have an appropriate provider. This tool is available at no extra cost when you purchase Microsoft
SQL Server. The sad reality is that you won't always have an OLE DB or ODBC-compliant data source
to work with, however. If not, you're bound to make a considerable investment of time and effort in
writing a custom program that transfers data from the original source into the staging database.
Most companies have their data spread out in a number of various database management systems: MS
Access, MS SQL Server, Oracle, Sybase, and so on. Many companies will also have much of their data in
flat files, spreadsheets, mail systems and other types of data stores. When building a data warehouse,
you need to relate data from all of these sources and build some type of a staging area that can handle
data extracted from any of these source systems. After all the data is in the staging area, you have to
massage it and give it a common shape. Prior to massaging data, you need to figure out a way to relate
tables and columns of one system to the tables and columns coming from the other systems.
The relational format is not very efficient when it comes to building reports with summary and
aggregate values. The dimensional approach, on the other hand, provides a way to improve query
performance without affecting data integrity. However, the query performance improvement comes
with a storage space penalty; a dimensional database will generally take up much more space than its
relational counterpart. These days, storage space is fairly inexpensive, and most companies can afford
large hard disks with a minimal effort.
The dimensional model consists of the fact and dimension tables. The fact tables consist of foreign
keys to each dimension table, as well as measures. The measures are a factual representation of how
well (or how poorly) your business is doing (for instance, the number of parts produced per hour or
the number of cars rented per day). Dimensions, on the other hand, are what your business users
expect in the reports—the details about the measures. For example, the time dimension tells the user
that 2000 parts were produced between 7 a.m. and 7 p.m. on the specific day; the plant dimension
specifies that these parts were produced by the Northern plant.
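The relationship between fact and dimension tables can be sketched with a small schema. This is an illustrative example only: the table and column names, the date, and the second plant are assumptions, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time  (time_key INTEGER PRIMARY KEY, day TEXT, shift TEXT);
CREATE TABLE dim_plant (plant_key INTEGER PRIMARY KEY, plant_name TEXT);
-- The fact table holds one foreign key per dimension plus the measure.
CREATE TABLE fact_production (
    time_key  INTEGER REFERENCES dim_time(time_key),
    plant_key INTEGER REFERENCES dim_plant(plant_key),
    parts_produced INTEGER
);
INSERT INTO dim_time  VALUES (1, '2024-03-01', '7 a.m. to 7 p.m.');
INSERT INTO dim_plant VALUES (1, 'Northern'), (2, 'Southern');
INSERT INTO fact_production VALUES (1, 1, 2000), (1, 2, 1500);
""")

# Dimensions supply the descriptive details; the fact table supplies the number.
report = conn.execute("""
    SELECT p.plant_name, t.day, t.shift, SUM(f.parts_produced)
    FROM fact_production AS f
    JOIN dim_time  AS t ON t.time_key  = f.time_key
    JOIN dim_plant AS p ON p.plant_key = f.plant_key
    GROUP BY p.plant_name, t.day, t.shift
    ORDER BY p.plant_name
""").fetchall()
```

Each report row pairs a measure (parts produced) with the dimension details that explain it (which plant, which day, which shift).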
Just like any modeling exercise, dimensional modeling is not to be taken lightly. Figuring out the
needed dimensions is a matter of discussing the business requirements with your users over and over
again. When you first talk to the users they have very minimal requirements: "Just give me those
reports that show me how each portion of the company performs." Figuring out what "each portion of
the company" means is your job as a DW architect. The company may consist of regions, each of which
reports to a different vice president of operations. Each region might consist of
areas, which in turn might consist of individual stores. Each store could have several departments.
When the DW is complete, splitting the revenue among the regions won't be enough. That's when
your users will demand more features and additional drill-down capabilities. Instead of waiting for
that to happen, an architect should take proactive measures to get all the necessary requirements
ahead of time.
It's also important to realize that not every field you import from each data source may fit into the
dimensional model. Indeed, if you have a sequential key on a mainframe system, it won't have much
meaning to your business users. Other columns might have had significance eons ago when the system
was built. Since then, the management might have changed its mind about the relevance of such
columns. So don't worry if all of the columns you imported are not part of your dimensional model.
Keep in mind that such data transformations can be performed at either of the two stages: while
extracting the data from their origins or while loading data into the dimensional model. I wouldn't
recommend one way over the other—make a decision depending on the project. If your users need to
be sure that they can extract all the data first, wait until all data is extracted prior to transforming it. If
the dimensions are known prior to extraction, go on and transform the data while extracting it.
Prior to generating aggregations, you need to make an important choice about which dimensional
model to use: ROLAP (Relational OLAP), MOLAP (Multidimensional OLAP), or HOLAP (Hybrid OLAP).
The ROLAP model builds additional tables for storing the aggregates, but this takes much more
storage space than a dimensional database, so be careful! The MOLAP model stores the aggregations
as well as the data in multidimensional format, which is far more efficient than ROLAP. The HOLAP
approach keeps the data in the relational format, but builds aggregations in multidimensional format,
so it's a combination of ROLAP and MOLAP.
Regardless of which dimensional model you choose, ensure that SQL Server has as much memory as
possible. Building aggregations is a memory-intensive operation, and the more memory you provide,
the less time it will take to build aggregate values.
There are several major vendors on the market that have top-notch analytical tools. In addition to the
third-party tools, Microsoft has just released its own tool, Data Analyzer, which can be a cost-effective
alternative. Consider purchasing one of these suites before delving into the process of developing your
own software because reinventing the wheel is not always beneficial or affordable. Building OLAP
tools is not a trivial exercise by any means.
A multidimensional data model stores data in the form of a data cube. Mostly, data warehousing supports
two- or three-dimensional cubes.
A data cube allows data to be viewed in multiple dimensions. Dimensions are entities with respect to
which an organization wants to keep records. For example, in a store's sales records, dimensions allow the
store to keep track of things like monthly sales of items and the branches and locations. A
multidimensional database helps to provide data-related answers to complex business queries
quickly and accurately. Data warehouses and Online Analytical Processing (OLAP) tools are based on a
multidimensional data model. OLAP in data warehousing enables users to view data from different
angles and dimensions.
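A data cube can be pictured as a mapping from one coordinate per dimension to a measure. The sketch below uses a plain dictionary; the dimensions (month, item, location) and all the numbers are made up for illustration.

```python
from collections import defaultdict

# Each cell of the cube is addressed by one value per dimension:
# (month, item, location) -> units sold. All values are made up.
cube = {
    ("Jan", "phone", "Delhi"): 30,
    ("Jan", "laptop", "Delhi"): 12,
    ("Feb", "phone", "Delhi"): 25,
    ("Jan", "phone", "Mumbai"): 40,
}
DIMENSIONS = ("month", "item", "location")


def view_by(dimension):
    """Total the measure along one dimension, summing over the other two."""
    index = DIMENSIONS.index(dimension)
    totals = defaultdict(int)
    for coordinates, units in cube.items():
        totals[coordinates[index]] += units
    return dict(totals)


by_month = view_by("month")
by_location = view_by("location")
```

The same cells answer different questions depending on which dimension is chosen, which is what viewing data from different angles amounts to.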
Schema:
A schema is a logical description of the entire database. It includes the name and description of records
of all record types including all associated data-items and aggregates. Much like a database, a data
warehouse also needs to maintain a schema. A database uses the relational model, while a data
warehouse uses the Star, Snowflake, or Fact Constellation schema. In this chapter, we will discuss the
schemas used in a data warehouse.
Star Schema
Each dimension in a star schema is represented with only one dimension table.
This dimension table contains the set of attributes.
The following diagram shows the sales data of a company with respect to the four dimensions,
namely time, item, branch, and location.
An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows
managers and analysts to gain insight into the information through fast, consistent, and interactive
access to information. This chapter covers the types of OLAP servers, OLAP operations, and the differences between
OLAP, statistical databases, and OLTP.
Types of OLAP Servers
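We have four types of OLAP servers: Relational OLAP (ROLAP), Multidimensional OLAP (MOLAP), Hybrid OLAP (HOLAP), and specialized SQL servers.
OLAP Operations
The OLAP operations on multidimensional data are −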
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in either of two ways: by climbing up a concept hierarchy for a dimension, or by dimension reduction.
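Two common forms of roll-up are climbing a concept hierarchy (for example, from city to country) and dimension reduction (aggregating a dimension away). The sketch below illustrates both on a small dictionary-based cube; the cities, the hierarchy, and the sales figures are assumptions made for the example.

```python
from collections import defaultdict

# Made-up sales cells keyed by (city, quarter).
sales = {
    ("Chennai", "Q1"): 100,
    ("Delhi", "Q1"): 150,
    ("Toronto", "Q1"): 80,
    ("Chennai", "Q2"): 120,
}
CITY_TO_COUNTRY = {"Chennai": "India", "Delhi": "India", "Toronto": "Canada"}


def roll_up_location(cells):
    """Climb the location hierarchy from city to country."""
    result = defaultdict(int)
    for (city, quarter), amount in cells.items():
        result[(CITY_TO_COUNTRY[city], quarter)] += amount
    return dict(result)


def drop_location(cells):
    """Dimension reduction: aggregate the location dimension away entirely."""
    result = defaultdict(int)
    for (_, quarter), amount in cells.items():
        result[quarter] += amount
    return dict(result)


by_country = roll_up_location(sales)
by_quarter = drop_location(sales)
```

Drill-down is the reverse direction: it moves from the country totals back to the more detailed city cells.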
Three-Tier Data Warehouse Architecture
Bottom Tier − The bottom tier of the architecture is the data warehouse database server. It is
a relational database system. We use back-end tools and utilities to feed data into the
bottom tier. These back-end tools and utilities perform the extract, clean, load, and refresh
functions.
Middle Tier − In the middle tier, we have the OLAP Server that can be implemented in either
of the following ways.
o By Relational OLAP (ROLAP), which is an extended relational database management
system. ROLAP maps the operations on multidimensional data to standard
relational operations.
o By the Multidimensional OLAP (MOLAP) model, which directly implements the
multidimensional data and operations.
Top-Tier − This tier is the front-end client layer. This layer holds the query tools and
reporting tools, analysis tools and data mining tools.
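To make the ROLAP idea in the middle tier concrete, the sketch below shows how two multidimensional operations could be expressed as standard relational operations. It is an illustration under assumptions: the `sales` table and its rows are made up, and an in-memory SQLite database stands in for the relational back end.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (city TEXT, country TEXT, quarter TEXT, amount INTEGER);
INSERT INTO sales VALUES
    ('Chennai', 'India', 'Q1', 100),
    ('Delhi', 'India', 'Q1', 150),
    ('Toronto', 'Canada', 'Q1', 80),
    ('Chennai', 'India', 'Q2', 120);
""")

# Roll-up from city to country becomes GROUP BY on the coarser column.
roll_up = conn.execute(
    "SELECT country, quarter, SUM(amount) FROM sales "
    "GROUP BY country, quarter ORDER BY country, quarter"
).fetchall()

# Slice (fixing one dimension value) becomes a WHERE clause.
slice_q1 = conn.execute(
    "SELECT city, amount FROM sales WHERE quarter = ? ORDER BY city", ("Q1",)
).fetchall()
```

A MOLAP server would instead keep the cells in a multidimensional structure and answer the same requests without translating them to SQL.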
The following diagram depicts the three-tier architecture of data warehouse −
Virtual Warehouse
Data mart
Enterprise Warehouse
Virtual Warehouse
A set of views over operational databases is known as a virtual warehouse. It is easy to build a
virtual warehouse, but it requires excess capacity on the operational database
servers.
Data Mart
Data mart contains a subset of organization-wide data. This subset of data is valuable to specific
groups of an organization.
The major distinguishing features between OLTP and OLAP are summarized as follows.
1. Users and system orientation: An OLTP system is customer-oriented and is used for transaction
and query processing by clerks, clients, and information technology professionals. An OLAP system is
market-oriented and is used for data analysis by knowledge workers, including managers, executives,
and analysts.
2. Data contents: An OLTP system manages current data that, typically, are too detailed to be easily
used for decision making. An OLAP system manages large amounts of historical data, provides
facilities for summarization and aggregation, and stores and manages information at different levels of
granularity. These features make the data easier for use in informed decision making.
3. Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an
application-oriented database design. An OLAP system typically adopts either a star or snowflake
model and a subject-oriented database design.
4. View: An OLTP system focuses mainly on the current data within an enterprise or department,
without referring to historical data or data in different organizations. In contrast, an OLAP system
often spans multiple versions of a database schema. OLAP systems also deal with information that
originates from different organizations, integrating information from many data stores. Because of
their huge volume, OLAP data are stored on multiple storage media.
5. Access patterns: The access patterns of an OLTP system consist mainly of short, atomic
transactions. Such a system requires concurrency control and recovery mechanisms. However,
accesses to OLAP systems are mostly read-only operations although many could be complex queries.
Comparison between OLTP and OLAP systems.
Integrated OLAP and OLAM Architecture
Online Analytical Mining (OLAM) integrates Online Analytical Processing (OLAP) with data mining
and with mining knowledge in multidimensional databases. Here is the
diagram that shows the integration of both OLAP and OLAM.
Importance of OLAM
OLAM is important for the following reasons −
High quality of data in data warehouses − The data mining tools are required to work on
integrated, consistent, and cleaned data. These steps are very costly in the preprocessing of
data. The data warehouses constructed by such preprocessing are valuable sources of high
quality data for OLAP and data mining as well.
Available information processing infrastructure surrounding data warehouses −
Information processing infrastructure refers to accessing, integration, consolidation, and