Business Intelligence - Chapter 3

1. The document discusses data warehousing, including design considerations, development approaches, architecture, and OLAP.
2. Key design considerations for a data warehouse include being subject-oriented, integrated, time-variant, nonvolatile, summarized, and not normalized.
3. There are two main approaches to developing a data warehouse: top-down, which provides consistency, and bottom-up, which leads to local ownership.


BUSINESS INTELLIGENCE

Dr. Nasim AbdulWahab Matar


CHAPTER 3

Data Warehousing

2
DECISION TYPES
There are two main kinds of decisions: strategic decisions and operational decisions.
BI can help make both better.
1. Strategic decisions are those that impact the direction of the company. The decision to reach out to a new
customer set would be a strategic decision.
2. Operational decisions are more routine and tactical, focused on developing greater efficiency.
Updating an old website with new features would be an operational decision.

3
DESIGN CONSIDERATIONS FOR DW
The objective of DW is to provide business knowledge to support decision making. For DW to serve its objective, it
should be aligned around those decisions. It should be comprehensive, easy to access, and up to date. Here are some
requirements for a good DW:

1. Subject-oriented: To be effective, DW should be designed around a subject domain, that is, to help solve
a certain category of problems.
2. Integrated: DW should include data from many functions that can shed light on a particular subject area.
Thus, the organization can benefit from a comprehensive view of the subject area.
3. Time-variant (time series): The data in DW should grow at daily or other chosen intervals. That allows
the latest data to be compared over time.
4. Nonvolatile: DW should be persistent, that is, it should not be created on the fly from the operations
databases. Thus, DW is consistently available for analysis, across the organization and over time.
5. Summarized: DW contains rolled-up data at the right level for queries and analysis. The rolling up helps
create consistent granularity for effective comparisons. It also helps reduce the number of variables or
dimensions of the data, making them more meaningful for the decision makers.

4
DESIGN CONSIDERATIONS FOR DW
6. Not normalized: DW often uses a star schema, which is a central fact table surrounded by some lookup
tables. The single-table view significantly enhances the speed of queries.
7. Metadata: Many of the variables in the database are computed from other variables in the operational database.
For example, total daily sales may be a computed field. The method of its calculation for each variable should be
effectively documented. Every element in DW should be sufficiently well-defined.
8. Near real-time and/or right-time (active): DWs should be updated in near real time in many high-transaction-volume
industries, such as airlines. For others, the cost of implementing and updating DW in real time could be discouraging.
Another downside of a real-time DW is the possibility of inconsistencies between reports drawn just a few minutes apart.

5
OLTP AND DW

A database built for online transaction processing (OLTP) is generally regarded as unsuitable for data
warehousing, because it has been designed with a different set of needs in mind (i.e., maximizing transaction capacity,
and typically having hundreds of tables in order not to lock out users). Data warehouses are oriented toward query
processing as opposed to transaction processing.

6
DW DEVELOPMENT APPROACHES
There are two fundamentally different approaches to developing DW: top-down and bottom-up.
• The top-down approach is to build a comprehensive DW that covers all the reporting needs of the enterprise.
• The bottom-up approach is to produce small data marts, for the reporting needs of different departments or
functions, as needed. The smaller data marts will eventually align to deliver comprehensive EDW
capabilities.

7
DW DEVELOPMENT APPROACHES

The top-down approach provides consistency but takes time and resources.

The bottom-up approach leads to healthy local ownership and maintainability of data.

10
COMPARING DATA MART AND DATA WAREHOUSE

11
DW ARCHITECTURE
DW has four key elements.
The first element is the data sources that provide the raw data. The second element is the process of transforming
that data to meet the decision needs. The third element is the methods of regularly and accurately loading that
data into EDW or data marts. The fourth element is the data access and analysis part, where devices and
applications use the data from DW to deliver insights and other benefits to users.

12
DW ARCHITECTURE - DATA SOURCES
DWs are created from structured data sources. Unstructured data, such as text data,
would need to be structured before being inserted into DW.
1. Operations data include data from all business applications, including ERP systems
that form the backbone of an organization's IT systems. The data to be extracted
will depend upon the subject matter of DW. For example, for a sales/marketing DW, only
the data about customers, orders, customer service, and so on would be extracted.

2. Other applications, such as point-of-sale (POS) terminals and e-commerce applications,
provide customer-facing data. Supplier data could come from supply chain management
systems. Planning and budget data should also be added as needed for making
comparisons against targets.

3. External syndicated data, such as weather or economic activity data, could also be
added to DW, as needed, to provide good contextual information to decision makers.

13
DW ARCHITECTURE - DATA TRANSFORMATION PROCESSES
The heart of a useful DW is the set of processes that populate it with good-quality data. This is called the
extract-transform-load (ETL) cycle.
1. Data should be extracted from many operational (transactional) database sources on a regular basis.
2. Extracted data should be aligned together by key fields. It should be cleansed of any irregularities or
missing values. It should be rolled up together to the same level of granularity. Desired fields, such as daily
sales totals, should be computed. The entire data should then be brought to the same format as the
central table of DW.
3. The transformed data should then be uploaded into DW.
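The three-stage cycle above can be sketched in a few lines of Python. This is a toy illustration with made-up source rows and field names (pos_rows, web_rows, daily_sales), not a real pipeline:

```python
# A toy ETL pass (illustrative data and field names, not a real pipeline).

# 1. Extract: rows pulled from two hypothetical operational sources.
pos_rows = [{"day": "2024-01-01", "amount": 10.0},
            {"day": "2024-01-01", "amount": 5.0},
            {"day": "2024-01-02", "amount": None}]   # an irregularity to cleanse
web_rows = [{"day": "2024-01-01", "amount": 7.5},
            {"day": "2024-01-02", "amount": 2.5}]

# 2. Transform: cleanse missing values, align by the key field "day",
#    and roll up to daily granularity (a computed field: daily sales total).
daily_totals = {}
for row in pos_rows + web_rows:
    if row["amount"] is None:        # cleanse: drop irregular records
        continue
    daily_totals[row["day"]] = daily_totals.get(row["day"], 0.0) + row["amount"]

# 3. Load: bring the result into the warehouse's central-table format.
warehouse = [{"day": d, "daily_sales": t} for d, t in sorted(daily_totals.items())]
print(warehouse)
```

The same shape scales up: real ETL tools differ mainly in where each stage runs and how failures and late-arriving data are handled.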

14
ETL PROCESS

The ETL process is generally composed of the following ten steps:
1) Determine all the target data needed in the DW;
2) Determine all the data sources, both internal and external;
3) Prepare data mapping for target data elements from sources;
4) Establish comprehensive data extraction rules;
5) Determine data transformation and cleansing rules;
6) Plan for aggregate tables;
7) Organize data staging area and test tools;
8) Write procedures for all data loads;
9) ETL for dimension tables;
10) ETL for fact tables.

15
DW DESIGN
Star schema is the preferred data architecture for most DWs. There is a central fact table that provides most of
the information of interest, and there are lookup tables that provide detailed values for codes used in the central
table. For example, the central table may use digits to represent a salesperson; the lookup table provides the name
for that salesperson code. Here is an example of a star schema for a data mart for monitoring sales performance.
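A minimal star-schema sketch using Python's built-in sqlite3. The table and column names (fact_sales, dim_salesperson, sp_code) are hypothetical, chosen to mirror the salesperson example above:

```python
import sqlite3

# In-memory database standing in for a small data mart (illustrative schema).
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Lookup (dimension) table: resolves salesperson codes to names.
cur.execute("CREATE TABLE dim_salesperson (sp_code INTEGER PRIMARY KEY, sp_name TEXT)")
cur.executemany("INSERT INTO dim_salesperson VALUES (?, ?)",
                [(1, "Alice"), (2, "Bob")])

# Central fact table: one row per sale, dimensions stored as codes.
cur.execute("CREATE TABLE fact_sales (sp_code INTEGER, sale_date TEXT, amount REAL)")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, "2017-01-05", 100.0),
                 (2, "2017-01-05", 40.0),
                 (1, "2017-01-06", 60.0)])

# A typical star-schema query: join the fact table to the lookup table
# so the report shows names, not codes.
rows = cur.execute("""
    SELECT d.sp_name, SUM(f.amount)
    FROM fact_sales f JOIN dim_salesperson d ON f.sp_code = d.sp_code
    GROUP BY d.sp_name ORDER BY d.sp_name
""").fetchall()
print(rows)  # [('Alice', 160.0), ('Bob', 40.0)]
```

Note how every query follows the same pattern: join the center outward to whichever lookup tables the report needs.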
DW DESIGN

17
DW DESIGN
Other schemas include the snowflake architecture.
The difference between a star and snowflake is that in the latter, the lookup tables can have their own further
lookup tables.

Star Schema
Snowflake Schema

18
DW ACCESS
Data from DW could be accessed for many purposes, through many devices.
1. A primary use of DW is to produce routine management and monitoring reports. For example, a sales
performance report would show sales by many dimensions, compared against the plan. A dashboarding
system will use data from the warehouse and present analysis to users. The data from DW can be used to
populate customized performance dashboards for executives. The dashboard could include drill-down
capabilities to analyze the performance data for root-cause analysis.
2. The data from the warehouse could be used for ad hoc queries and any other applications that make
use of the internal data.
3. Data from DW also serves data-mining purposes. Parts of the data would be extracted, and
then combined with other relevant data, for data mining.

19
OLAP
OLAP (Online Analytical Processing) is software that enables business analysts, managers, and executives to analyze
and visualize business data quickly, consistently, and, above all, interactively. OLAP functionality is characterized by
dynamic, multidimensional analysis of an organization's consolidated data, allowing end-user activities to be both
analytical and navigational. OLAP tools are typically designed to work with denormalized databases. These tools are
able to navigate data from a data warehouse, having a structure suitable for both research and presentation of
information.

20
OLAP

21
OLAP

22
OLAP CHARACTERISTICS

23
OLAP CHARACTERISTICS

24
OLAP CHARACTERISTICS
Six essential characteristics can be seen in OLAP:
1. Query-oriented technology - the main operation in an OLAP environment is querying
data;
2. Data not changed - data is added to DW using the ETL process. Older data are not
replaced by new data; however, they can be migrated to a backup server;
3. Data and queries are managed - it is important to guarantee good performance of
data stored in DW and also a good optimization process for the queries;
4. Multidimensional data view - data are organized into several dimensions of analysis;
5. Complex calculations - math functions can be used to perform calculations on the data;
6. Time series - associated with the data we have the notion of time.

25
OLAP CHARACTERISTICS
There are significant differences between OLTP and OLAP systems. OLTP systems are designed for office workers, while
OLAP systems are designed for decision makers.

26
OLAP TERMINOLOGY
The OLAP basic terminology is composed of several elements: (i) cube; (ii) dimensions; (iii) measures; (iv)
hierarchies; (v) member; and (vi) aggregation.

A cube is a data structure that aggregates measures by the levels and hierarchies of each of the dimensions. Cubes
combine multiple dimensions (such as time, geography, and product lines) with summary data (such as sales or record
numbers).

27
OLAP TERMINOLOGY

Dimensions are the business elements by which the data can be queried. They can be thought of as the 'by' part of
reporting. For example, "I want to see sales by region, by product, by time". In this case region, product, and time would
be three dimensions within the cube, and sales would be a measure (described below). A cube-based environment allows the
user to easily navigate and choose elements or combinations of elements within the dimensional structure.

28
OLAP TERMINOLOGY
Measures are the units of numerical interest, the values being reported on. Typical examples would be unit sales,
sales value and cost.

29
OLAP TERMINOLOGY
Hierarchies are navigable drill paths through a dimension. They are structured like a family tree, and use
some of the same naming conventions (child / parent / descendant). Hierarchies are what brings much of the
power to OLAP reporting, because they allow the user to easily select data at different granularity (day / month /
year), and to drill down through data to additional levels of detail.
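A drill path can be pictured as a parent map from each member to the level above it. This Python sketch, with hypothetical members, walks a day up to its month and year:

```python
# A day -> month -> year hierarchy, as a child-to-parent map (hypothetical members).
parent = {
    "2008-01-01": "2008-01",
    "2008-01-02": "2008-01",
    "2008-01":    "2008",
}

def ancestors(member):
    """Walk the drill path upward and collect every coarser-grained ancestor."""
    path = []
    while member in parent:
        member = parent[member]
        path.append(member)
    return path

print(ancestors("2008-01-01"))  # ['2008-01', '2008']
```

Drilling down is just the reverse walk: from a member to its children.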

30
OLAP TERMINOLOGY

A member is any single element within a hierarchy. For example, in a standard Time hierarchy, 1st January 2008 would
be a member, as would 20th February 2008. However January 2008, or 2008 itself could also be members. The latter
two would be aggregations of the days which belong to them. Members can be physical or calculated. Calculated
members mean that common business calculations and metrics can be encapsulated into the cube, and are available
for easy selection by the user, for example in the simplest case Profit = Sales - Cost.

31
OLAP TERMINOLOGY
Aggregation is a key part of the speed of cube-based reporting. The reason a cube can be very fast when, for
example, selecting data for an entire year is that it has already calculated the answer. Whereas a typical relational
database would potentially sum millions of day-level records on the fly to get an annual total, Analysis Services cubes
calculate these aggregations during the cube build, and hence a well-designed cube can return the answer quickly.
Sum is the most common aggregation method, but it is also possible to use average, max, etc. For example, if storing
dates as measures, it makes no sense to sum them.
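The pre-aggregation idea can be illustrated in Python with toy figures; a real cube engine builds these aggregates far more efficiently, but the trade is the same — compute once at build time, look up at query time:

```python
# Day-level fact rows, as a relational database would store them
# (toy data: 12 months x 28 days, one unit of sales per day).
facts = [("2017", m, d, 1.0) for m in range(1, 13) for d in range(1, 29)]

# At cube-build time, aggregations are computed once per hierarchy level.
aggregates = {}
for year, month, day, sales in facts:
    aggregates[(year,)] = aggregates.get((year,), 0.0) + sales
    aggregates[(year, month)] = aggregates.get((year, month), 0.0) + sales

# At query time, the annual total is a single lookup instead of a table scan.
print(aggregates[("2017",)])   # 336.0  (12 months x 28 days)
print(aggregates[("2017", 1)]) # 28.0   (January)
```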

32
OLAP TERMINOLOGY
A data cube can be represented in a 2-D table, a 3-D table, or a 3-D data cube. Let's consider a scenario where a
company intends to keep track of sales, considering the dimensions item, time, branch, and location. A possible
representation of this information in a 2-D table is given in the figure.

33
OLAP TERMINOLOGY
But here, in this 2-D table, we have records with respect to time and item only. The sales for New Delhi are shown with
respect to the time and item dimensions, according to the type of items sold. If we want to view the sales data with one
more dimension, say the location dimension, then the 3-D view would be useful. The 3-D view of the sales data with
respect to time, item, and location is shown in the figure.

34
OLAP TERMINOLOGY
The 3-D table can also be represented as a 3-D data cube, as shown in the figure.

35
OLAP OPERATIONS
OLAP offers a wide range of operations. The most common operations are:
o Drill-down - disaggregates a dimension;
o Drill-across - involves more than one fact table and goes down in the hierarchy;
o Roll-up - aggregates a dimension by going up in the hierarchy;
o Drill-through - details beyond the cube; goes down to the record level;
o Slice - restricts a value across a dimension;
o Dice - performs restrictions of values in several dimensions; it is applied to the values of the cells;
o Pivot - switches the view axes;
o Rank - sorts the members of a dimension according to some criterion;
o Rotate - performs a rotation of the dimension axes in a given direction;
o Split - performs permutation of values;
o Nest/Unnest - reduces dimensions;
o Push/Pull - merges values.

Besides these most common operations, we can also use standard SQL operations such as joins, unions,
intersections, and differences.

36
OLAP OPERATIONS

37
OLAP OPERATIONS - DRILL-DOWN

Drill-down refers to the process of viewing data at a level of increased detail.
Considering our example, we will perform a drill-down on (country). In Figure 22, we can see the inclusion of a new
dimension called "city". The total of sales is 45 for the product "BBB" in Portugal for the year 2017; to obtain it,
we sum the number of sales for the cities of Porto and Lisbon. A similar situation happens in the other rows
of the table.
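A rough Python illustration of the consistency a drill-down relies on, using a hypothetical city-level split of the 45 sales for product "BBB" in Portugal in 2017 (the per-city numbers are invented; only the total comes from the example):

```python
# City-level rows (invented split of the example's 45 sales across Porto and Lisbon).
rows = [
    {"country": "Portugal", "city": "Porto",  "product": "BBB", "year": 2017, "sales": 20},
    {"country": "Portugal", "city": "Lisbon", "product": "BBB", "year": 2017, "sales": 25},
    {"country": "Spain",    "city": "Madrid", "product": "BBB", "year": 2017, "sales": 30},
]

def total(rows, **filters):
    """Sum the measure over rows matching every given dimension filter."""
    return sum(r["sales"] for r in rows
               if all(r[k] == v for k, v in filters.items()))

# The country-level view (before drilling down)...
assert total(rows, country="Portugal") == 45
# ...disaggregates into city-level detail that sums back to the same value.
assert total(rows, city="Porto") + total(rows, city="Lisbon") == 45
print("drill-down detail is consistent with the rolled-up total")
```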

38
OLAP OPERATIONS - ROLL-UP

Roll-up refers to the process of viewing data with decreasing detail.
Considering our example, we will perform a roll-up on (country). Several situations can happen, but we will
consider two scenarios. In the first scenario we have the information regarding the continent to which each
country belongs. The result for this scenario is presented in the figure. Four lines appear for the "Europe"
continent, because there are two products and two years.
OLAP OPERATIONS - ROLL-UP

In the second scenario we consider that the "country" dimension will disappear: the data are aggregated over
this dimension. The result is depicted in the figure. There are only four lines in the table, with information
regarding each product and year.
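The second scenario, where the "country" dimension disappears, can be sketched in Python with made-up rows (two products, two years, two countries):

```python
# Country-level rows (invented figures: 2 products x 2 years x 2 countries).
rows = [
    {"country": "Portugal", "product": "AAA", "year": 2016, "sales": 10},
    {"country": "Spain",    "product": "AAA", "year": 2016, "sales": 15},
    {"country": "Portugal", "product": "AAA", "year": 2017, "sales": 12},
    {"country": "Spain",    "product": "AAA", "year": 2017, "sales": 8},
    {"country": "Portugal", "product": "BBB", "year": 2016, "sales": 7},
    {"country": "Spain",    "product": "BBB", "year": 2016, "sales": 3},
    {"country": "Portugal", "product": "BBB", "year": 2017, "sales": 45},
    {"country": "Spain",    "product": "BBB", "year": 2017, "sales": 30},
]

# Roll-up: aggregate away "country", keeping only (product, year).
rolled = {}
for r in rows:
    key = (r["product"], r["year"])     # the "country" dimension disappears
    rolled[key] = rolled.get(key, 0) + r["sales"]

print(sorted(rolled.items()))
# four lines remain: one per (product, year) combination
```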

40
OLAP OPERATIONS - SLICE AND DICE
Slice and dice refer to a strategy for segmenting, viewing, and understanding data in a database. Users slice
and dice by cutting a large segment of data into smaller parts, and repeating this process until arriving at the
right level of detail for analysis. Slicing and dicing helps provide a closer view of data for analysis and presents
data in new and diverse perspectives.
First, we will explain the use of slice considering our example. Let's make a slice per year, as shown in Figure 25
(e.g., slice(Year="2017")). All information regarding the year 2016 is omitted.
OLAP OPERATIONS - SLICE AND DICE
Then, we will use the dice operation, which has a very similar function but uses more than one dimension. Let's
consider a dice per year and number of sales. The operation will be the following: dice(Year="2017" and No. of
sales > 100). The result of this operation is depicted in Figure 26. Only records that meet both conditions
appear in the result.
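Both operations amount to filtering. A plain-Python sketch with invented figures:

```python
# Toy cube rows (invented sales figures).
rows = [
    {"product": "AAA", "year": 2016, "sales": 120},
    {"product": "AAA", "year": 2017, "sales": 150},
    {"product": "BBB", "year": 2016, "sales": 90},
    {"product": "BBB", "year": 2017, "sales": 45},
]

# Slice: restrict a single dimension, e.g. slice(Year="2017").
slice_2017 = [r for r in rows if r["year"] == 2017]

# Dice: restrict several dimensions/values at once,
# e.g. dice(Year="2017" and No. of sales > 100).
dice_result = [r for r in rows if r["year"] == 2017 and r["sales"] > 100]

print(len(slice_2017), len(dice_result))  # 2 1
```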

42
OLAP OPERATIONS - PIVOTING
Pivoting doesn't have any effect on the data but changes the way the dimensions are shown. Considering our
example, we will perform a pivoting per year. Therefore the "year" dimension will be the first column in our
table, as shown in the figure.
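A pivot only rearranges the axes. In Python terms (toy rows), the same values are re-keyed so that "year" leads:

```python
# Toy rows (invented figures); the values never change, only the layout does.
rows = [
    {"product": "AAA", "year": 2016, "sales": 120},
    {"product": "BBB", "year": 2016, "sales": 90},
    {"product": "AAA", "year": 2017, "sales": 150},
    {"product": "BBB", "year": 2017, "sales": 45},
]

# Pivot per year: "year" becomes the leading axis of the view.
pivoted = {}
for r in rows:
    pivoted.setdefault(r["year"], {})[r["product"]] = r["sales"]

print(pivoted)  # {2016: {'AAA': 120, 'BBB': 90}, 2017: {'AAA': 150, 'BBB': 45}}
```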

43
OLAP OPERATIONS - RANK
The "rank" operation is another function that doesn't perform any change on the data; however, the elements
are ordered by a given dimension. Let's consider the following function: rank(No. of sales). The result of this
operation is depicted in the figure.
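A minimal Python rendering of rank(No. of sales) over invented rows:

```python
# Toy rows (invented figures).
rows = [
    {"product": "AAA", "year": 2017, "sales": 150},
    {"product": "BBB", "year": 2017, "sales": 45},
    {"product": "AAA", "year": 2016, "sales": 120},
]

# rank(No. of sales): order the rows by the measure; the data itself is unchanged.
ranked = sorted(rows, key=lambda r: r["sales"], reverse=True)
print([r["sales"] for r in ranked])  # [150, 120, 45]
```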

44
OLAP - ARCHITECTURE
The most common architectures for OLAP are: (i) ROLAP; (ii) MOLAP;
and (iii) HOLAP. A comparative analysis among these architectures
is given in this figure.

There are also other, less common architectures that appear in OLAP. Among them, we
highlight DOLAP, JOLAP and SOLAP.

45
OLAP - ARCHITECTURE
The DOLAP architecture is a desktop OLAP architecture, that is, a tool for users who
have a copy of the multidimensional database or a subset of it, or who want to access a
central data repository locally. The user accesses this repository, triggers an SQL
statement against the existing cubes in the multidimensional database residing on
the OLAP server, and receives a micro-cube to be analyzed on the workstation. The advantage of this
architecture is reducing the overhead on the database server, since all OLAP processing
happens on the client machine; the disadvantage is that the micro-cube cannot be very
large, otherwise the analysis becomes time-consuming and the client cannot support it.
JOLAP is a Java API for OLAP, and SOLAP is the application of OLAP to geographic
information systems.

46
OLAP - ARCHITECTURE - DOLAP

47
OLAP - ARCHITECTURE - MOLAP
In the MOLAP architecture the data is stored in a multidimensional
database, where the MOLAP server operates; the user mounts and
manipulates the different data on the server. Data in a
multidimensional database is stored in a space smaller than that used to
store the same data in a relational database. In the multidimensional
database, data are kept in array data structures in order to provide better
performance when accessing them. In addition to being a fast architecture,
another advantage is the rich and complex set of analysis functions present
in multidimensional databases.

48
OLAP - ARCHITECTURE - MOLAP

49
OLAP - ARCHITECTURE - MOLAP
One of its limitations is the possibility of the data being sparse (not every crossing of the
dimensions contains data), leading to the so-called data storage explosion: a huge
multidimensional database containing little stored data. Other limitations of this tool are
related to the fact that multidimensional databases are proprietary systems that do not
follow standards; that is, each vendor creates its own structure for the database and its
own support tools.

50
OLAP - ARCHITECTURE - MOLAP

51
OLAP - ARCHITECTURE - MOLAP

52
OLAP - ARCHITECTURE - MOLAP

The main advantages of MOLAP include:
o High performance - cubes are built for fast data recovery;
o Can perform complex calculations - all calculations are pre-generated when the cube
is created and can be easily applied at the time of the data search.

On the other side, the main disadvantages are:
o Low scalability - its advantage of achieving high performance by pre-generating
all calculations at cube-creation time makes MOLAP limited to a small amount
of data. This deficiency can be circumvented by including only a summary of the
calculations when constructing the cube;
o High investments - this model requires huge additional investments, as it is built on
proprietary technology.

53
OLAP - ARCHITECTURE - ROLAP

The ROLAP architecture is a simulation of OLAP technology on relational databases
that, by using the relational structure, has the advantage of not restricting the volume of
data storage. This tool does not use pre-calculated cubes like MOLAP: as the user builds
a query in a graphical interface, the tool accesses the metadata, or any other resources
it has, to generate an SQL query.
Its main feature is the possibility of making any query, better serving users who do not
have a well-defined analysis scope. This tool has the advantage of using established,
open, and standardized technology, benefiting from the diversity of
platforms, scalability, and hardware parallelism. Its disadvantages are the poor set of
functions for dimensional analysis and the poor performance of the SQL language in the
execution of heavy queries.

54
OLAP - ARCHITECTURE - ROLAP

55
OLAP - ARCHITECTURE - ROLAP
The main advantages of ROLAP include:
o High scalability - using the ROLAP architecture, there is no restriction on the quantity
of data to be analyzed; the only limitation is the relational database used;
o Takes advantage of the inherent functionality of the relational database - many
relational databases already come with a number of features, and the ROLAP
architecture can leverage them.

On the other side, the main disadvantages of ROLAP are:
o Low performance - each ROLAP report is basically an SQL query (or multiple SQL
queries) against the relational database, and a query can be significantly time-consuming
if there is a large amount of data;
o Limited by SQL features - ROLAP relies primarily on generating SQL statements to
query the relational database, but these statements do not meet all requirements.
For example, it is difficult to perform complex calculations using SQL.
56
OLAP - ARCHITECTURE - HOLAP
HOLAP architecture, or hybrid processing, has become more popular in today's products
because it combines the capabilities and scalability of ROLAP tools with the superior
performance of multidimensional databases. For example, assume a base of 50,000
customers in 300 cities, 20 states, 5 regions, and a grand total. Down to the city level,
multidimensional storage would resolve queries for sales totals. However, if it were
necessary to query a single customer's total sales, the relational database would respond much
faster to the request. This situation is typical of those indicating the HOLAP architecture.

The main advantages of HOLAP include:
o High performance - dimensional cubes store only summarized information;
o High scalability - the details of the information are stored in a relational database.

On the other side, the main disadvantage is:
o Complex architecture - this model presents the highest acquisition and maintenance
costs.
OLAP - ARCHITECTURE - HOLAP

58
VIRTUAL CUBES
A virtual cube is a logical view of parts of one or more cubes, in which dimensions and
measurements are selected from the original cubes and included in the virtual cube.
Virtual cubes are often likened to views in a relational database. A virtual cube merges
portions of two existing cubes so that a combination of dimensions and measures can be
analyzed through the single, virtual cube. For example, a retailer may have two cubes:
one that stores the number of visitors to its website and another that stores purchases. A
virtual cube could be used to correlate the data from both cubes to calculate the average
sales per website visit (Search Data Management).
Virtual cubes can also be used to prevent unauthorized users from viewing private or
sensitive information. For example, if a cube has both sensitive and non-sensitive
information, the non-sensitive information can be made available in a virtual cube for
those users who need it. The sensitive data, meanwhile, remain in the existing cube
where it is accessed by authorized users.
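The sales-per-visit example can be mimicked in Python (hypothetical monthly figures). Note that the virtual cube stores nothing; it only derives values from the two underlying cubes on their shared time dimension:

```python
# Two "cubes" sharing the time dimension (hypothetical monthly figures):
# one tracks website visitors, the other tracks purchases.
visits_cube = {"2024-01": 1000, "2024-02": 800}
sales_cube  = {"2024-01": 2500.0, "2024-02": 2400.0}

# The virtual cube is a logical view: it holds no data of its own and
# computes the derived measure from the shared dimension members.
def avg_sales_per_visit(month):
    return sales_cube[month] / visits_cube[month]

print(avg_sales_per_visit("2024-01"))  # 2.5
```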
59
VIRTUAL CUBES

60
VIRTUAL CUBES

61
VIRTUAL CUBES

62
VIRTUAL CUBES OFFER THE FOLLOWING BENEFITS:
o Storage and performance can be optimized on a case-by-case basis. In this way, it
becomes possible to maintain the best design approach for each individual cube;
o Allows the possibility of having overall analysis while keeping, for the sake of
simplicity, the separate cubes. In this sense, users can query cubes together as long
as they share at least one common dimension.

63
PARTITIONING
Partitioning is done to improve performance and make data management easier.
Partitioning also helps balance the various system requirements. It optimizes hardware
performance and simplifies data warehouse management by dividing each fact table into
several separate partitions.
Partitioning can be done for the following reasons:
o For easy management - the fact table in a data warehouse can grow up to hundreds of
gigabytes in size. This huge size of the fact table is very hard to manage as a single entity.
Therefore, it needs partitioning;
o To assist backup/recovery - if we do not partition the fact table, then we have to load
the complete fact table with all the data. Partitioning allows us to load only as much data
as is required on a regular basis. It reduces the load time and also enhances the
performance of the system;
o To enhance performance - by partitioning the fact table into sets of data, the query
procedures can be enhanced. Query performance is enhanced because the query
scans only those partitions that are relevant; it does not have to scan the whole data set.
PARTITIONING
There are generally three types of partitioning: (i) horizontal; (ii) vertical; and (iii)
hardware.
In horizontal partitioning the fact table is partitioned after the first few thousand
entries. This is because in most cases not all the information in the fact table is needed all
the time. Therefore, horizontal partitioning helps to reduce the query access time by
directly cutting down the amount of data to be scanned by the queries. Horizontally
partitioning the fact table is a good way to speed up queries, by minimizing the set of
data to be scanned (without using an index).

65
PARTITIONING
Different strategies can be used for horizontal partitioning. Among them we highlight:
o Partitioning by time, which typically leads to differently sized segments;
o Partitioning by geographical location, which typically leads to very asymmetrically sized segments;
o Partitioning by size of table, which typically implies that there are tables that will never be partitioned;
o Using round-robin partitions, which is typically more difficult to manage.
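The first strategy, partitioning by time, can be sketched in Python (illustrative rows): each fact row is routed to a per-year partition, and a query then touches only the partition it needs:

```python
# Toy fact rows (invented dates and sales figures).
fact_rows = [
    {"date": "2016-03-01", "sales": 10},
    {"date": "2017-05-02", "sales": 45},
    {"date": "2017-07-09", "sales": 30},
]

# Horizontal partitioning by time: route each row to a per-year partition.
partitions = {}
for row in fact_rows:
    partitions.setdefault(row["date"][:4], []).append(row)

# A query for 2017 now scans only the relevant partition, not the whole table.
scanned = partitions["2017"]
print(len(scanned), sum(r["sales"] for r in scanned))  # 2 75
```

As the strategy list notes, real per-year partitions end up differently sized, since sales volume varies year to year.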

66
PARTITIONING
Vertical partitioning splits the data vertically. It can be performed using
a normalization or a row-splitting technique. The following figure depicts how vertical
partitioning is done.

67
PARTITIONING
Another possibility is to use hardware partitioning. The idea is to optimize the database
for the specific hardware architecture. The exact details of the optimization
depend on the hardware platform. However, some guidelines can be used:
o Maximize the available processing power;
o Minimize disk accesses and I/O operations;
o Reduce bottlenecks at the CPU and in I/O throughput.

68

69
CONCLUSION
DWs are special data management facilities intended for creating
reports and analysis to support managerial decision making. They
are designed to make reporting and querying simple and efficient.
The sources of data are operational systems and external data
sources. DW needs to be updated with new data regularly to keep it
useful. Data from DW provides a useful input for data mining
activities.

70
HOMEWORK

71
THANK YOU
Dr. Nasim AbdulWahab Matar
Head of E-Business and MIS Department @ University of Petra

nmatar@uop.edu.jo

EXT: 9400
