Business Intelligence - Chapter 3
Data Warehousing
DECISION TYPES
There are two main kinds of decisions: strategic decisions and operational decisions.
BI can help make both better.
1. Strategic decisions are those that impact the direction of the company. The decision to reach out to a new
customer set would be a strategic decision.
2. Operational decisions are more routine and tactical decisions, focused on developing greater efficiency.
Updating an old website with new features would be an operational decision.
DESIGN CONSIDERATIONS FOR DW
The objective of DW is to provide business knowledge to support decision making. For DW to serve its objective, it
should be aligned around those decisions. It should be comprehensive, easy to access, and up-to-date. Here are some
requirements for a good DW:
1. Subject-oriented: To be effective, DW should be designed around a subject domain, that is, to help solve
a certain category of problems.
2. Integrated: DW should include data from many functions that can shed light on a particular subject area.
Thus, the organization can benefit from a comprehensive view of the subject area.
3. Time-variant (time series): The data in DW should grow at daily or other chosen intervals. That allows
comparisons over time that include the latest data.
4. Nonvolatile: DW should be persistent, that is, it should not be created on the fly from the operations
databases. Thus, DW is consistently available for analysis, across the organization and over time.
5. Summarized: DW contains rolled-up data at the right level for queries and analysis. The rolling up helps
create consistent granularity for effective comparisons. It helps reduce the number of variables or
dimensions of the data to make them more meaningful for the decision makers.
6. Not normalized: DW often uses a star schema, which is a rectangular central table, surrounded by some lookup
tables. The single-table view significantly enhances speed of queries.
7. Metadata: Many of the variables in the database are computed from other variables in the operational database.
For example, total daily sales may be a computed field. The method of its calculation for each variable should be
effectively documented. Every element in DW should be sufficiently well-defined.
8. Near real-time and/or right-time (active): DWs should be updated in near real-time in many high-transaction
volume industries, such as airlines. The cost of implementing and updating DW in real time could discourage others.
Another downside of real-time DW is the possibility of inconsistencies in reports drawn just a few minutes apart.
OLTP AND DW
A database built for online transaction processing (OLTP) is generally regarded as unsuitable for data
warehousing because it has been designed with a different set of needs in mind (i.e., maximizing transaction capacity
and typically having hundreds of tables in order not to lock out users). Data warehouses are concerned with query
processing as opposed to transaction processing.
DW DEVELOPMENT APPROACHES
There are two fundamentally different approaches to developing DW: top down and bottom up.
• The top-down approach
The top-down approach is to make a comprehensive DW that covers all the reporting needs of the enterprise.
• The bottom-up approach
The bottom-up approach is to produce small data marts, for the reporting needs of different departments or
functions, as needed. The smaller data marts will eventually align to deliver comprehensive EDW
capabilities.
COMPARING DATA MART AND DATA WAREHOUSE
DW ARCHITECTURE
DW has four key elements.
The first element is the data sources that provide the raw data. The second element is the process of transforming
that data to meet the decision needs. The third element is the methods for regularly and accurately loading that
data into EDW or data marts. The fourth element is the data access and analysis part, where devices and
applications use the data from DW to deliver insights and other benefits to users.
DW ARCHITECTURE - DATA SOURCES
DWs are created from structured data sources. Unstructured data, such as text data,
would need to be structured before being inserted into DW.
1. Operations data include data from all business applications, including the ERP
systems that form the backbone of an organization’s IT systems. The data to be extracted
will depend upon the subject matter of DW. For example, for a sales/marketing DW, only
the data about customers, orders, customer service, and so on would be extracted.
3. External syndicated data, such as weather or economic activity data, could also be
added to DW, as needed, to provide good contextual information to decision makers.
DW ARCHITECTURE - DATA TRANSFORMATION PROCESSES
The heart of a useful DW is the set of processes that populate it with good-quality data. This is called the
extract-transform-load (ETL) cycle.
1. Data should be extracted from many operational (transactional) database sources on a regular basis.
2. Extracted data should be aligned together by key fields. It should be cleansed of any irregularities or
missing values. It should be rolled up together to the same level of granularity. Desired fields, such as daily
sales totals, should be computed. All the data should then be brought to the same format as the
central table of DW.
3. The transformed data should then be uploaded into DW.
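The three ETL steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the source rows, the store lookup, and the function names (extract, transform, load) are assumptions made for the example, not part of any particular ETL tool.

```python
from collections import defaultdict

# Hypothetical rows from an operational (transactional) source.
orders = [
    {"order_id": 1, "store": "S1", "date": "2017-03-01", "amount": "120.50"},
    {"order_id": 2, "store": "S1", "date": "2017-03-01", "amount": "79.50"},
    {"order_id": 3, "store": "S2", "date": "2017-03-01", "amount": None},  # missing value
    {"order_id": 4, "store": "S2", "date": "2017-03-02", "amount": "200.00"},
]
# Hypothetical second source, aligned with the orders by the key field "store".
stores = {"S1": "North", "S2": "South"}

def extract():
    """Step 1: pull rows from the operational source."""
    return list(orders)

def transform(rows):
    """Step 2: cleanse, align by key field, and roll up to daily totals."""
    daily = defaultdict(float)
    for row in rows:
        if row["amount"] is None:        # cleanse: drop rows with missing amounts
            continue
        region = stores[row["store"]]    # align with the store lookup by key
        daily[(row["date"], region)] += float(row["amount"])
    return [{"date": d, "region": r, "daily_sales": total}
            for (d, r), total in sorted(daily.items())]

def load(warehouse, rows):
    """Step 3: append the transformed rows to the warehouse table."""
    warehouse.extend(rows)

warehouse = []
load(warehouse, transform(extract()))
```

After the run, the warehouse list holds one row per date and region, with daily sales totals as the computed field.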
ETL PROCESS
DW DESIGN
Star schema is the preferred data architecture for most DWs. There is a central fact table that provides most of
the information of interest. There are lookup tables that provide detailed values for codes used in the central table.
For example, the central table may use digits to represent a salesperson. The lookup table will help provide the name
for that salesperson code. Here is an example of a star schema for a data mart for monitoring sales performance.
Other schemas include the snowflake architecture.
The difference between a star and snowflake is that in the latter, the lookup tables can have their own further
lookup tables.
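A star schema of the kind described above can be sketched with Python's built-in sqlite3 module. The table and column names (sales_fact, salesperson, product) and the sample rows are assumptions made for illustration only. In a snowflake variant, a lookup table such as product would itself reference a further lookup table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Lookup tables: resolve the codes used in the central fact table.
    CREATE TABLE salesperson (sp_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product     (prod_id INTEGER PRIMARY KEY, name TEXT);
    -- Central fact table: one row per sale, holding codes and measures.
    CREATE TABLE sales_fact (
        sp_id     INTEGER REFERENCES salesperson(sp_id),
        prod_id   INTEGER REFERENCES product(prod_id),
        sale_date TEXT,
        amount    REAL
    );
    INSERT INTO salesperson VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO product VALUES (10, 'Laptop'), (20, 'Phone');
    INSERT INTO sales_fact VALUES
        (1, 10, '2017-01-05', 900.0),
        (1, 20, '2017-01-06', 400.0),
        (2, 20, '2017-01-06', 450.0);
""")

# The lookup table turns the salesperson code into a name for the report.
rows = conn.execute("""
    SELECT sp.name, SUM(f.amount)
    FROM sales_fact AS f
    JOIN salesperson AS sp ON sp.sp_id = f.sp_id
    GROUP BY sp.name
    ORDER BY sp.name
""").fetchall()
```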
Figure: star schema and snowflake schema.
DW ACCESS
Data from DW could be accessed for many purposes, through many devices.
1. A primary use of DW is to produce routine management and monitoring reports. For example, a sales
performance report would show sales by many dimensions, compared with the plan. A dashboarding
system will use data from the warehouse and present analysis to users. The data from DW can be used to
populate customized performance dashboards for executives. The dashboard could include drill-down
capabilities to analyze the performance data for root cause analysis.
2. The data from the warehouse could be used for ad hoc queries and any other applications that make
use of the internal data.
3. Data from DW is used to provide data for mining purposes. Parts of the data would be extracted, and
then combined with other relevant data, for data mining.
OLAP
OLAP (Online Analytical Processing) is software that enables business analysts, managers and executives to analyze
and visualize business data quickly, consistently and primarily interactively. OLAP functionality is characterized by
dynamic, multidimensional analysis of an organization's consolidated data, allowing end-user activities to be both
analytical and navigational. OLAP tools are typically designed to work with denormalized databases. These tools are
able to navigate data from a Data Warehouse, having a structure suitable for both research and presentation of
information.
OLAP CHARACTERISTICS
Six essential characteristics can be seen in OLAP:
1. Query-oriented technology - the main operation in OLAP environment is querying
data;
2. Data not changed - data is added to DW using the ETL process. Older data are not
replaced by the new data. However, it can be migrated to a backup server;
3. Data and queries are managed - it is important to guarantee a good performance of
data stored in DW and also a good optimization process for the queries;
4. Multidimensional data view - data are organized along several dimensions of analysis;
5. Complex calculations - math functions can be used to perform calculations on data;
6. Time series - the data carry an associated notion of time.
There are significant differences between OLTP and OLAP systems. OLTP systems are designed for office workers while
the OLAP systems are designed for decision makers.
OLAP TERMINOLOGY
The OLAP basic terminology is composed of several elements: (i) cube; (ii) dimensions; (iii) measures; (iv)
hierarchies; (v) member; and (vi) aggregation.
A cube is a data structure that aggregates measures by the levels and hierarchies of each of the dimensions. Cubes
combine multiple dimensions (such as time, geography, and product lines) with summary data (such as sales or record
numbers).
Dimensions are the business elements by which the data can be queried. They can be thought of as the ‘by’ part of
reporting. For example: “I want to see sales by region, by product, by time”. In this case region, product and time would
be three dimensions within the cube, and sales would be a measure (defined below). A cube-based environment allows the
user to easily navigate and choose elements or combinations of elements within the dimensional structure.
Measures are the units of numerical interest, the values being reported on. Typical examples would be unit sales,
sales value and cost.
Hierarchies are navigable drill paths through a dimension. They are structured like a family tree, and use
some of the same naming conventions (children / parent / descendant). Hierarchies are what bring much of the
power to OLAP reporting, because they allow the user to easily select data at different granularity (day / month /
year), and to drill down through data to additional levels of detail.
A member is any single element within a hierarchy. For example, in a standard Time hierarchy, 1st January 2008 would
be a member, as would 20th February 2008. However, January 2008 or 2008 itself could also be members. The latter
two would be aggregations of the days which belong to them. Members can be physical or calculated. Calculated
members mean that common business calculations and metrics can be encapsulated into the cube, and are available
for easy selection by the user, for example in the simplest case Profit = Sales - Cost.
Aggregation is a key part of the speed of cube-based reporting. A cube can be very fast, for example when
selecting data for an entire year, because it has already calculated the answer. Whereas a typical relational
database would potentially sum millions of day-level records on the fly to get an annual total, Analysis Services cubes
calculate these aggregations during the cube build, and hence a well-designed cube can return the answer quickly.
Sum is the most common aggregation method, but it’s also possible to use average, max etc. For example, if storing
dates as measures it makes no sense to sum them.
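The idea of calculating aggregations during the cube build can be sketched as follows. This is a simplified illustration and not how Analysis Services is implemented: the fact records and the function names are assumptions, and only the sum aggregation over a day / month / year hierarchy is shown.

```python
from collections import defaultdict

# Hypothetical day-level fact records: (date as "YYYY-MM-DD", sales amount).
facts = [
    ("2008-01-01", 100.0),
    ("2008-01-02", 150.0),
    ("2008-02-20", 200.0),
    ("2009-01-01", 50.0),
]

def build_time_aggregates(records):
    """Pre-compute sums for every member of a day -> month -> year hierarchy."""
    totals = defaultdict(float)
    for date, amount in records:
        year, month, _day = date.split("-")
        totals[date] += amount                 # day-level member
        totals[f"{year}-{month}"] += amount    # month-level member
        totals[year] += amount                 # year-level member
    return dict(totals)

cube = build_time_aggregates(facts)   # done once, at cube build time

def query(member):
    """Answering a query is now a lookup, not a scan of the fact records."""
    return cube[member]
```

With this layout, query("2008") reads one pre-computed value instead of summing every day-level record for the year.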
A data cube can be represented in a 2-D table, 3-D table or in a 3-D data cube. Let's consider a scenario where a
company intends to keep track of sales, considering the dimensions: item, time, branch, and location. A possible
representation of this information in a 2-D table is given in Figure
But here in this 2-D table, we have records with respect to time and item only. The sales for New Delhi are shown with
respect to time, and item dimensions according to the type of items sold. If we want to view the sales data with one
more dimension, say, the location dimension, then the 3-D view would be useful. The 3-D view of the sales data with
respect to time, item, and location is shown in Figure
However, the 3-D table can be represented as a 3-D data cube as shown in the Figure
OLAP OPERATIONS
OLAP offers a wide range of operations. The most common operations are:
o Drill-down - disaggregates a dimension;
o Drill-across - involves more than one fact table and goes down in the hierarchy;
o Roll-up - aggregates a dimension by going up in the hierarchy;
o Drill-through - details beyond the cube; goes to the record level;
o Slice - restricts a value across a dimension;
o Dice - performs restrictions of values in several dimensions. It is applied to the values of the cells;
o Pivot - switches the view axis;
o Rank - sorts the members of a dimension according to some criteria;
o Rotate - performs a rotation of the dimension axes in a given direction;
o Split - performs permutation of values;
o Nest/Unnest - reduces the number of dimensions;
o Push/Pull - merges values.
Besides those most common operations, we can also use standard SQL operations such as joins, unions,
intersections and differences.
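Roll-up and drill-down, the first two operations in the list above, can be sketched on a small time hierarchy (quarter within year). The fact rows and function names are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, number of sales).
facts = [
    (2016, "Q1", 80), (2016, "Q2", 120),
    (2017, "Q1", 90), (2017, "Q2", 150),
]

def aggregate(rows, level):
    """Sum the measure at the requested level of the time hierarchy."""
    totals = defaultdict(int)
    for year, quarter, sales in rows:
        key = year if level == "year" else (year, quarter)
        totals[key] += sales
    return dict(totals)

by_quarter = aggregate(facts, "quarter")   # detailed level
by_year = aggregate(facts, "year")         # roll-up: quarter -> year

def drill_down(year):
    """Drill-down: expand one year back into its quarters."""
    return {k: v for k, v in by_quarter.items() if k[0] == year}
```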
OLAP OPERATIONS - DRILL-DOWN
OLAP OPERATIONS - ROLL-UP
OLAP OPERATIONS - SLICE AND DICE
Slice and dice refer to a strategy for
segmenting, viewing and understanding
data in a database. Users slice and dice by
cutting a large segment of data into smaller
parts, and repeating this process until
arriving at the right level of detail for
analysis. Slicing and dicing helps provide a
closer view of data for analysis and presents
data in new and diverse perspectives.
First, we will explain the use of slice
considering our example. Let's make a slice
per year, as shown in Figure 25 (e.g.,
slice(Year="2017")). All information
regarding the year 2016 is omitted.
Then, we will use the dice operation that has a very similar function, but it uses more than one dimension. Let's
consider a dice per year and number of sales. The operation will be the following: dice(Year="2017" and No. of
sales > 100). The result of this operation is depicted in Figure 26. Only records that satisfy both conditions
appear in the result.
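The slice and dice operations described above can be sketched over a small list of records. The records, the field names, and the helper names slice_cube and dice_cube are assumptions for illustration; the year is stored as an integer here rather than as the string used in the slide notation.

```python
# Hypothetical records with two dimensions (year, item) and one measure (sales).
records = [
    {"year": 2016, "item": "Mobile", "sales": 150},
    {"year": 2016, "item": "Modem", "sales": 60},
    {"year": 2017, "item": "Mobile", "sales": 180},
    {"year": 2017, "item": "Modem", "sales": 90},
]

def slice_cube(rows, dimension, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[dimension] == value]

def dice_cube(rows, predicate):
    """Dice: keep only the rows that satisfy conditions on several fields."""
    return [r for r in rows if predicate(r)]

# slice(Year="2017"): everything about 2016 is omitted.
sliced = slice_cube(records, "year", 2017)

# dice(Year="2017" and No. of sales > 100): both conditions must hold.
diced = dice_cube(records, lambda r: r["year"] == 2017 and r["sales"] > 100)
```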
OLAP OPERATIONS - PIVOTING
Pivoting has no effect on the data but changes the
way dimensions are shown. Considering our example, we
will perform a pivoting per year. Therefore the "year"
dimension will be our first column in our table, as shown in
Figure
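A pivot can be sketched as swapping the row and column dimensions of a nested table. The table contents and the function name are assumptions for illustration. The cell values are untouched, and pivoting twice gives back the original layout.

```python
# Hypothetical table: rows keyed by item, columns keyed by year.
table = {
    "Mobile": {2016: 150, 2017: 180},
    "Modem":  {2016: 60,  2017: 90},
}

def pivot(tbl):
    """Swap the row and column dimensions; cell values are unchanged."""
    pivoted = {}
    for row_key, columns in tbl.items():
        for col_key, value in columns.items():
            pivoted.setdefault(col_key, {})[row_key] = value
    return pivoted

by_year = pivot(table)   # "year" now leads, as in the pivot described above
```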
OLAP OPERATIONS - RANK
The "rank" operation is another function that doesn't perform
any change in the data. However, the elements are ordered by
a given dimension. Let's consider the following function:
rank(No. of sales). The result of such an operation is depicted in
Figure
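The rank operation amounts to ordering the records by a chosen field without modifying them. A minimal sketch, with hypothetical records and a hypothetical rank helper:

```python
# Hypothetical records; "sales" stands for the number of sales.
records = [
    {"item": "Modem", "sales": 60},
    {"item": "Mobile", "sales": 180},
    {"item": "Router", "sales": 90},
]

def rank(rows, field, descending=True):
    """Return a new list ordered by the given field; the input is not modified."""
    return sorted(rows, key=lambda r: r[field], reverse=descending)

ranked = rank(records, "sales")
```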
OLAP - ARCHITECTURE
The most common architectures for OLAP are: (i) ROLAP; (ii) MOLAP;
and (iii) HOLAP. A comparative analysis of these architectures
is given in this figure.
There are also other, less common OLAP architectures. Among them, we
highlight DOLAP, JOLAP and SOLAP.
The DOLAP architecture is a desktop OLAP architecture, that is, it is a tool for users who
have a copy of the multidimensional database or a subset of it, or who want to access a
central data repository locally. The user accesses this repository, triggers an SQL
statement, accesses the existing cubes in the multidimensional database residing on
the OLAP server, and retrieves one to be analyzed on the workstation. The advantage of this
architecture is that it reduces the overhead on the database server, since all OLAP processing
happens on the client machine. The disadvantage is that the micro-cube
cannot be very large; otherwise the analysis can be time-consuming and the client machine may not
support it.
JOLAP is a Java API for OLAP, and SOLAP is the application of OLAP for geographic
information systems.
OLAP - ARCHITECTURE - DOLAP
OLAP - ARCHITECTURE - MOLAP
In the MOLAP architecture the data is stored in a multidimensional
database, where the MOLAP server operates and the user works, mounts
and manipulates the different data on the server. Data from a
multidimensional database is stored in a space smaller than that used to
store the same data in a relational database. In the multidimensional
database, data are kept in array data structures in order to provide better
performance when accessing them. In addition to being a fast architecture,
another advantage is the rich and complex set of analysis functions present
in multidimensional databases.
One of its limitations is the possibility of the data being sparse (not every intersection of
dimensions contains data), which leads to the so-called data storage explosion, that is, a huge
multidimensional database containing little stored data. Other limitations of this tool are
related to the fact that multidimensional databases are proprietary systems that do not
follow standards, that is, each developer creates their own structure for the database and the
support tools themselves.
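The sparsity problem can be made concrete with a small calculation. The cube shape and the populated cells below are assumptions chosen for illustration. The sketch compares the number of slots a dense array would reserve with the number of cells that actually hold data, and shows a sparse (dictionary-based) lookup as one possible alternative.

```python
# Hypothetical cube: 100 products x 50 stores x 365 days, only 3 cells populated.
shape = (100, 50, 365)
cells = {
    (0, 0, 0): 12.0,
    (5, 10, 200): 7.5,
    (99, 49, 364): 3.0,
}

dense_slots = shape[0] * shape[1] * shape[2]   # slots a dense array would reserve
density = len(cells) / dense_slots             # fraction of slots that hold data

def lookup(coords):
    """Sparse lookup: cells that were never stored are treated as empty (0.0)."""
    return cells.get(coords, 0.0)
```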
OLAP - ARCHITECTURE - ROLAP
The main advantages of ROLAP include:
o High scalability - using the ROLAP architecture, there is no restriction on the quantity
of data to be analyzed; the only limit is that of the relational database used;
o Takes advantage of the inherent functionality of the relational database - many
relational databases already come with a number of features and the ROLAP architecture
can leverage these features.
costs.
OLAP - ARCHITECTURE - HOLAP
VIRTUAL CUBES
A virtual cube is a logical view of parts of one or more cubes, in which dimensions and
measurements are selected from the original cubes and included in the virtual cube.
Virtual cubes are often likened to views in a relational database. A virtual cube merges
portions of two existing cubes so that a combination of dimensions and measures can be
analyzed through the single, virtual cube. For example, a retailer may have two cubes:
one that stores the number of visitors to its website and another that stores purchases. A
virtual cube could be used to correlate the data from both cubes to calculate the average
sales per website visit (Search Data Management).
Virtual cubes can also be used to prevent unauthorized users from viewing private or
sensitive information. For example, if a cube has both sensitive and non-sensitive
information, the non-sensitive information can be made available in a virtual cube for
those users who need it. The sensitive data, meanwhile, remains in the existing cube
where it is accessed by authorized users.
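The retailer example above can be sketched as a view computed on demand from two cubes. The cube contents, the month keys, and the function name are assumptions for illustration. Nothing is copied into the virtual cube, in the same way that a relational view stores no data of its own.

```python
# Two hypothetical physical cubes, both keyed by month.
visits_cube = {"2017-01": 1000, "2017-02": 1250}          # website visits
purchases_cube = {"2017-01": 5000.0, "2017-02": 7500.0}   # purchase value

def virtual_cube():
    """A logical view over both cubes, recomputed each time it is read."""
    shared_months = visits_cube.keys() & purchases_cube.keys()
    return {
        month: {
            "visits": visits_cube[month],
            "sales": purchases_cube[month],
            "sales_per_visit": purchases_cube[month] / visits_cube[month],
        }
        for month in shared_months
    }

view = virtual_cube()
```

Exposing only selected measures in such a view is also one way to keep sensitive measures out of reach of users who query the virtual cube alone.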
VIRTUAL CUBES OFFER THE FOLLOWING BENEFITS:
Storage and performance can be optimized on
a case-by-case basis. In this way, it becomes
possible to maintain the best design approach
for each individual cube;
PARTITIONING
Partitioning is done to improve performance and make data management easier.
Partitioning also helps balance the various system requirements. It optimizes hardware
performance and simplifies data warehouse management by dividing each fact table into
several separate partitions.
Partitioning can be done for the following reasons:
o For easy management - the fact table in a data warehouse can grow up to hundreds of
gigabytes in size. This huge size of the fact table is very hard to manage as a single entity.
Therefore, it needs partitioning;
o To assist backup/recovery - If we do not partition the fact table, then we have to load
the complete fact table with all the data. Partitioning allows us to load only as much data
as is required on a regular basis. It reduces the time to load and also enhances the
performance of the system;
o To enhance performance - by partitioning the fact table into sets of data, the query
procedures can be enhanced. Query performance is enhanced because now the query
scans only those partitions that are relevant. It does not have to scan the whole data.
There are generally three types of partitioning: (i) horizontal; (ii) vertical; and (iii)
hardware.
In horizontal partitioning the fact table is partitioned after the first few thousand
entries. This is because in most cases, not all the information in the fact table is needed all
the time. Therefore, horizontal partitioning helps to reduce the query access time by
directly cutting down the amount of data to be scanned by the queries. Horizontally
partitioning the fact table is a good way to speed up queries by minimizing the set of data to
be scanned (without using an index).
Different strategies can be used for horizontal partitioning. Among them we highlight:
o Partitioning by time, which typically leads to segments of different sizes;
o Partitioning by geographical location, which typically leads to very asymmetrically sized segments;
o Partitioning by size of table, which typically implies that there are tables that will never be partitioned;
o Using round robin partitions, which is typically more difficult to manage.
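Partitioning by time, the first strategy in the list above, can be sketched as follows. The fact rows and function names are assumptions for illustration. Note how the monthly partitions end up with different sizes, and how a query for one month scans only that partition.

```python
from collections import defaultdict

# Hypothetical fact rows: (date as "YYYY-MM-DD", sales amount).
facts = [
    ("2016-11-03", 40.0), ("2016-12-24", 90.0),
    ("2017-01-15", 70.0), ("2017-01-20", 30.0), ("2017-02-02", 55.0),
]

def partition_by_month(rows):
    """Horizontal partitioning: each row goes to the partition for its month."""
    parts = defaultdict(list)
    for date, amount in rows:
        parts[date[:7]].append((date, amount))   # "YYYY-MM" is the partition key
    return dict(parts)

partitions = partition_by_month(facts)

def total_sales(parts, month):
    """Scan only the partition for the requested month, not the whole table."""
    return sum(amount for _date, amount in parts.get(month, []))
```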
Vertical partitioning splits the data vertically. Vertical partitioning can be performed using
a normalization or a row splitting technique. The following figure depicts how vertical
partitioning is done.
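Row splitting can be sketched as dividing the columns of each row between two tables that share the same key. The column names and the choice of which columns count as frequently used are assumptions for illustration.

```python
# Hypothetical wide fact rows.
fact_rows = [
    {"sale_id": 1, "amount": 120.0, "quantity": 2, "comment": "gift wrap requested"},
    {"sale_id": 2, "amount": 75.0, "quantity": 1, "comment": ""},
]

def split_columns(rows, key, hot_columns):
    """Split each row into a frequently used part and a rarely used part.

    Both parts keep the key column, so the original row can be rebuilt by
    matching the two tables one-to-one on that key.
    """
    hot, cold = [], []
    for row in rows:
        hot.append({c: row[c] for c in (key, *hot_columns)})
        cold.append({c: v for c, v in row.items() if c == key or c not in hot_columns})
    return hot, cold

hot, cold = split_columns(fact_rows, "sale_id", ("amount", "quantity"))
```

Queries that only need the amount and quantity columns can then read the narrower table.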
Another possibility is to use hardware partitioning. The idea is to optimize the database
by respecting the specific hardware architecture. The exact details of optimization
depend on the hardware platforms. However, some guidelines can be used:
o Maximize the available processing power;
o Minimize disk accesses and I/O operations;
o Reduce bottlenecks in CPU and I/O throughput.
CONCLUSION
DWs are special data management facilities intended for creating
reports and analysis to support managerial decision making. They
are designed to make reporting and querying simple and efficient.
The sources of data are operational systems and external data
sources. DW needs to be updated with new data regularly to keep it
useful. Data from DW provides a useful input for data mining
activities.
HOMEWORK
THANK YOU
Dr. Nasim AbdulWahab Matar
Head of E-Business and MIS Department @ University of Petra
nmatar@uop.edu.jo
EXT: 9400