Unit I
Evolution of Decision Support Systems – Data Warehousing Components – Building a Data Warehouse,
Data Warehouse and DBMS, Data Marts, Meta Data, Multidimensional Data Model, OLAP vs OLTP,
OLAP operations, Data cubes, Schemas for Multidimensional Database: Stars, Snowflakes and Fact
constellations.
*****************************************************************************
Data Warehouse :
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than
transaction processing. It includes historical data derived from transaction data from single and
multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing
support for decision-makers for data modeling and analysis.
A Data Warehouse is a group of data specific to the entire organization, not only to a particular group
of users.
It is not used for daily operations and transaction processing but used for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
It is a database designed for investigative tasks, using data from various applications.
It supports a relatively small number of clients with relatively long interactions.
It includes current and historical data to provide a historical perspective of information.
Its usage is read-intensive.
It contains a few large tables.
Characteristics of Data Warehouse :
1. Subject-Oriented :
A data warehouse targets the modeling and analysis of data for decision-makers. Therefore, data
warehouses typically provide a concise and straightforward view of a particular subject, such as
customer, product, or sales, instead of the organization's ongoing operations as a whole. This is done by
excluding data that are not useful for the subject and including all data needed by the users to understand
the subject.
2. Integrated :
A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat files, and online
transaction records. It requires performing data cleaning and integration during data warehousing to ensure
consistency in naming conventions, attribute types, etc., among different data sources.
3. Time-Variant :
Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6
months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where
often only the most current data is kept.
4. Non-Volatile :
The data warehouse is a physically separate data store, into which data is transformed from the source
operational RDBMS. Operational updates of data do not occur in the data warehouse, i.e., update, insert, and
delete operations are not performed. It usually requires only two procedures for data access: the initial loading
of data and read access to the data. Therefore, the DW does not require transaction processing, recovery, or
concurrency-control capabilities, which allows for a substantial speedup of data retrieval. Non-volatile means
that, once entered into the warehouse, the data should not change.
History of Data Warehouse :
The idea of data warehousing dates back to the late 1980s, when IBM researchers Barry Devlin and Paul
Murphy established the "business data warehouse."
In essence, the data warehousing idea was intended to support an architectural model for the flow of
information from operational systems to decision support environments. The concept attempted to
address the various problems associated with this flow, mainly the high costs associated with it.
In the absence of a data warehousing architecture, a vast amount of space was required to support
multiple decision support environments. In large corporations, it was common for various decision
support environments to operate independently.
Need for a Data Warehouse :
1) Business users: Business users require a data warehouse to view summarized data from the past. Since
these people are non-technical, the data may be presented to them in an elementary form.
2) Store historical data: A data warehouse is required to store time-variant data from the past. This
input is used for various purposes.
3) Make strategic decisions: Some strategies may depend upon the data in the data warehouse. So,
the data warehouse contributes to making strategic decisions.
4) For data consistency and quality: By bringing data from different sources to a common place, the user
can effectively bring uniformity and consistency to the data.
5) Fast response time: The data warehouse has to be ready for somewhat unexpected loads and types of
queries, which demands a significant degree of flexibility and fast response time.
Data Warehousing Components :
The figure shows the essential elements of a typical warehouse. The Source Data component is shown on
the left. The Data Staging element serves as the next building block. In the middle is the Data Storage
component that manages the data warehouse's data. This element not only stores and manages the data; it also
keeps track of the data using the metadata repository. The Information Delivery component, shown on the right,
consists of all the different ways of making the information from the data warehouse available to the users.
Production Data: This type of data comes from the different operational systems of the enterprise. Based on
the data requirements of the data warehouse, we choose segments of the data from the various operational
systems.
Internal Data: In each organization, users keep their "private" spreadsheets, reports, customer profiles,
and sometimes even departmental databases. This is the internal data, part of which could be useful in a data
warehouse.
Archived Data: Operational systems are mainly intended to run the current business. In every operational
system, we periodically take the old data and store it in archived files.
External Data: Most executives depend on information from external sources for a large percentage of the
information they use. They use statistics relating to their industry produced by external agencies.
Data Staging Component :
After we have extracted data from the various operational systems and from external sources, we have to
prepare it for storage in the data warehouse. The extracted data coming from several different
sources must be changed, converted, and made ready in a format that is suitable for
querying and analysis.
We will now discuss the three primary functions that take place in the staging area.
1) Data Extraction: This method has to deal with numerous data sources. We have to employ the appropriate
techniques for each data source.
2) Data Transformation: As we know, data for a data warehouse comes from many different sources. If data
extraction for a data warehouse poses big challenges, data transformation presents even more significant
challenges. We perform several individual tasks as part of data transformation.
First, we clean the data extracted from each source. Cleaning may be the correction of misspellings or may
deal with providing default values for missing data elements, or elimination of duplicates when we bring in
the same data from various source systems.
Standardization of data elements forms a large part of data transformation. Data transformation involves
many forms of combining pieces of data from different sources. We combine data from a single source record
or related data parts from many source records.
On the other hand, data transformation also involves purging source data that is not useful and separating
out source records into new combinations. Sorting and merging of data take place on a large scale in the data
staging area. When the data transformation function ends, we have a collection of integrated data that is
cleaned, standardized, and summarized.
3) Data Loading: Two distinct categories of tasks form data loading functions. When we complete the
structure and construction of the data warehouse and go live for the first time, we do the initial loading of the
information into the data warehouse storage. The initial load moves high volumes of data using up a
substantial amount of time.
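As a rough illustration of these three staging functions, the following Python sketch extracts records from two hypothetical sources, standardizes field names and types, supplies defaults, de-duplicates, and performs an initial load. All field names, mappings, and cleaning rules are invented for the example, not taken from any particular product.

```python
# Hypothetical source records; field names and cleaning rules are assumptions.
source_a = [{"cust": "A. Smith", "city": "NY", "amount": "100"}]
source_b = [{"customer_name": "A. Smith", "city": None, "amount": 100}]

def transform(record, mapping, default_city="UNKNOWN"):
    """Standardize field names, supply defaults for missing data, unify types."""
    out = {mapping.get(k, k): v for k, v in record.items()}
    out["city"] = out.get("city") or default_city   # default value for missing data
    out["amount"] = float(out["amount"])            # consistent attribute type
    return out

# Extraction + transformation: each source gets its own field-name mapping.
rows = [transform(r, {"cust": "customer_name"}) for r in source_a]
rows += [transform(r, {}) for r in source_b]

# Elimination of duplicates when the same data arrives from two sources.
unique = {}
for r in rows:
    unique.setdefault((r["customer_name"], r["amount"]), r)  # keep first copy

# Initial load: move the integrated, cleaned rows into the warehouse store.
warehouse = list(unique.values())
print(warehouse)
```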
Data Storage Components
Data storage for a data warehouse is a separate repository. The data repositories for operational systems
generally include only the current data. Also, these repositories hold the data in highly
normalized structures for fast and efficient processing.
Information Delivery Component
The information delivery element is used to enable the process of subscribing to data warehouse data and
having it transferred to one or more destinations according to some customer-specified scheduling algorithm.
Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database management
system. In the data dictionary, we keep the data about the logical data structures, the data about the records
and addresses, the information about the indexes, and so on.
Data Marts
A data mart includes a subset of corporate-wide data that is of value to a specific group of users. The scope is
confined to particular selected subjects. Data in a data warehouse should be fairly current, but not necessarily
up to the minute, although developments in the data warehouse industry have made standard and incremental
data loads more achievable. Data marts are smaller than data warehouses and usually contain data for a single
part of the organization. The current trend in data warehousing is to develop a data warehouse with several
smaller related data marts for particular kinds of queries and reports.
Management and Control Component
The management and control elements coordinate the services and functions within the data warehouse.
These components control the data transformation and the data transfer into the data warehouse storage. On
the other hand, they moderate the data delivery to the clients, work with the database management systems,
and authorize data to be correctly saved in the repositories. They monitor the movement of data into the
staging area and from there into the data warehouse storage itself.
Why we need a separate Data Warehouse?
1. Data warehouse queries are complex because they involve the computation of large groups of data at
summarized levels.
2. It may require the use of distinctive data organization, access, and implementation methods based on
multidimensional views.
3. Performing OLAP queries against an operational database degrades the performance of operational tasks.
4. A data warehouse is used for analysis and decision making, which requires an extensive database,
including historical data, that an operational database does not typically maintain.
5. The separation of an operational database from data warehouses is based on the different structures
and uses of data in these systems.
6. Because the two systems provide different functionalities and require different kinds of data, it is
necessary to maintain separate databases.
Difference between Database and Data Warehouse
1. Database: It is used for Online Transaction Processing (OLTP), but can be used for other objectives such
as data warehousing; it records current data from the clients. Data Warehouse: It is used for Online
Analytical Processing (OLAP); it reads historical information about the customers for business decisions.
2. Database: The tables and joins are complicated since they are normalized for the RDBMS; this is done to
reduce redundant data and to save storage space. Data Warehouse: The tables and joins are simple since
they are de-normalized; this is done to minimize the response time for analytical queries.
3. Database: Entity-relationship modeling techniques are used for database design. Data Warehouse:
Dimensional data modeling techniques are used for the data warehouse design.
4. Database: The database is the place where the data is taken as a base and managed to provide fast and
efficient access. Data Warehouse: The data warehouse is the place where the application data is handled
for analysis and reporting objectives.
Organizations embarking on data warehousing development can choose one of two approaches:
Top-down approach: The organization has developed an enterprise data model, collected
enterprise-wide business requirements, and decided to build an enterprise data warehouse with subset data
marts.
Bottom-up approach: Business priorities resulted in developing individual data marts,
which are then integrated into the enterprise data warehouse.
Organizational Issues
The requirements and environments associated with the informational applications of a data warehouse are
different. Therefore, an organization will need to employ different development practices than the ones it
uses for operational applications.
Design Considerations
In general, a data warehouse's design point is to consolidate data from multiple, often heterogeneous,
sources into a queryable database. The main factors include:
Heterogeneity of data sources, which affects data conversion, quality, and timeliness
Use of historical data, which implies that data may be "old"
Tendency of the database to grow very large
Data Content: Typically, a data warehouse may contain detailed data, but the data is cleaned up and
transformed to fit the warehouse model, and certain transactional attributes of the data are filtered out. The
content and the structure of the data warehouse are reflected in its data model. The data model is a template
for how information will be organized within the integrated data warehouse framework.
Metadata: Defines the contents and location of the data in the warehouse, the relationships between the
operational databases and the data warehouse, and the business view of the warehouse data that is accessible
by end-user tools. The warehouse design should prevent any direct access to the warehouse data that does not
use the metadata definitions to gain access.
Data distribution: As data volumes continue to grow, the database size may rapidly outgrow a single
server. Therefore, it becomes necessary to decide how the data should be divided across multiple servers. The
data placement and distribution design should consider several options, including data distribution by subject
area, location, or time.
Tools: Data warehouse designers have to be careful not to sacrifice the overall design to fit a specific tool.
Selected tools must be compatible with the given data warehousing environment and with each other.
Performance consideration: Rapid query processing is a highly desired feature that should be designed into
the data warehouse.
Technical Considerations
A number of technical issues are to be considered when designing and implementing a data warehouse
environment. These issues include:
The hardware platform that would house the data warehouse
The database management system that supports the warehouse database
The communication infrastructure that connects the warehouse, data marts, operational systems, and end users
The hardware platform and software to support the metadata repository
The systems management framework that enables centralized management and administration of the entire
environment
Implementation Considerations
A data warehouse cannot simply be bought and installed; its implementation requires the integration of many
products.
A data warehouse is an environment, not a product. It is based on a relational database management system
that functions as the central repository for informational data.
The central information repository is surrounded by a number of key components designed to make the
environment functional, manageable, and accessible.
The data source for the data warehouse comes from operational applications. The data entered into the data
warehouse is transformed into an integrated structure and format. The transformation process involves
conversion, summarization, filtering, and condensation. The data warehouse must be capable of holding and
managing large volumes of data as well as different data structures over time.
There are three DBMS software architecture styles for parallel processing:
• Shared memory or shared everything Architecture
• Shared disk architecture
• Shared nothing architecture
Shared Memory Architecture:
Tightly coupled shared memory systems, illustrated in the following figure, have the following characteristics:
• Multiple PUs share memory.
• Each PU has full access to all shared memory through a common bus.
• Communication between nodes occurs via shared memory.
• Performance is limited by the bandwidth of the memory bus.
• It is simple to implement and provides a single system image; implementing an RDBMS on an
SMP (symmetric multiprocessor) is a common example.
A disadvantage of shared memory systems for parallel processing is as follows:
• Scalability is limited by bus bandwidth and latency, and by available memory.
Shared Disk Architecture
Shared disk systems are typically loosely coupled. Such systems, illustrated in the following figure, have the
following characteristics:
• Each node consists of one or more PUs and associated memory.
• Memory is not shared between nodes.
• Communication occurs over a common high-speed bus.
• Each node has access to the same disks and other resources.
• A node can be an SMP if the hardware supports it.
• Bandwidth of the high-speed bus limits the number of nodes (scalability) of the system.
• The Distributed Lock Manager (DLM) is required.
Parallel processing advantages of shared disk systems are as follows:
• Shared disk systems permit high availability. All data is accessible even if one node dies.
• These systems have the concept of one database, which is an advantage over shared nothing systems.
• Shared disk systems provide for incremental growth.
Parallel processing disadvantages of shared disk systems are these:
• Inter-node synchronization is required, involving DLM overhead and greater dependency on high-
speed interconnect.
• If the workload is not partitioned well, there may be high synchronization overhead.
Shared Nothing Architecture
• Shared nothing systems are typically loosely coupled. In shared nothing systems, only one CPU is
connected to a given disk; if a table or database is located on that disk, only that CPU can access it
directly. Shared nothing systems are concerned with access to disks, not access to memory.
• Adding more PUs and disks can improve scale up.
• Shared nothing systems have advantages and disadvantages for parallel processing:
Advantages
• Shared nothing systems provide for incremental growth.
• System growth is practically unlimited.
• MPPs are good for read-only databases and decision support applications.
• Failure is local: if one node fails, the others stay up.
Disadvantages
• More coordination is required.
• More overhead is required for a process working on a disk belonging to another node.
• If there is a heavy workload of updates or inserts, as in an online transaction processing system, it may
be worthwhile to consider data-dependent routing to alleviate contention (see the sketch after the
following list).
To take full advantage of the shared nothing architecture, the DBMS must meet certain requirements.
These requirements include:
• Support for function shipping
• Parallel join strategies
• Support for data repartitioning
• Query compilation
• Support for database transactions
• Support for the single system image of the database environment
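The following Python sketch illustrates two of these ideas, data-dependent routing and function shipping, on an in-memory stand-in for a shared-nothing system. The node count, hash placement scheme, and query are assumptions chosen only for illustration.

```python
# Shared-nothing sketch: each "node" owns its own private store (its disk).
NUM_NODES = 3
nodes = [dict() for _ in range(NUM_NODES)]

def owner(key):
    """Data-dependent routing: the key alone decides which node stores the row."""
    return key % NUM_NODES

# Partitioned inserts: every row lands on exactly one node's disk.
for cust_id, amount in [(1, 50), (2, 75), (3, 20), (4, 75)]:
    nodes[owner(cust_id)][cust_id] = amount

# Function shipping: each node computes its partial aggregate locally, and
# only the small partial results travel to be combined at the coordinator.
partials = [sum(store.values()) for store in nodes]
print("total sales:", sum(partials))   # -> total sales: 220
```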
Combined architecture
• Interserver parallelism: each query is parallelized across multiple servers
• Intraserver parallelism: the query is parallelized within a server
• The combined architecture supports interserver parallelism of distributed-memory MPPs and clusters,
and intraserver parallelism of SMP nodes
Parallel DBMS features
• Scope and techniques of parallel DBMS operations
• Optimizer implementation
• Application transparency
• Parallel environment, which allows the DBMS server to take full advantage of the existing facilities
at a very low level
• DBMS management tools, which help to configure, tune, administer, and monitor a parallel RDBMS as
effectively as if it were a serial RDBMS
• Price/performance: the parallel RDBMS can demonstrate a near-linear speed-up and scale-up at
reasonable cost
Alternative technologies
Alternative technologies for improving performance in a data warehouse environment include:
• Advanced database indexing products
• Multidimensional databases
• Specialized RDBMSs
• Advanced indexing techniques (refer to Sybase IQ)
Data Mart
A data mart is a subset of an organizational data store, generally oriented to a specific purpose or primary
data subject, which may be distributed to support business needs. Data marts are analytical record stores
designed to focus on particular business functions for a specific community within an organization. Data
marts are derived from subsets of data in a data warehouse, though in the bottom-up data warehouse design
methodology the data warehouse is created from the union of organizational data marts.
A dependent data mart is a logical or physical subset of a larger data warehouse. According to this
technique, the data marts are treated as subsets of a data warehouse. In this technique, first a data
warehouse is created, from which further data marts can be created. These data marts depend on
the data warehouse and extract the essential records from it. In this technique, as the data warehouse creates
the data marts, there is no need for data mart integration. It is also known as a top-down approach.
The second approach is independent data marts (IDM). Here, first the independent data marts are created, and
then a data warehouse is designed using these multiple independent data marts. In this approach, as all the
data marts are designed independently, the integration of data marts is required. It is also termed a
bottom-up approach, as the data marts are integrated to develop a data warehouse.
Other than these two categories, one more type exists, called "hybrid data marts."
A hybrid data mart allows us to combine input from sources other than a data warehouse. This could be
helpful in many situations, especially when ad hoc integrations are needed, such as after a new group or
product is added to the organization.
The significant steps in implementing a data mart are to design the schema, construct the physical storage,
populate the data mart with data from source systems, access it to make informed decisions and manage it
over time. So, the steps are:
Designing
The design step is the first in the data mart process. This phase covers all of the functions from initiating the
request for a data mart through gathering data about the requirements and developing the logical and physical
design of the data mart.
It involves the following tasks:
Gathering the business and technical requirements
Identifying data sources
Selecting the appropriate subset of data
Designing the logical and physical architecture of the data mart.
Constructing
This step contains creating the physical database and logical structures associated with the data mart to
provide fast and efficient access to the data.
It involves the following tasks:
Creating the physical database and logical structures, such as tablespaces, associated with the data mart.
Creating the schema objects, such as tables and indexes, described in the design step.
Determining how best to set up the tables and access structures.
Populating
This step includes all of the tasks related to getting data from the source, cleaning it up, modifying it to the
right format and level of detail, and moving it into the data mart.
It involves the following tasks:
Mapping data sources to target data structures
Extracting data
Cleansing and transforming the information.
Loading data into the data mart
Creating and storing metadata
Accessing
This step involves putting the data to use: querying the data, analyzing it, creating reports, charts and graphs
and publishing them.
It involves the following tasks:
Set up an intermediate layer (meta layer) for the front-end tool to use. This layer translates database
operations and object names into business terms so that end users can interact with the data
mart using words that relate to the business functions.
Set up and manage database structures, like summarized tables, that help queries submitted through
the front-end tools execute rapidly and efficiently.
Managing
This step contains managing the data mart over its lifetime. In this step, management functions are performed,
such as providing secure access to the data, managing data growth, optimizing the system for better
performance, and ensuring the availability of data.
Difference between Data Warehouse and Data Mart
A data warehouse may hold multiple subject areas; a data mart holds only one subject area, for example,
Finance or Sales.
A data warehouse holds very detailed information; a data mart may hold more summarized data.
In data warehousing, the fact constellation schema is used; in a data mart, the star schema and
snowflake schema are used.
Metadata Repository
The metadata itself is housed in and controlled by the metadata repository. Metadata repository management
software can be used to map the source data to the target database, integrate and transform the
data, generate code for data transformation, and move data to the warehouse.
Benefits of Metadata Repository
1. It provides a set of tools for enterprise-wide metadata management.
2. It eliminates and reduces inconsistency, redundancy, and underutilization.
3. It improves organization control, simplifies management, and accounting of information assets.
4. It increases coordination, understanding, identification, and utilization of information assets.
5. It enforces CASE development standards with the ability to share and reuse metadata.
6. It leverages investment in legacy systems and utilizes existing applications.
7. It provides a relational model for heterogeneous RDBMS to share information.
OLAP vs OLTP
Source of data: OLTP systems hold operational data; they are the original source of the data. OLAP data is
consolidated; it comes from the various OLTP databases.
Purpose of data: OLTP is used to control and run fundamental business tasks. OLAP is used to help with
planning, problem solving, and decision support.
What the data reveals: OLTP reveals a snapshot of ongoing business processes. OLAP provides
multi-dimensional views of various kinds of business activities.
Inserts and updates: In OLTP, short and fast inserts and updates are initiated by end users. In OLAP, periodic
long-running batch jobs refresh the data.
Queries: OLTP queries are relatively standardized and simple, returning relatively few records. OLAP
queries are often complex and involve aggregations.
Processing speed: OLTP is typically very fast. OLAP speed depends on the amount of data involved; batch
data refreshes and complex queries may take many hours, and query speed can be improved by creating
indexes.
Space requirements: OLTP databases can be relatively small if historical data is archived. OLAP databases
are larger due to the existence of aggregation structures and historical data, and require more indexes than
OLTP.
Database design: OLTP databases are highly normalized with many tables. OLAP databases are typically
de-normalized with fewer tables and use star and/or snowflake schemas.
Backup and recovery: OLTP data must be backed up religiously; operational data is critical to run the
business, and data loss is likely to entail significant monetary loss and legal liability. For OLAP, instead of
regular backups, some environments may consider simply reloading the OLTP data as a recovery method.
Multidimensional Data Model and OLAP Operations
In the multidimensional model, the records are organized into various dimensions, and each dimension
includes multiple levels of abstraction described by concept hierarchies. This organization provides users with
the flexibility to view data from various perspectives. A number of OLAP data cube operations exist to
materialize these different views, allowing interactive querying and searching of the records at hand. Hence,
OLAP supports a user-friendly environment for interactive data analysis.
Consider the OLAP operations that are to be performed on multidimensional data. The figure shows a data
cube for the sales of a shop. The cube contains three dimensions, location, time, and item, where location is
aggregated with respect to city values, time is aggregated with respect to quarters, and item is aggregated
with respect to item types.
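To make the idea concrete, here is a minimal Python sketch of such a cube, stored sparsely as a dictionary keyed by one value per dimension. The cities, quarters, item types, and sales figures are invented for illustration.

```python
# A tiny sales cube over (location, quarter, item type); figures are invented.
cube = {
    ("Toronto",   "Q1", "Mobile"): 605,
    ("Toronto",   "Q1", "Modem"):  825,
    ("Vancouver", "Q1", "Mobile"): 1087,
    ("Vancouver", "Q2", "Modem"):  968,
}

# Each cell is addressed by its coordinate on every dimension.
print(cube[("Toronto", "Q1", "Mobile")])   # -> 605
```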
Roll-Up
The roll-up operation (also known as the drill-up or aggregation operation) performs aggregation on a data
cube, either by climbing up a concept hierarchy or by dimension reduction. Roll-up is like zooming out on the
data cube. The figure shows the result of a roll-up operation performed on the dimension location. The
hierarchy for location is defined as the order street < city < province or state < country. The roll-up operation
aggregates the data by ascending the location hierarchy from the level of the city to the level of the country.
When a roll-up is performed by dimension reduction, one or more dimensions are removed from the cube.
For example, consider a sales data cube having the two dimensions location and time. Roll-up may be
performed by removing the time dimension, resulting in an aggregation of the total sales by location rather
than by location and by time.
Example
Consider the following cubes illustrating temperature of certain days recorded weekly:
Temperature: 64 65 68 69 70 71 72 75 80 81 83 85
Week 1:      1  0  1  0  1  0  0  0  0  0  1  0
Week 2:      0  0  0  1  0  0  1  2  0  1  0  0
Consider that we want to set up levels (hot (80-85), mild (70-75), cool (64-69)) in temperature from the above
cube.
To do this, we have to group the columns and add up the values according to the concept hierarchy. This
operation is known as a roll-up.
By doing this, we obtain the following cube:
Temperature: cool mild hot
Week 1:      2    1    1
Week 2:      1    3    1
The roll-up operation groups the information by levels of temperature.
The following diagram illustrates how roll-up works.
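The roll-up can also be expressed as a short Python sketch over the temperature example above: each raw temperature is re-keyed to its level in the concept hierarchy and the counts are added up. The level boundaries follow the example; the code itself is only an illustration.

```python
# The weekly temperature cube from the example above.
cube = {
    "Week 1": {64: 1, 68: 1, 70: 1, 83: 1},
    "Week 2": {69: 1, 72: 1, 75: 2, 81: 1},
}

def level(t):
    """Concept hierarchy: raw temperature -> cool (64-69), mild (70-75), hot (80-85)."""
    if t <= 69:
        return "cool"
    if t <= 75:
        return "mild"
    return "hot"

# Roll-up: re-key every cell by its level and add the counts together.
rolled = {}
for week, cells in cube.items():
    agg = {}
    for t, count in cells.items():
        agg[level(t)] = agg.get(level(t), 0) + count
    rolled[week] = agg

print(rolled)
# {'Week 1': {'cool': 2, 'mild': 1, 'hot': 1}, 'Week 2': {'cool': 1, 'mild': 3, 'hot': 1}}
```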
Drill-Down
The drill-down operation (also called roll-down) is the reverse of roll-up. Drill-down is like zooming in on
the data cube. It navigates from less detailed data to more detailed data. Drill-down can be performed by
either stepping down a concept hierarchy for a dimension or adding additional dimensions.
The figure shows a drill-down operation performed on the dimension time by stepping down a concept
hierarchy that is defined as day < month < quarter < year. Drill-down occurs by descending the time
hierarchy from the level of the quarter to the more detailed level of the month.
Because a drill-down adds more detail to the given data, it can also be performed by adding a new dimension
to a cube. For example, a drill-down on the central cube of the figure can occur by introducing an additional
dimension, such as a customer group.
Example
Drill-down adds more details to the given data:
Temperature: cool mild hot
Day 1:  0 0 0
Day 2:  0 0 0
Day 3:  0 0 1
Day 4:  0 1 0
Day 5:  1 0 0
Day 6:  0 0 0
Day 7:  1 0 0
Day 8:  0 0 0
Day 9:  1 0 0
Day 10: 0 1 0
Day 11: 0 1 0
Day 12: 0 1 0
Day 13: 0 0 1
Day 14: 0 0 0
The following diagram illustrates how Drill-down works.
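A minimal Python sketch of drill-down, assuming the base data is kept at the day level as in the table above: the weekly view and the daily view are just two aggregations of the same facts at different granularities.

```python
# One temperature reading per day, mirroring the drill-down table above;
# days with an all-zero row simply have no reading.
base = [(3, "hot"), (4, "mild"), (5, "cool"), (7, "cool"),
        (9, "cool"), (10, "mild"), (11, "mild"), (12, "mild"), (13, "hot")]

def aggregate(group_key):
    """Aggregate the day-level facts at the granularity chosen by group_key."""
    out = {}
    for day, lvl in base:
        k = group_key(day)
        out.setdefault(k, {})
        out[k][lvl] = out[k].get(lvl, 0) + 1
    return out

weekly = aggregate(lambda d: "Week 1" if d <= 7 else "Week 2")  # rolled-up view
daily = aggregate(lambda d: "Day %d" % d)                       # drilled-down view
print(weekly)   # matches the roll-up result above
print(daily)    # the more detailed, drilled-down view
```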
Slice
A slice is a subset of the cube corresponding to a single value for one or more members of a dimension.
For example, a slice operation is executed when the customer wants a selection on one dimension of a three-
dimensional cube, resulting in a two-dimensional slice. So, the slice operation performs a selection on one
dimension of the given cube, thus resulting in a subcube.
For example, if we make the selection temperature = cool, we will obtain the following cube:
Temperature: cool
Day 1:  0
Day 2:  0
Day 3:  0
Day 4:  0
Day 5:  1
Day 6:  0
Day 7:  1
Day 8:  0
Day 9:  1
Day 10: 0
Day 11: 0
Day 12: 0
Day 13: 0
Day 14: 0
The following diagram illustrates how Slice works.
Here, the slice is performed on the dimension "time" using the criterion time = "Q1".
It forms a new sub-cube by selecting one or more dimensions.
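A minimal Python sketch of slice, using cells drawn from the temperature example above: fixing temperature = "cool" drops that dimension and leaves a smaller cube over days.

```python
# Sparse cube over (day, temperature level); cells come from the table above,
# with all-zero cells omitted.
cube = {("Day 3", "hot"): 1, ("Day 4", "mild"): 1, ("Day 5", "cool"): 1,
        ("Day 7", "cool"): 1, ("Day 9", "cool"): 1}

def slice_cube(cube, temperature):
    """Fix one value on the temperature dimension; the result loses that dimension."""
    return {day: n for (day, temp), n in cube.items() if temp == temperature}

print(slice_cube(cube, "cool"))   # -> {'Day 5': 1, 'Day 7': 1, 'Day 9': 1}
```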
Dice
The dice operation defines a subcube by performing a selection on two or more dimensions.
For example, applying the selection (time = day 3 OR time = day 4) AND (temperature = cool OR
temperature = hot) to the original cube, we get the following subcube (still two-dimensional):
Temperature: cool hot
Day 3: 0 1
Day 4: 0 0
Consider the following diagram, which shows the dice operations.
The dice operation on the cube, based on the following selection criteria, involves three dimensions:
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item =" Mobile" or "Modem")
Pivot
The pivot operation is also called rotation. Pivot is a visualization operation that rotates the data axes in
order to provide an alternative presentation of the data. It may involve swapping the rows and columns or
moving one of the row dimensions into the column dimensions.
Consider the following diagram, which shows the pivot operation.
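A minimal Python sketch of pivot on the two-dimensional result above: only the arrangement of the axes changes, not the cell values.

```python
# Rows are days, columns are temperature levels.
table = {"Day 3": {"cool": 0, "hot": 1},
         "Day 4": {"cool": 0, "hot": 0}}

def pivot(table):
    """Rotate the view: rows become columns and columns become rows."""
    out = {}
    for row, cols in table.items():
        for col, val in cols.items():
            out.setdefault(col, {})[row] = val
    return out

print(pivot(table))
# -> {'cool': {'Day 3': 0, 'Day 4': 0}, 'hot': {'Day 3': 1, 'Day 4': 0}}
```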
Star Schema
A star schema is the elementary form of a dimensional model, in which data is organized
into facts and dimensions. A fact is an event that is counted or measured, such as a sale or a login. A dimension
includes reference data about the fact, such as date, item, or customer.
A star schema is a relational schema whose design represents a multidimensional
data model. The star schema is the simplest data warehouse schema. It is known as a star schema because the
entity-relationship diagram of this schema resembles a star, with points diverging from a central table. The
center of the schema consists of a large fact table, and the points of the star are the dimension tables.
Fact Tables
1. A fact table in a star schema contains facts and is connected to the dimension tables. A fact table has two
types of columns: those that contain facts and those that are foreign keys to the dimension tables. The
primary key of the fact table is generally a composite key made up of all of its foreign keys.
2. A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that
contain aggregated facts are often instead called summary tables).
3. A dimension is a structure usually composed of one or more hierarchies that categorize data. If a
dimension does not have hierarchies and levels, it is called a flat dimension or list. The primary keys of
each of the dimension tables are part of the composite primary key of the fact table. Dimensional
attributes help to describe the dimensional value. They are generally descriptive, textual values.
Dimension tables are usually smaller than fact tables.
4. Fact tables store data about sales, while dimension tables store data about the geographic region (markets,
cities), clients, products, times, and channels.
Query Performance
Because a star schema database has a small number of tables and clear join paths, queries run faster than they
do against OLTP systems. Small single-table queries, frequently of a dimension table, are almost instantaneous.
Large join queries that involve multiple tables take only seconds or minutes to run.
In a star schema database design, the dimensions are connected only through the central fact table. When
two dimension tables are used in a query, only one join path, intersecting the fact table, exists between those
two tables. This design feature enforces accurate and consistent query results.
Load performance and administration
Structural simplicity also decreases the time required to load large batches of records into a star schema
database. By defining facts and dimensions and separating them into different tables, the impact of a load
is reduced. Dimension tables can be populated once and occasionally refreshed. We can add new
facts regularly and selectively by appending records to the fact table.
Built-in referential integrity
A star schema has referential integrity built in when information is loaded. Referential integrity is enforced
because each record in a dimension table has a unique primary key, and all keys in the fact table are legitimate
foreign keys drawn from the dimension tables. A record in the fact table that is not related correctly to a
dimension cannot be given the correct key value to be retrieved.
Easily Understood
A star schema is simple to understand and navigate, with dimensions joined only through the fact table. These
joins are more meaningful to the end user because they represent the fundamental relationships between parts
of the underlying business. Users can also browse dimension table attributes before constructing a query.
Disadvantage of Star Schema
There are some conditions that cannot be met by star schemas. For example, the relationship between users
and bank accounts cannot be described as a star schema, because the relationship between them is many-to-many.
Example: Suppose a star schema is composed of a fact table, SALES, and several dimension tables connected
to it for time, branch, item, and geographic locations.
The TIME table has columns for day, month, quarter, and year. The ITEM table has the columns item_key,
item_name, brand, type, and supplier_type. The BRANCH table has the columns branch_key, branch_name,
and branch_type. The LOCATION table has columns of geographic data, including street, city,
state, and country.
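A minimal Python sketch of this star schema, with the fact table holding foreign keys plus measures and every dimension reached only through it; the key values and row contents are invented for illustration.

```python
# Dimension tables, keyed by their primary keys (all values invented).
time_dim   = {1: {"day": 5, "month": "Jan", "quarter": "Q1", "year": 2024}}
item_dim   = {10: {"item_name": "Modem", "brand": "X", "type": "electronics"}}
branch_dim = {7: {"branch_name": "B1", "branch_type": "retail"}}
loc_dim    = {3: {"city": "Toronto", "state": "ON", "country": "Canada"}}

# Fact table: a composite key of foreign keys, plus the measures.
sales_fact = [
    {"time_key": 1, "item_key": 10, "branch_key": 7, "location_key": 3,
     "units_sold": 4, "rupees_sold": 3200},
]

# A star join: each dimension is resolved through the central fact table.
for f in sales_fact:
    print(time_dim[f["time_key"]]["quarter"],
          loc_dim[f["location_key"]]["city"],
          item_dim[f["item_key"]]["item_name"],
          f["units_sold"])
```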
Snowflake Schema
A snowflake schema is a variant of the star schema in which some dimension tables are normalized into
additional related tables (outriggers), so the entity-relationship diagram resembles a snowflake.
Example: The figure shows a snowflake schema with a Sales fact table, with Store, Location, Time, Product,
Line, and Family dimension tables. The Market dimension has two dimension tables, with Store as the
primary dimension table and Location as the outrigger dimension table. The Product dimension has three
dimension tables, with Product as the primary dimension table, and the Line and Family tables as the
outrigger dimension tables.
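A minimal Python sketch of the Store/Location part of this snowflake, where the normalized Store dimension points at a Location outrigger instead of embedding location columns; all rows and keys are invented.

```python
# Outrigger table: location attributes live in their own normalized table.
location = {100: {"city": "Toronto", "region": "East"}}
# Primary dimension table: Store holds a foreign key to Location.
store = {7: {"store_name": "S7", "location_key": 100}}
# Fact table rows reference the Store dimension as usual.
sales = [{"store_key": 7, "units": 12}]

# Resolving a sale's region now takes one extra join hop through Location.
for s in sales:
    st = store[s["store_key"]]
    print(location[st["location_key"]]["region"], s["units"])
```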
Advantage of Snowflake Schema
The primary advantage of the snowflake schema is the improvement in query performance due to minimized
disk storage requirements and joining smaller lookup tables.
It provides greater scalability in the interrelationship between dimension levels and components.
No redundancy, so it is easier to maintain.
Fact Constellation Schema
A fact constellation schema consists of multiple fact tables that share dimension tables; it can be viewed as a
collection of star schemas and is therefore also called a galaxy schema. A fact constellation schema can be
implemented between aggregate fact tables or by decomposing a complex fact table into independent simple
fact tables.
Example: A fact constellation schema is shown in the figure below.
This schema defines two fact tables, sales and shipping. Sales are treated along four dimensions, namely
time, item, branch, and location. The schema contains a fact table for sales that includes keys to each of the
four dimensions, along with two measures: rupees_sold and units_sold. The shipping table has five
dimensions, or keys (item_key, time_key, shipper_key, from_location, and to_location) and two measures:
rupees_cost and units_shipped.
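A minimal Python sketch of this fact constellation, in which the sales and shipping fact tables share the same time and item dimension tables; the row contents are invented for illustration.

```python
# Shared dimension tables (contents invented).
time_dim = {1: {"quarter": "Q1"}}
item_dim = {10: {"item_name": "Modem"}}

# Two fact tables forming the constellation.
sales_fact = [{"time_key": 1, "item_key": 10, "branch_key": 7,
               "location_key": 3, "rupees_sold": 3200, "units_sold": 4}]
shipping_fact = [{"time_key": 1, "item_key": 10, "shipper_key": 2,
                  "from_location": 3, "to_location": 5,
                  "rupees_cost": 150, "units_shipped": 4}]

# Both fact tables resolve the shared dimensions through the same keys.
for fact in (sales_fact, shipping_fact):
    row = fact[0]
    print(item_dim[row["item_key"]]["item_name"],
          time_dim[row["time_key"]]["quarter"], row)
```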
The primary disadvantage of the fact constellation schema is that it is a more challenging design, because
many variants for specific kinds of aggregation must be considered and selected.
Information Processing
It deals with querying, statistical analysis, and reporting via tables, charts, or graphs. Nowadays, information
processing in a data warehouse is supported by constructing low-cost, web-based accessing tools, typically
integrated with web browsers.
Analytical Processing
It supports various online analytical processing operations, such as drill-down, roll-up, and pivoting. Historical
data is processed in both summarized and detailed formats.