
UNIT I

INTRODUCTION TO DATA WAREHOUSING

Evolution of Decision Support Systems – Data Warehousing Components – Building a Data Warehouse,
Data Warehouse and DBMS, Data Marts, Meta Data, Multidimensional Data Model, OLAP vs OLTP,
OLAP operations, Data cubes, Schemas for Multidimensional Databases: Stars, Snowflakes and Fact
constellations.
*****************************************************************************

Data Warehouse :

 A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than
transaction processing. It includes historical data derived from transaction data from single and
multiple sources.
 A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing
support for decision-makers for data modeling and analysis.
 A Data Warehouse is a collection of data specific to the entire organization, not only to a particular group of users.
 It is not used for daily operations and transaction processing but used for making decisions.
 A Data Warehouse can be viewed as a data system with the following attributes:
 It is a database designed for investigative tasks, using data from various applications.
 It supports a relatively small number of clients with relatively long interactions.
 It includes current and historical data to provide a historical perspective of information.
 Its usage is read-intensive.
 It contains a few large tables.

Characteristics of Data Warehouse :

"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in support of


management's decisions."

1. Subject-Oriented :

A data warehouse targets the modeling and analysis of data for decision-makers. Therefore, data warehouses typically provide a concise and straightforward view of a particular subject, such as customer, product, or sales, rather than of the organization's ongoing operations. This is done by excluding data that are not useful for the subject and including all data needed by the users to understand the subject.
2. Integrated :
A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat files, and online transaction records. It requires performing data cleaning and integration during data warehousing to ensure consistency in naming conventions, attribute types, etc., among different data sources.

3. Time-Variant :
Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even earlier from a data warehouse. This contrasts with a transaction system, where often only the most current data is kept.

4. Non-Volatile :
The data warehouse is a physically separate data store, into which data is transformed from the source operational RDBMS. Operational updates of data do not occur in the data warehouse, i.e., update, insert, and delete operations are not performed. Data access usually requires only two procedures: the initial loading of data and read access to the data. Therefore, the data warehouse does not require transaction processing, recovery, or concurrency control capabilities, which allows for a substantial speedup of data retrieval. Non-volatile means that, once entered into the warehouse, data should not change.
History of Data Warehouse :

 The idea of data warehousing dates back to the late 1980s, when IBM researchers Barry Devlin and Paul Murphy established the "Business Data Warehouse."
 In essence, the data warehousing idea was intended to provide an architectural model for the flow of information from operational systems to decision support environments. The concept attempted to address the various problems associated with this flow, mainly the high costs associated with it.
 In the absence of a data warehousing architecture, a vast amount of space was required to support multiple decision support environments. In large corporations, it was common for various decision support environments to operate independently.

Goals of Data Warehousing

 To support reporting as well as analysis
 To maintain the organization's historical information
 To be the foundation for decision making.

Need for Data Warehouse

1) Business users: Business users require a data warehouse to view summarized data from the past. Since these people are non-technical, the data may be presented to them in an elementary form.
2) Store historical data: A data warehouse is required to store time-variant data from the past. This input is used for various purposes.
3) Make strategic decisions: Some strategies may depend upon the data in the data warehouse. So, the data warehouse contributes to making strategic decisions.
4) For data consistency and quality: By bringing data from different sources to a common place, the user can effectively ensure uniformity and consistency in the data.
5) Fast response time: The data warehouse has to be ready for somewhat unexpected loads and types of queries, which demands a significant degree of flexibility and quick response time.

Benefits of Data Warehouse


 Understand business trends and make better forecasting decisions.
 Data warehouses are designed to perform well with enormous amounts of data.
 The structure of data warehouses is easier for end-users to navigate, understand, and query.
 Queries that would be complex in many normalized databases can be easier to build and maintain in data warehouses.
 Data warehousing is an efficient method to manage demand for lots of information from lots of users.
 Data warehousing provides the capability to analyze large amounts of historical data.
Components or Building Blocks of Data Warehouse

Architecture is the proper arrangement of the elements. We build a data warehouse with software and hardware components. To suit the requirements of our organization, we arrange these building blocks, and we may want to boost up one part or another with extra tools and services. All of this depends on our circumstances.

The figure shows the essential elements of a typical warehouse. The Source Data component is shown on the left. The Data Staging element serves as the next building block. In the middle is the Data Storage component that manages the data warehouse's data. This element not only stores and manages the data; it also keeps track of the data using the metadata repository. The Information Delivery component, shown on the right, consists of all the different ways of making the information from the data warehouse available to the users.

Source Data Component

Production Data: This type of data comes from the various operational systems of the enterprise. Based on the data requirements of the data warehouse, we choose segments of the data from the various operational systems.
Internal Data: In each organization, users keep their "private" spreadsheets, reports, customer profiles, and sometimes even departmental databases. This is the internal data, part of which could be useful in a data warehouse.
Archived Data: Operational systems are mainly intended to run the current business. In every operational system, we periodically take the old data and store it in archived files.
External Data: Most executives depend on information from external sources for a large percentage of the information they use. They use statistics relating to their industry produced by external agencies.

Data Staging Component

 After we have extracted data from various operational systems and external sources, we have to prepare the data for storing in the data warehouse. The extracted data coming from several different sources needs to be changed, converted, and made ready in a format suitable for querying and analysis.
 We will now discuss the three primary functions that take place in the staging area.

1) Data Extraction: This function has to deal with numerous data sources. We have to employ the appropriate techniques for each data source.
2) Data Transformation: As we know, data for a data warehouse comes from many different sources. If data extraction for a data warehouse poses big challenges, data transformation presents even bigger challenges. We perform several individual tasks as part of data transformation.
First, we clean the data extracted from each source. Cleaning may be the correction of misspellings, the provision of default values for missing data elements, or the elimination of duplicates when we bring in the same data from multiple source systems.
Standardization of data elements forms a large part of data transformation. Data transformation involves many forms of combining pieces of data from different sources. We combine data from a single source record or related data elements from many source records.
On the other hand, data transformation also includes purging source data that is not useful and separating out source records into new combinations. Sorting and merging of data take place on a large scale in the data staging area. When the data transformation function ends, we have a collection of integrated data that is cleaned, standardized, and summarized.
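The following is a minimal Python sketch of these cleaning, standardization, and summarization steps, using pandas; the source extracts, column names, and coding schemes are hypothetical, for illustration only.

import pandas as pd

# Hypothetical extracts from two source systems with inconsistent
# state codes and a missing revenue value.
src_a = pd.DataFrame({"cust_id": [1, 2], "state": ["NY", "nj"], "revenue": [100.0, None]})
src_b = pd.DataFrame({"cust_id": [2, 3], "state": ["New Jersey", "CA"], "revenue": [250.0, 75.0]})

# Standardize naming conventions and coding schemes across sources.
STATE_CODES = {"new jersey": "NJ", "ny": "NY", "nj": "NJ", "ca": "CA"}
for df in (src_a, src_b):
    df["state"] = df["state"].str.lower().map(STATE_CODES)

# Provide defaults for missing elements, merge the sources, and
# eliminate duplicates brought in from both systems.
combined = pd.concat([src_a, src_b], ignore_index=True)
combined["revenue"] = combined["revenue"].fillna(0.0)
combined = combined.drop_duplicates(subset="cust_id", keep="last")

# Summarize: the integrated data is now cleaned and standardized,
# ready for loading into warehouse storage.
summary = combined.groupby("state", as_index=False)["revenue"].sum()
print(summary)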
3) Data Loading: Two distinct groups of tasks form the data loading function. When we complete the structure and construction of the data warehouse and go live for the first time, we do the initial loading of the data into the data warehouse storage. The initial load moves high volumes of data and uses up a substantial amount of time; after that, ongoing incremental loads apply the changes captured from the source systems.
Data Storage Component
Data storage for the data warehouse is a separate repository. The data repositories for operational systems generally contain only current data. Also, these repositories hold data structured in a highly normalized form for fast and efficient processing.
Information Delivery Component
The information delivery element is used to enable the process of subscribing to data warehouse information and having it delivered to one or more destinations according to some customer-specified scheduling algorithm.
Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database management system. In the data dictionary, we keep the data about the logical data structures, the data about the records and addresses, the information about the indexes, and so on.
Data Marts
A data mart includes a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to particular selected subjects. Data in a data warehouse should be fairly current, but not necessarily up to the minute, although developments in the data warehouse industry have made standard and incremental data loads more achievable. Data marts are smaller than data warehouses and usually serve a single department or subject area. The current trend in data warehousing is to develop a data warehouse with several smaller related data marts for particular kinds of queries and reports.
Management and Control Component
The management and control elements coordinate the services and functions within the data warehouse. These components control the data transformation and the data transfer into the data warehouse storage. On the other hand, they moderate the data delivery to the clients. They work with the database management systems and ensure that data is correctly stored in the repositories. They monitor the movement of data into the staging area and from there into the data warehouse storage itself.
Why we need a separate Data Warehouse?
1. Data warehouse queries are complex because they involve the computation of large groups of data at summarized levels.
2. They may require the use of distinctive data organization, access, and implementation methods based on multidimensional views.
3. Performing OLAP queries in an operational database degrades the performance of operational tasks.
4. A data warehouse is used for analysis and decision making, which requires an extensive database, including historical data, which an operational database does not typically maintain.
5. The separation of an operational database from the data warehouse is based on the different structures and uses of data in these systems.
6. Because the two systems provide different functionalities and require different kinds of data, it is necessary to maintain separate databases.
Difference between Database and Data Warehouse

1. A database is used for Online Transaction Processing (OLTP), though it can serve other purposes such as data warehousing; it records current data from clients. A data warehouse is used for Online Analytical Processing (OLAP); it reads historical information about customers for business decisions.
2. In a database, the tables and joins are complicated because they are normalized for the RDBMS; this reduces redundant data and saves storage space. In a data warehouse, the tables and joins are simple because they are de-normalized; this minimizes the response time for analytical queries.
3. Database data is dynamic; data warehouse data is largely static.
4. Entity-relationship modeling techniques are used for database design; dimensional modeling techniques are used for data warehouse design.
5. A database is optimized for write operations; a data warehouse is optimized for read operations.
6. Database performance is low for analytical queries; a data warehouse offers high performance for analytical queries.
7. The database is the place where data is captured and managed to provide fast and efficient access; the data warehouse is the place where application data is handled for analysis and reporting purposes.

Building a data warehouse

Organizations embarking on data warehousing development can choose one of two approaches:
Top-down approach: The organization has developed an enterprise data model, collected enterprise-wide business requirements, and decided to build an enterprise data warehouse with subset data marts.
Bottom-up approach: Business priorities resulted in developing individual data marts, which are then integrated into the enterprise data warehouse.
Organizational issues: The requirements and environments associated with the informational applications of a data warehouse are different. Therefore, an organization will need to employ different development practices than the ones it uses for operational applications.
Design considerations: In general, a data warehouse's design point is to consolidate data from multiple, often heterogeneous, sources into a queryable database. The main factors include:
Heterogeneity of data sources, which affects data conversion, quality, and timeliness.
Use of historical data, which implies that data may be "old".
The tendency of the database to grow very large.

Data Content: Typically, a data warehouse may contain detailed data, but the data is cleaned up and transformed to fit the warehouse model, and certain transactional attributes of the data are filtered out. The content and the structure of the data warehouse are reflected in its data model. The data model is a template for how information will be organized within the integrated data warehouse framework.

Meta data: Defines the contents and location of the data in the warehouse, the relationships between the operational databases and the data warehouse, and the business views of the warehouse data that are accessible by end-user tools. The warehouse design should prevent any direct access to the warehouse data if it does not use the meta data definitions to gain that access.

Data distribution: As data volumes continue to grow, the database size may rapidly outgrow a single server. Therefore, it becomes necessary to decide how the data should be divided across multiple servers. The data placement and distribution design should consider several options, including data distribution by subject area, location, or time.
Tools: Data warehouse designers have to be careful not to sacrifice the overall design to fit a specific tool. Selected tools must be compatible with the given data warehousing environment and with each other.
Performance considerations: Rapid query processing is a highly desired feature that should be designed into the data warehouse.

Nine decisions in the design of a data warehouse:


1. Choosing the subject matter
2. Deciding what a fact table represents
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. The need to track slowly changing dimensions
9. Deciding the query priorities and the query modes

Technical Considerations
A number of technical issues are to be considered when designing and implementing a data warehouse environment. These issues include: the hardware platform that will house the data warehouse; the database management system that supports the warehouse database; the communication infrastructure that connects the warehouse, data marts, operational systems, and end users; the hardware platform and software to support the metadata repository; and the systems management framework that enables the centralized management and administration of the entire environment.
Implementation Considerations
A data warehouse cannot simply be bought and installed; its implementation requires the integration of many products within the data warehouse environment.
A data warehouse is an environment, not a product. It is based on a relational database management system that functions as the central repository for informational data.
The central information repository is surrounded by a number of key components designed to make the entire environment functional, manageable, and accessible.
The data sources for the data warehouse are operational applications. The data entered into the data warehouse is transformed into an integrated structure and format. The transformation process involves conversion, summarization, filtering, and condensation. The data warehouse must be capable of holding and managing large volumes of data as well as different data structures over time.

1 Data warehouse database


This is the central part of the data warehousing environment (item number 2 in the architecture diagram). It is implemented based on RDBMS technology.
2 Sourcing, Acquisition, Clean up, and Transformation Tools
These are item number 1 in the architecture diagram. They perform conversions, summarization, key changes, structural changes, and condensation. The data transformation is required so that the information can be used by decision support tools. The transformation produces programs, control statements, JCL code, COBOL code, UNIX scripts, SQL DDL code, etc., to move the data into the data warehouse from multiple operational systems. The functionalities of these tools are listed below:
· To remove unwanted data from operational databases
· Converting to common data names and attributes
· Calculating summaries and derived data
· Establishing defaults for missing data
· Accommodating source data definition changes
Issues to be considered during data sourcing, cleanup, extraction and transformation:
Database heterogeneity: It refers to the different natures of the DBMSs involved: they may use different data models, different access languages, and different data navigation methods, operations, concurrency, integrity and recovery processes, etc.
Data heterogeneity: It refers to the different ways the data is defined and used in different modules.
Some vendors involved in the development of such tools include Prism Solutions, Evolutionary Technology Inc., Vality, Praxis and Carleton.
3 Meta data
It is data about data. It is used for maintaining, managing and using the data warehouse. It is classified into two categories:
Technical Meta data: It contains information about warehouse data used by warehouse designers and administrators to carry out development and management tasks. It includes:
· Info about data stores
· Transformation descriptions. That is mapping methods from operational db to warehouse db
· Warehouse Object and data structure definitions for target data
· The rules used to perform clean up, and data enhancement
· Data mapping operations
· Access authorization, backup history, archive history, info delivery history, data acquisition history, data
access etc.,
Business Meta data: It contains information that gives users a business perspective on the information stored in the data warehouse. It includes:
· Subject areas and information object types, including queries, reports, images, video and audio clips, etc.
· Internet home pages
· Information related to the information delivery system
· Data warehouse operational information, such as ownerships, audit trails, etc.
Meta data helps the users to understand the content and find the data. Meta data is stored in a separate data store known as the information directory or metadata repository, which helps to integrate, maintain and view the contents of the data warehouse. The following are the characteristics of the information directory / meta data:
· It is the gateway to the data warehouse environment
· It supports easy distribution and replication of content for high performance and availability
· It should be searchable by business oriented key words
· It should act as a launch platform for end user to access data and analysis tools
· It should support the sharing of info
· It should support scheduling options for request
· It should support and provide interfaces to other applications
· It should support end user monitoring of the status of the data warehouse environment
4 Access tools
Their purpose is to provide information to business users for decision making. There are five categories:
· Data query and reporting tools
· Application development tools
· Executive information system (EIS) tools
· OLAP tools
· Data mining tools
Query and reporting tools are used to generate queries and reports. There are two types of reporting tools:
· Production reporting tools, used to generate regular operational reports
· Desktop report writers, inexpensive desktop tools designed for end users
Managed query tools: used to generate SQL queries. They use meta layer software between the users and the database, which offers point-and-click creation of SQL statements. These tools are the preferred choice of users performing segment identification, demographic analysis, territory management, preparation of customer mailing lists, etc.
Application development tools: These provide a graphical data access environment which integrates OLAP tools with the data warehouse and can be used to access all database systems.
OLAP tools: These are used to analyze the data in multidimensional and complex views. To enable multidimensional properties they use MDDBs and MRDBs, where MDDB refers to multidimensional databases and MRDB refers to multirelational databases.
Data mining tools: These are used to discover knowledge from the data warehouse data; they can also be used for data visualization and data correction purposes.
5 Data marts
Data marts are departmental subsets that focus on selected subjects. They are independent and used by a dedicated user group. They are used for rapid delivery of enhanced decision support functionality to end users. A data mart is used in the following situations:
· Extremely urgent user requirements
· The absence of a budget for a full-scale data warehouse strategy
· The decentralization of business needs
· The attraction of easy-to-use tools and mind-sized projects
A data mart presents two problems:
Scalability: A small data mart can grow quickly in multiple dimensions, so while designing it, the organization has to pay attention to system scalability, consistency and manageability issues.
Data integration.
6 Data warehouse admin and management
The management of data warehouse includes,
· Security and priority management
· Monitoring updates from multiple sources
· Data quality checks
· Managing and updating meta data
· Auditing and reporting data warehouse usage and status
· Purging data
· Replicating, subsetting and distributing data
· Backup and recovery
· Data warehouse storage management which includes capacity planning, hierarchical storage management
and purging of aged data etc.,
7 Information delivery system
It is used to enable the process of subscribing to data warehouse information and delivering it to one or more destinations according to a specified scheduling algorithm.

DATABASE ARCHITECTURES FOR PARALLEL PROCESSING

There are three DBMS software architecture styles for parallel processing:
• Shared memory (shared everything) architecture
• Shared disk architecture
• Shared nothing architecture
Shared Memory Architecture:
Tightly coupled shared memory systems, illustrated in the following figure, have the following characteristics:
• Multiple PUs share memory.
• Each PU has full access to all shared memory through a common bus.
• Communication between nodes occurs via shared memory.
• Performance is limited by the bandwidth of the memory bus.
• It is simple to implement and provides a single system image; an RDBMS implemented on an SMP (symmetric multiprocessor) is an example.
A disadvantage of shared memory systems for parallel processing is as follows:
• Scalability is limited by bus bandwidth and latency, and by available memory.
Shared Disk Architecture
Shared disk systems are typically loosely coupled. Such systems, illustrated in following figure, have the
following characteristics:
• Each node consists of one or more PUs and associated memory.
• Memory is not shared between nodes.
• Communication occurs over a common high-speed bus.
• Each node has access to the same disks and other resources.
• A node can be an SMP if the hardware supports it.
• Bandwidth of the high-speed bus limits the number of nodes (scalability) of the system.
• The Distributed Lock Manager (DLM) is required.
• Parallel processing advantages of shared disk systems are as follows:
• Shared disk systems permit high availability. All data is accessible even if one node dies.
• These systems have the concept of one database, which is an advantage over shared nothing systems.
• Shared disk systems provide for incremental growth.
• Parallel processing disadvantages of shared disk systems are these:
• Inter-node synchronization is required, involving DLM overhead and greater dependency on high-
speed interconnect.
• If the workload is not partitioned well, there may be high synchronization overhead.
Shared Nothing Architecture
• Shared nothing systems are typically loosely coupled. In shared nothing systems, only one CPU is connected to a given disk; if a table or database is located on that disk, access depends entirely on the PU which owns it. Shared nothing systems are concerned with access to disks, not access to memory.
• Adding more PUs and disks can improve scale up.
• Shared nothing systems have advantages and disadvantages for parallel processing:
Advantages
• Shared nothing systems provide for incremental growth.
• System growth is practically unlimited.
• MPPs are good for read-only databases and decision support applications.
• Failure is local: if one node fails, the others stay up.
Disadvantages
• More coordination is required.
• More overhead is required for a process working on a disk belonging to another node.
• If there is a heavy workload of updates or inserts, as in an online transaction processing system, it may
be worthwhile to consider data-dependent routing to alleviate contention.
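As a concrete illustration of how rows are tied to nodes in a shared-nothing system, the following is a minimal Python sketch of hash-based data partitioning, one of the partitioning schemes listed under the vendors later in this section; the node count and keys are assumptions for illustration.

import zlib

# Route each row to the one node (and disk) that owns it by
# hashing the partitioning key; four nodes are assumed here.
NUM_NODES = 4

def node_for(key: str) -> int:
    return zlib.crc32(key.encode()) % NUM_NODES

rows = [("cust-17", "NY"), ("cust-42", "CA"), ("cust-99", "TX")]
partitions = {n: [] for n in range(NUM_NODES)}
for key, value in rows:
    partitions[node_for(key)].append((key, value))

# Each node scans only its own partition; rows with the same key
# always land on the same node, so no inter-node data movement is
# needed to find them again.
for node, part in sorted(partitions.items()):
    print(f"node {node}: {part}")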
To operate effectively in a shared-nothing environment, the DBMS must meet several requirements. These requirements include:
• Support for function shipping
• Parallel join strategies
• Support for data repartitioning
• Query compilation
• Support for database transactions
• Support for the single system image of the database environment
Combined architecture
• Interserver parallelism: each query is parallelized across multiple servers.
• Intraserver parallelism: the query is parallelized within a server.
• The combined architecture supports interserver parallelism of distributed memory MPPs and clusters, and intraserver parallelism of SMP nodes.
Parallel DBMS features
• Scope and techniques of parallel DBMS operations
• Optimizer implementation
• Application transparency
• Parallel environment, which allows the DBMS server to take full advantage of the existing facilities on a very low level
• DBMS management tools, which help to configure, tune, administer and monitor a parallel RDBMS as effectively as if it were a serial RDBMS
• Price/performance: the parallel RDBMS can demonstrate a non-linear speed-up and scale-up at reasonable costs
Alternative technologies
Technologies for improving performance in the data warehouse environment include:
• Advanced database indexing products
• Multidimensional databases
• Specialized RDBMSs
• Advanced indexing techniques (see, for example, Sybase IQ)
PARALLEL DBMS VENDORS

Oracle: Parallel Query Option (PQO)


Architecture: shared disk arch
Data partition: Key range, hash, round robin
Parallel operations: hash joins, scan and sort
Informix: eXtended Parallel Server (XPS)
Architecture: Shared memory, shared disk and shared nothing models
Data partition: round robin, hash, schema, key range and user defined
Parallel operations: INSERT, UPDATE, DELETE
IBM: DB2 Parallel Edition (DB2 PE)
Architecture: Shared nothing models
Data partition: hash
Parallel operations: INSERT, UPDATE, DELETE, load, recovery, index creation, backup, table reorganization
SYBASE: SYBASE MPP
Architecture: Shared nothing models
Data partition: hash, key range, Schema
Parallel operations: Horizontal and vertical parallelism

What is Data Mart

A Data Mart is a subset of an organizational data store, generally oriented to a specific purpose or primary data subject, and it may be distributed to support business needs. Data marts are analytical record stores designed to focus on particular business functions for a specific community within an organization. Data marts are derived from subsets of data in a data warehouse, though in the bottom-up data warehouse design methodology the data warehouse is created from the union of organizational data marts.

Reasons for creating a data mart


 Creates collective data for a group of users
 Easy access to frequently needed data
 Ease of creation
 Improves end-user response time
 Lower cost than implementing a complete data warehouse
 Potential clients are more clearly defined than in a comprehensive data warehouse
 It contains only essential business data and is less cluttered.
Types of Data Marts
There are two main approaches to designing data marts:
1. Dependent Data Marts
2. Independent Data Marts

Dependent Data Marts

A dependent data mart is a logical subset or a physical subset of a larger data warehouse. In this technique, the data marts are treated as subsets of the data warehouse: first a data warehouse is created, from which various data marts are then created. These data marts depend on the data warehouse and extract the essential records from it. Since the data warehouse creates the data mart, there is no need for data mart integration. This is also known as a top-down approach.

Independent Data Marts

The second approach is independent data marts (IDM). Here, independent data marts are created first, and then a data warehouse is designed using these multiple independent data marts. In this approach, as all the data marts are designed independently, the integration of data marts is required. It is also termed a bottom-up approach, as the data marts are integrated to develop a data warehouse.
Other than these two categories, one more type exists that is called "Hybrid Data Marts."

Hybrid Data Marts

It allows us to combine input from sources other than a data warehouse. This can be helpful in many situations, especially when ad hoc integrations are needed, such as after a new group or product is added to the organization.

Steps in Implementing a Data Mart

The significant steps in implementing a data mart are to design the schema, construct the physical storage,
populate the data mart with data from source systems, access it to make informed decisions and manage it
over time. So, the steps are:
Designing
The design step is the first in the data mart process. This phase covers all of the functions from initiating the
request for a data mart through gathering data about the requirements and developing the logical and physical
design of the data mart.
It involves the following tasks:
 Gathering the business and technical requirements
 Identifying data sources
 Selecting the appropriate subset of data
 Designing the logical and physical architecture of the data mart.
Constructing
This step involves creating the physical database and the logical structures associated with the data mart to provide fast and efficient access to the data.
It involves the following tasks:
 Creating the physical database and logical structures, such as tablespaces, associated with the data mart.
 Creating the schema objects, such as tables and indexes, described in the design step.
 Determining how best to set up the tables and access structures.
Populating
This step includes all of the tasks related to getting data from the source, cleaning it up, modifying it to the right format and level of detail, and moving it into the data mart.
It involves the following tasks:
 Mapping data sources to target data sources
 Extracting data
 Cleansing and transforming the information.
 Loading data into the data mart
 Creating and storing metadata
Accessing
This step involves putting the data to use: querying the data, analyzing it, creating reports, charts and graphs
and publishing them.
It involves the following tasks:
 Set up an intermediate layer (meta layer) for the front-end tool to use. This layer translates database operations and object names into business terms so that end-clients can interact with the data mart using words that relate to the business functions.
 Set up and manage database structures, like summarized tables, that help queries submitted through the front-end tools execute rapidly and efficiently.
Managing
This step involves managing the data mart over its lifetime. In this step, the following management functions are performed:

 Providing secure access to the data.


 Managing the growth of the data.
 Optimizing the system for better performance.
 Ensuring the availability of data even with system failures.

Difference between Data Warehouse and Data Mart

1. A data warehouse is a vast repository of information collected from various departments within a corporation. A data mart is a subtype of the data warehouse, architected to meet the requirements of a specific user group.
2. A data warehouse may hold multiple subject areas. A data mart holds only one subject area; for example, Finance or Sales.
3. A data warehouse holds very detailed information. A data mart may hold more summarized data.
4. A data warehouse works to integrate all data sources. A data mart concentrates on integrating data from a given subject area or set of source systems.
5. In a data warehouse, the fact constellation schema is used. In a data mart, the star schema and snowflake schema are used.
6. A data warehouse is a centralized system. A data mart is a decentralized system.
7. Data warehousing is data-oriented. A data mart is project-oriented.

What is Meta Data


Metadata is data about the data, or documentation about the information, which is required by the users. In data warehousing, metadata is one of the essential aspects.
Metadata includes the following:
 The location and descriptions of warehouse systems and components.
 Names, definitions, structures, and content of the data warehouse and end-user views.
 Identification of authoritative data sources.
 Integration and transformation rules used to populate the data warehouse.
 Integration and transformation rules used to deliver information to end-user analytical tools.
 Subscription information for information delivery to analysis subscribers.
 Metrics used to analyze warehouse usage and performance.
 Security authorizations, access control lists, etc.
Metadata is used for building, maintaining, managing, and using the data warehouse. It allows users to understand the content and find the data.
Types of Metadata
Metadata in a data warehouse fall into three major parts:
1. Operational Metadata
2. Extraction and Transformation Metadata
3. End-User Metadata
Operational Metadata
As we know, data for the data warehouse comes from various operational systems of the enterprise. These
source systems include different data structures. The data elements selected for the data warehouse have
various fields lengths and data types.
In selecting information from the source systems for the data warehouses, we divide records, combine factor
of documents from different source files, and deal with multiple coding schemes and field lengths. When we
deliver information to the end-users, we must be able to tie that back to the source data sets. Operational
metadata contains all of this information about the operational data sources.
Extraction and Transformation Metadata
Extraction and transformation metadata include data about the extraction of data from the source systems, namely, the extraction frequencies, extraction methods, and business rules for the data extraction. Also, this category of metadata contains information about all the data transformations that take place in the data staging area.
End-User Metadata
The end-user metadata is the navigational map of the data warehouses. It enables the end-users to find data
from the data warehouses. The end-user metadata allows the end-users to use their business terminology and
look for the information in those ways in which they usually think of the business.

Metadata Repository
The metadata itself is housed in and controlled by the metadata repository. Metadata repository management software can be used to map the source data to the target database, integrate and transform the data, generate code for data transformation, and move data to the warehouse.
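As a small illustration, the following Python sketch shows what one entry in a metadata repository might record for a single warehouse column; all names and values are hypothetical, for illustration only.

# One hypothetical source-to-target mapping entry; a real repository
# would hold one such record per warehouse data element.
mapping_entry = {
    "target": "sales_fact.rupees_sold",
    "source": "orders_db.order_line.amount",
    "transformation": "SUM(amount) grouped by day, item, branch, location",
    "extraction_frequency": "daily",
    "business_definition": "Gross sales value before returns",
    "owner": "finance",
}

def lineage(entry: dict) -> str:
    # Answers the end-user question: where did this column come from?
    return f"{entry['target']} <- {entry['source']} via {entry['transformation']}"

print(lineage(mapping_entry))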
Benefits of Metadata Repository
1. It provides a set of tools for enterprise-wide metadata management.
2. It eliminates and reduces inconsistency, redundancy, and underutilization.
3. It improves organization control, simplifies management, and accounting of information assets.
4. It increases coordination, understanding, identification, and utilization of information assets.
5. It enforces CASE development standards with the ability to share and reuse metadata.
6. It leverages investment in legacy systems and utilizes existing applications.
7. It provides a relational model for heterogeneous RDBMS to share information.

OLTP vs OLAP

Source of data: OLTP systems hold operational data and are the original source of the data. OLAP data is consolidated data; it comes from the various OLTP databases.
Purpose of data: OLTP is used to control and run fundamental business tasks. OLAP is used to help with planning, problem solving, and decision support.
What the data reveals: OLTP reveals a snapshot of ongoing business processes. OLAP provides multi-dimensional views of various kinds of business activities.
Inserts and updates: In OLTP, short and fast inserts and updates are initiated by end users. In OLAP, periodic long-running batch jobs refresh the data.
Queries: OLTP queries are relatively standardized and simple, returning relatively few records. OLAP queries are often complex and involve aggregations.
Processing speed: OLTP is typically very fast. OLAP speed depends on the amount of data involved; batch data refreshes and complex queries may take many hours, and query speed can be improved by creating indexes.
Space requirements: OLTP storage can be relatively small if historical data is archived. OLAP storage is larger due to the existence of aggregation structures and history data, and requires more indexes than OLTP.
Database design: OLTP databases are highly normalized with many tables. OLAP databases are typically de-normalized with fewer tables, and use star and/or snowflake schemas.
Backup and recovery: OLTP data is backed up religiously, since operational data is critical to run the business and data loss is likely to entail significant monetary loss and legal liability. For OLAP, instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.

OLAP Operations in the Multidimensional Data Model

In the multidimensional model, the records are organized into various dimensions, and each dimension includes multiple levels of abstraction described by concept hierarchies. This organization gives users the flexibility to view data from various perspectives. A number of OLAP data cube operations exist to materialize these different views, allowing interactive querying and analysis of the records at hand. Hence, OLAP provides a user-friendly environment for interactive data analysis.
Consider the OLAP operations to be performed on multidimensional data. The figure shows a data cube for the sales of a shop. The cube contains the dimensions location, time, and item, where location is aggregated with respect to city values, time is aggregated with respect to quarters, and item is aggregated with respect to item types.
Roll-Up
The roll-up operation (also known as the drill-up or aggregation operation) performs aggregation on a data cube, either by climbing up a concept hierarchy or by dimension reduction. Roll-up is like zooming out on the data cube. The figure shows the result of a roll-up operation performed on the dimension location. The hierarchy for location is defined as the order street < city < province or state < country. The roll-up operation aggregates the data by ascending the location hierarchy from the level of city to the level of country.
When a roll-up is performed by dimension reduction, one or more dimensions are removed from the cube. For example, consider a sales data cube having two dimensions, location and time. Roll-up may be performed by removing the time dimension, resulting in an aggregation of the total sales by location, rather than by location and by time.
Example
Consider the following cube, illustrating the temperatures of certain days recorded weekly:
Temperature 64 65 68 69 70 71 72 75 80 81 83 85
Week1 1 0 1 0 1 0 0 0 0 0 1 0
Week2 0 0 0 1 0 0 1 2 0 1 0 0
Consider that we want to set up levels hot (80-85), mild (70-75) and cool (64-69) in temperature from the above cube.
To do this, we have to group the columns and add up the values according to the concept hierarchy. This operation is known as a roll-up.
By doing this, we obtain the following cube:

Temperature cool mild hot
Week1 2 1 1
Week2 1 3 1
The roll-up operation groups the information by levels of temperature.
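Below is a minimal Python sketch of this roll-up using pandas; the flattened records correspond to the non-zero cells of the weekly cube above.

import pandas as pd

# One record per temperature reading behind the weekly cube above.
data = pd.DataFrame({
    "week": ["Week1"] * 4 + ["Week2"] * 5,
    "temperature": [64, 68, 70, 83, 69, 72, 75, 75, 81],
})

# Climb the concept hierarchy: map raw temperatures to levels.
def level(t: int) -> str:
    if t >= 80:
        return "hot"
    if t >= 70:
        return "mild"
    return "cool"

data["level"] = data["temperature"].map(level)

# Roll-up: group by the coarser level and aggregate (here, count).
rollup = data.groupby(["week", "level"]).size().unstack(fill_value=0)
print(rollup)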
The following diagram illustrates how roll-up works.
Drill-Down
The drill-down operation (also called roll-down) is the reverse of roll-up. Drill-down is like zooming in on the data cube. It navigates from less detailed data to more detailed data. Drill-down can be performed either by stepping down a concept hierarchy for a dimension or by adding an additional dimension.
The figure shows a drill-down operation performed on the dimension time by stepping down a concept hierarchy defined as day < month < quarter < year. Drill-down occurs by descending the time hierarchy from the level of quarter to the more detailed level of month.
Because a drill-down adds more detail to the given data, it can also be performed by adding a new dimension to the cube. For example, a drill-down on the central cube of the figure can occur by introducing an additional dimension, such as a customer group.
Example
Drill-down adds more details to the given data

Temperature cool mild hot
Day 1 0 0 0
Day 2 0 0 0
Day 3 0 0 1
Day 4 0 1 0
Day 5 1 0 0
Day 6 0 0 0
Day 7 1 0 0
Day 8 0 0 0
Day 9 1 0 0
Day 10 0 1 0
Day 11 0 1 0
Day 12 0 1 0
Day 13 0 0 1
Day 14 0 0 0
The following diagram illustrates how Drill-down works.

Slice
A slice is a subset of the cube corresponding to a single value for one or more members of a dimension. For example, a slice operation is executed when the customer wants a selection on one dimension of a three-dimensional cube, resulting in a two-dimensional slice. So, the slice operation performs a selection on one dimension of the given cube, thus resulting in a subcube.
For example, if we make the selection temperature = cool, we obtain the following cube:

Temperature cool
Day 1 0
Day 2 0
Day 3 0
Day 4 0
Day 5 1
Day 6 0
Day 7 1
Day 8 0
Day 9 1
Day 10 0
Day 11 0
Day 12 0
Day 13 0
Day 14 0
The following diagram illustrates how slice works. Here, slice is performed on the dimension "time" using the criterion time = "Q1". It forms a new sub-cube by selecting one or more dimensions.
Dice
The dice operation defines a subcube by performing a selection on two or more dimensions.
For example, applying the selection (time = day 3 OR time = day 4) AND (temperature = cool OR temperature = hot) to the original cube, we get the following subcube (still two-dimensional):

Temperature cool hot
Day 3 0 1
Day 4 0 0
Consider the following diagram, which shows the dice operation.
The dice operation on the cube, based on the following selection criteria, involves three dimensions:
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item = "Mobile" or "Modem")
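A minimal Python sketch of slice and dice as boolean selections, using pandas; the cube is represented by its non-zero day-level cells from the drill-down example above.

import pandas as pd

# Day-level records from the drill-down cube (non-zero cells only).
cube = pd.DataFrame({
    "day": [3, 4, 5, 7, 9, 10, 11, 12, 13],
    "level": ["hot", "mild", "cool", "cool", "cool", "mild", "mild", "mild", "hot"],
})

# Slice: fix a single value on one dimension, yielding a sub-cube.
cool_slice = cube[cube["level"] == "cool"]  # days 5, 7 and 9

# Dice: select on two or more dimensions at once.
diced = cube[cube["day"].isin([3, 4]) & cube["level"].isin(["cool", "hot"])]

print(cool_slice)
print(diced)  # only day 3 (hot) satisfies both conditions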
Pivot
The pivot operation is also called rotation. Pivot is a visualization operation which rotates the data axes in view in order to provide an alternative presentation of the data. It may involve swapping the rows and columns, or moving one of the row dimensions into the column dimensions.
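A minimal Python sketch of the pivot operation using pandas; the quarterly sales figures are invented for illustration.

import pandas as pd

# Item sales by quarter; the numbers are illustrative only.
sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "item": ["Mobile", "Modem", "Mobile", "Modem"],
    "units": [605, 825, 680, 952],
})

# Present quarters as rows and items as columns...
by_quarter = sales.pivot(index="quarter", columns="item", values="units")

# ...then pivot (rotate): rows become columns and vice versa.
rotated = by_quarter.T
print(by_quarter)
print(rotated)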
Consider the following diagram, which shows the pivot operation.

MULTIDIMENSIONAL DATA MODEL


The multidimensional data model stores data in the form of a data cube. Mostly, data warehousing supports two- or three-dimensional cubes. A data cube allows data to be viewed in multiple dimensions. Dimensions are entities with respect to which an organization wants to keep records. For example, in a store sales record, dimensions allow the store to keep track of things like monthly sales of items and the branches and locations at which the items were sold.
A multidimensional database helps to provide data-related answers to complex business queries quickly and accurately. Data warehouses and Online Analytical Processing (OLAP) tools are based on a multidimensional data model. OLAP in data warehousing enables users to view data from different angles and dimensions.
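As a small illustration of viewing the same facts along different dimensions, here is a minimal pandas sketch; the records and numbers are assumptions for illustration only.

import pandas as pd

# Flat sales records: one row per (time, location, item) observation.
records = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2", "Q1"],
    "location": ["Chennai", "Mumbai", "Chennai", "Mumbai", "Chennai"],
    "item":     ["Mobile", "Mobile", "Modem", "Mobile", "Modem"],
    "sales":    [120, 95, 60, 110, 45],
})

# Two faces of the same cube: sales by time x location, and by time x item.
by_time_location = records.pivot_table(index="time", columns="location",
                                       values="sales", aggfunc="sum", fill_value=0)
by_time_item = records.pivot_table(index="time", columns="item",
                                   values="sales", aggfunc="sum", fill_value=0)
print(by_time_location)
print(by_time_item)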

Schemas for the Multidimensional Data Model:


1. Star Schema
2. Snowflake Schema
3. Fact Constellation Schema

What is Star Schema

A star schema is the elementary form of a dimensional model, in which data are organized into facts and dimensions. A fact is an event that is counted or measured, such as a sale or a login. A dimension contains reference data about the fact, such as date, item, or customer.
A star schema is a relational schema whose design represents a multidimensional data model. It is the simplest data warehouse schema. It is known as a star schema because the entity-relationship diagram of this schema resembles a star, with points diverging from a central table. The center of the schema consists of a large fact table, and the points of the star are the dimension tables.

Fact Tables
1. A fact table in a star schema contains facts and is connected to the dimensions. A fact table has two types of columns: those that contain facts and those that are foreign keys to the dimension tables. The primary key of the fact table is generally a composite key made up of all of its foreign keys.
2. A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often called summary tables instead).
3. A dimension is a structure usually composed of one or more hierarchies that categorize data. If a dimension does not have hierarchies and levels, it is called a flat dimension or list. The primary key of each dimension table is part of the composite primary key of the fact table. Dimensional attributes help to describe the dimensional values; they are generally descriptive, textual values. Dimension tables are usually smaller than the fact table.
4. Fact tables store data about sales, while dimension tables hold data about the geographic regions (markets, cities), clients, products, times, and channels.

Characteristics of Star Schema


The star schema is particularly suitable for data warehouse database design because of the following features:
1. It creates a de-normalized database that can quickly provide query responses.
2. It provides a flexible design that can be changed easily or added to throughout the development cycle, and as the database grows.
3. It parallels in design how end-users typically think of and use the data.
4. It reduces the complexity of metadata for both developers and end-users.
Advantages of Star Schema
Star schemas are easy for end-users and applications to understand and navigate. With a well-designed schema, the customer can instantly analyze large, multidimensional data sets.
The main advantages of star schemas in a decision-support environment are:

Query Performance
Because a star schema database has a small number of tables and clear join paths, queries run faster than they do against OLTP systems. Small single-table queries, frequently of a dimension table, are almost instantaneous. Large join queries that contain multiple tables take only seconds or minutes to run.
In a star schema database design, the dimensions are connected only through the central fact table. When two dimension tables are used in a query, only one join path, intersecting the fact table, exists between those two tables. This design feature enforces accurate and consistent query results.
Load performance and administration
Structural simplicity also decreases the time required to load large batches of records into a star schema database. By defining facts and dimensions and separating them into different tables, the impact of a load is reduced. Dimension tables can be populated once and occasionally refreshed. We can add new facts regularly and selectively by appending records to the fact table.
Built-in referential integrity
A star schema has referential integrity built in when information is loaded. Referential integrity is enforced because each record in the dimension tables has a unique primary key, and all keys in the fact table are legitimate foreign keys drawn from the dimension tables. A record in the fact table which is not related correctly to a dimension cannot be given the correct key value to be retrieved.
Easily Understood
A star schema is simple to understand and navigate, with dimensions joined only through the fact table. These joins are meaningful to the end-user because they represent the fundamental relationships between parts of the underlying business. Customers can also browse dimension table attributes before constructing a query.
Disadvantage of Star Schema
There are some conditions which cannot be met by star schemas; for example, a many-to-many relationship, such as the relationship between users and bank accounts, cannot be described as a star schema.
Example: Suppose a star schema is composed of a fact table, SALES, and several dimension tables connected
to it for time, branch, item, and geographic locations.
The TIME table has columns for day, month, quarter, and year. The ITEM table has columns for item_key, item_name, brand, type, and supplier_type. The BRANCH table has columns for branch_key, branch_name, and branch_type. The LOCATION table has columns of geographic data, including street, city, state, and country.
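As an illustrative sketch (not the text's own DDL), this SALES star schema could be declared as follows, using SQLite from Python; the measure names follow the fact constellation example later in this unit, and the column lists are abbreviated.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, state TEXT, country TEXT);

-- The central fact table: measures plus one foreign key per dimension;
-- the composite primary key is made up of all the foreign keys.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold   INTEGER,
    rupees_sold  REAL,
    PRIMARY KEY (time_key, item_key, branch_key, location_key)
);
""")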

What is Snowflake Schema?


 A snowflake schema is a variant of the star schema. "A schema is known as a snowflake if one or more dimension tables do not connect directly to the fact table but must join through other dimension tables."
 The snowflake schema is an expansion of the star schema in which each point of the star explodes into more points. It is called a snowflake schema because its diagram resembles a snowflake. Snowflaking is a method of normalizing the dimension tables in a star schema. When we normalize all the dimension tables entirely, the resultant structure resembles a snowflake with the fact table in the middle.
 Snowflaking is used to improve the performance of specific queries. The schema is diagrammed with each fact surrounded by its associated dimensions, and those dimensions are related to other dimensions, branching out into a snowflake pattern.
 The snowflake schema consists of one fact table which is linked to many dimension tables, which can be linked to other dimension tables through a many-to-one relationship. Tables in a snowflake schema are generally normalized to third normal form. Each dimension table represents exactly one level in a hierarchy.
 The following diagram shows a snowflake schema with two dimensions, each having three levels. A snowflake schema can have any number of dimensions, and each dimension can have any number of levels.


Example: The figure shows a snowflake schema with a Sales fact table and Store, Location, Time, Product, Line, and Family dimension tables. The Market dimension has two dimension tables, with Store as the primary dimension table and Location as the outrigger dimension table. The Product dimension has three dimension tables, with Product as the primary dimension table and the Line and Family tables as the outrigger dimension tables.
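A minimal sketch, assuming hypothetical column names, of how the Product dimension above snowflakes into Line and Family outrigger tables (SQLite DDL from Python):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Each table holds exactly one level of the Product hierarchy.
CREATE TABLE family_dim  (family_key INTEGER PRIMARY KEY, family_name TEXT);
CREATE TABLE line_dim    (line_key INTEGER PRIMARY KEY, line_name TEXT,
                          family_key INTEGER REFERENCES family_dim(family_key));
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT,
                          line_key INTEGER REFERENCES line_dim(line_key));

-- A query by Family must join fact -> product -> line -> family.
CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim(product_key),
    store_key   INTEGER,
    units_sold  INTEGER
);
""")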
Advantage of Snowflake Schema
The primary advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and the joining of smaller lookup tables.
It provides greater scalability in the interrelationship between dimension levels and components.
There is no redundancy, so it is easier to maintain.

Disadvantage of Snowflake Schema


The primary disadvantage of the snowflake schema is the additional maintenance effort required due to the increasing number of lookup tables.
Queries are more complex and hence more difficult to understand.
More tables mean more joins, and therefore longer query execution times.

What is Fact Constellation Schema


A fact constellation has two or more fact tables sharing one or more dimension tables. It is also called a galaxy schema or a multi-fact star schema.
The fact constellation schema describes a logical structure of a data warehouse or data mart. It can be designed with a collection of de-normalized fact, shared, and conformed dimension tables.

The fact constellation schema is a sophisticated database design from which it is difficult to summarize information. A fact constellation can be implemented between aggregate fact tables or by decomposing a complex fact table into independent simple fact tables.
Example: A fact constellation schema is shown in the figure below.

This schema defines two fact tables, sales and shipping. Sales are analyzed along four dimensions: time, item, branch, and location. The schema contains a fact table for sales that includes keys to each of the four dimensions, along with two measures: rupees_sold and units_sold. The shipping table has five dimensions, or keys (item_key, time_key, shipper_key, from_location, and to_location) and two measures: rupees_cost and units_shipped.
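A minimal sketch of this constellation in SQLite DDL from Python; the two fact tables share the conformed time and item dimension tables, and the remaining dimension tables are omitted for brevity.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared, conformed dimensions used by both fact tables.
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT);

CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER,
    location_key INTEGER,
    rupees_sold  REAL,
    units_sold   INTEGER
);

CREATE TABLE shipping_fact (
    time_key      INTEGER REFERENCES time_dim(time_key),
    item_key      INTEGER REFERENCES item_dim(item_key),
    shipper_key   INTEGER,
    from_location INTEGER,
    to_location   INTEGER,
    rupees_cost   REAL,
    units_shipped INTEGER
);
""")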
The primary disadvantage of the fact constellation schema is that it is a more challenging design, because many variants for specific kinds of aggregation must be considered and selected.

Data Warehouse Applications

Information Processing
This deals with querying, statistical analysis, and reporting via tables, charts, or graphs. Nowadays, information processing of a data warehouse is done by constructing low-cost, web-based access tools, typically integrated with web browsers.
Analytical Processing
This supports various online analytical processing operations such as drill-down, roll-up, and pivoting. The historical data is processed in both summarized and detailed formats.
