Assignment No 2
The successful implementation of a data warehouse can bring major benefits to an organization
including:
• Competitive advantage
The huge returns on investment for those companies that have successfully implemented a data
warehouse are evidence of the enormous competitive advantage that accompanies this technology.
The competitive advantage is gained by allowing decision-makers access to data that can reveal
previously unavailable, unknown, and untapped information on, for example, customers, trends,
and demands.
Data warehousing helps to reduce the overall cost of the product by reducing the number of
channels.
2. Features of Datawarehouse?
Ans:
Subject Oriented – One of the key features of a data warehouse is the orientation it
follows. A data warehouse is organized around subjects, for example sales and revenue,
and not around the organization's ongoing, current operations. This enables it to be used
for data analysis, which is a key element of decision-making.
Integrated – As another feature that eases the analysis of data, a data warehouse's core
is its integration of data from several different sources which are not homogeneous in
nature, for example, flat files, relational databases, and other such sources. This plays
a key role in enhancing the efficacy of data analysis.
Non-volatile – The data in a warehouse is non-volatile, which ensures that your
previous data is not lost as new data is loaded. This separates a warehouse from
operational databases, which are subject to frequent changes.
Time Variant – What’s the significance of data without a time stamp? Data loaded into
a warehouse is identified with a particular time period, so the warehouse gives a
historical view whenever you access the data.
No Additional Controls – As the warehouse is maintained separately and has separate
storage from the operational databases, it doesn’t require any concurrency controls,
tweaks in processing, or recovery mechanisms.
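The integrated, non-volatile and time-variant features above can be sketched as a small append-only load. This is only an illustration: the two sources, their column names and the dates are made up for the example.

```python
import csv
import io
from datetime import date

# Hypothetical source 1: a flat file (CSV) with its own column names.
FLAT_FILE = "cust,amount\nalice,120\nbob,80\n"
# Hypothetical source 2: rows as they might come from a relational database.
DB_ROWS = [{"customer_name": "carol", "sale_amount": 200.0}]

warehouse = []  # append-only list standing in for a warehouse table


def load(rows, snapshot_date):
    """Append rows with a time stamp; existing rows are never modified."""
    for customer, amount in rows:
        warehouse.append(
            {"customer": customer, "amount": float(amount), "loaded_on": snapshot_date}
        )


# Integrated: both sources are mapped onto one common layout before loading.
flat_rows = [(r["cust"], r["amount"]) for r in csv.DictReader(io.StringIO(FLAT_FILE))]
db_rows = [(r["customer_name"], r["sale_amount"]) for r in DB_ROWS]

# Non-volatile: the second load adds rows and leaves the first load untouched.
load(flat_rows, date(2024, 1, 31))
load(db_rows, date(2024, 2, 29))

# Time variant: every row can be placed on a timeline.
january = [r for r in warehouse if r["loaded_on"] == date(2024, 1, 31)]
```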
3. Explain Datawarehouse Architecture?
Ans:
1. Top-down approach:
The essential components are discussed below:
1. External Sources –
An external source is a source from which data is collected, irrespective of the type of data.
Data can be structured, semi-structured or unstructured.
2. Stage Area –
Since the data extracted from the external sources does not follow a particular format, it
needs to be validated before it is loaded into the data warehouse. For this purpose, it is
recommended to use an ETL tool.
This approach is defined by Inmon: the data warehouse is a central repository for the complete
organization, and data marts are created from it after the complete data warehouse has been
created.
2. Bottom-up approach:
1. First, the data is extracted from external sources (same as happens in top-down
approach).
2. Then, the data goes through the staging area (as explained above) and is loaded into data
marts instead of the data warehouse. The data marts are created first and provide reporting
capability. Each data mart addresses a single business area.
3. These data marts are then integrated into the data warehouse.
This approach is given by Kimball: data marts are created first and provide a thin view for
analysis, and the data warehouse is created after the complete data marts have been created.
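The difference between the two approaches is mainly the order in which the warehouse and the data marts are built. A rough sketch, assuming the staged rows are already cleaned and carry a hypothetical `area` field naming the business area:

```python
# Hypothetical cleaned rows coming out of the staging area.
staged = [
    {"area": "sales", "value": 100},
    {"area": "sales", "value": 50},
    {"area": "hr", "value": 7},
]


def split_by_area(rows):
    """Group rows into one data mart per business area."""
    marts = {}
    for row in rows:
        marts.setdefault(row["area"], []).append(row)
    return marts


def top_down(rows):
    """Top-down: load the central warehouse first, then derive the marts."""
    warehouse = list(rows)
    marts = split_by_area(warehouse)
    return warehouse, marts


def bottom_up(rows):
    """Bottom-up: build the marts first, then integrate them."""
    marts = split_by_area(rows)
    warehouse = [row for mart in marts.values() for row in mart]
    return warehouse, marts


td_warehouse, td_marts = top_down(staged)
bu_warehouse, bu_marts = bottom_up(staged)
```

In this toy case both routes end with the same marts; what differs is which structure exists first and can serve reports early.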
4. Difference between OLAP & OLTP?
Ans:
Ans:
a) Roll – Up:
b) Drill Down:
Drill down is a dimension expansion technique that can be applied to the data cube.
Dimension expansion means adding a new dimension or expanding existing dimensions
along any axis of the data cube using the notion of a concept hierarchy.
Consider the example of sales of four companies C1, C2, C3 & C4 per quarter based on
product category (Men’s, Women’s, Electronics & Home). Out of the four companies,
two companies are from India (C1 & C2) and two are from America (C3 & C4). So, if we
want to perform the drill down operation on the given data cube, we can do it by expanding
the existing shopping categories, such as:
o Men’s: Clothing & Footwear.
o Women’s: Clothing & Footwear.
o Home: Appliances & Decor.
o Electronics: Mobile & Camera.
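The expansion of a category into its sub-categories can be sketched in a few lines. The sales figures are invented for illustration, and the tuple layout (company, category, sub-category, quarter) is an assumption of this sketch.

```python
from collections import defaultdict

# Hypothetical detailed sales: (company, category, sub_category, quarter) -> amount.
detail = {
    ("C1", "Men's", "Clothing", "Q1"): 40,
    ("C1", "Men's", "Footwear", "Q1"): 10,
    ("C1", "Home", "Appliances", "Q1"): 25,
    ("C1", "Home", "Decor", "Q1"): 5,
}

# The summarized cube only knows the category level.
by_category = defaultdict(int)
for (company, category, _sub, quarter), amount in detail.items():
    by_category[(company, category, quarter)] += amount


def drill_down(company, category, quarter):
    """Expand one category cell into its sub-category cells."""
    return {
        sub: amount
        for (c, cat, sub, q), amount in detail.items()
        if (c, cat, q) == (company, category, quarter)
    }


mens_total = by_category[("C1", "Men's", "Q1")]
mens_detail = drill_down("C1", "Men's", "Q1")
```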
c) Slice & Dice:
Performing the slice operation, a single dimension of the data cube can be extracted out to
form a new cube. Similarly, more than one dimension can also be extracted out from the
same data cube as required.
Consider the example of sales of four companies C1, C2, C3 & C4 per quarter on the
basis of product category (Men’s, Women’s, Electronics & Home). Out of the four
companies, two companies are from India (C1 & C2) and two are from America (C3 &
C4). A dimension (Shopping, Sales Per Quarter) can be sliced from the data cube through
the slice operation.
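A minimal sketch of a slice, read here in the common sense of fixing one dimension to a single value so that a lower-dimensional cube remains. The cube contents and the helper name `slice_cube` are hypothetical.

```python
# Hypothetical cube: (company, category, quarter) -> sales amount.
cube = {
    ("C1", "Men's", "Q1"): 50,
    ("C1", "Home", "Q1"): 30,
    ("C1", "Men's", "Q2"): 60,
    ("C3", "Electronics", "Q1"): 90,
    ("C3", "Electronics", "Q2"): 70,
}


def slice_cube(cube, axis, value):
    """Fix one dimension to a single value and drop it from the keys."""
    return {
        key[:axis] + key[axis + 1:]: amount
        for key, amount in cube.items()
        if key[axis] == value
    }


# Slice on the quarter dimension (axis 2): only Q1 remains, keyed by (company, category).
q1 = slice_cube(cube, axis=2, value="Q1")
```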
Through the dice operation, a sub-cube can be generated by selecting two or more
dimensions from the data cube.
Consider the example of sales of four companies C1, C2, C3 & C4 per quarter based on
product category (Men’s, Women’s, Electronics & Home). Out of the four companies,
two companies are from India (C1 & C2) and two are from America (C3 & C4). So, if we
want to perform the dice operation on the given data cube, we can do it by selecting any two
parameters across all the three dimensions, i.e. Companies (C1, C2), Category (Home,
Appliances) & Sales (Q1, Q2).
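A dice can be sketched as a filter on several dimensions at once. The figures are invented, and the category selection (Home, Electronics) is an assumption of this sketch so that both values are top-level categories of the example cube.

```python
# Hypothetical cube: (company, category, quarter) -> sales amount.
cube = {
    ("C1", "Home", "Q1"): 30,
    ("C1", "Home", "Q3"): 20,
    ("C2", "Electronics", "Q2"): 45,
    ("C2", "Men's", "Q1"): 10,
    ("C3", "Home", "Q1"): 15,
}


def dice(cube, companies, categories, quarters):
    """Keep only the cells whose value on every dimension is in the selection."""
    return {
        key: amount
        for key, amount in cube.items()
        if key[0] in companies and key[1] in categories and key[2] in quarters
    }


sub_cube = dice(
    cube,
    companies={"C1", "C2"},
    categories={"Home", "Electronics"},
    quarters={"Q1", "Q2"},
)
```

Unlike a slice, the result keeps all three dimensions; it is simply a smaller cube.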
d) Pivot (Rotate):
Rotating the data cube’s orientation to see its other data views is known as the pivot
operation. The pivot operation provides alternate views of the data available to the users.
Consider the example of sales of four companies C1, C2, C3 & C4 per quarter based on
product category (Men’s, Women’s, Electronics & Home). Out of the four companies,
two companies are from India (C1 & C2) and two are from America (C3 & C4). So, if we
want to perform the pivot operation, we can do it by rotating any one of the dimensions of
the data cube.
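For a two-dimensional view, a pivot amounts to swapping rows and columns. A small sketch with invented figures, where rows are companies and columns are quarters:

```python
# Hypothetical 2-D view: company -> {quarter -> sales amount}.
view = {
    "C1": {"Q1": 50, "Q2": 60},
    "C3": {"Q1": 90, "Q2": 70},
}


def pivot(table):
    """Rotate the view so that the column labels become the row labels."""
    rotated = {}
    for row, columns in table.items():
        for column, value in columns.items():
            rotated.setdefault(column, {})[row] = value
    return rotated


by_quarter = pivot(view)
```

The values themselves do not change; only the orientation in which they are presented does, and pivoting twice gives back the original view.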
Ans:
Star schema is the fundamental schema among the data mart schemas, and it is the simplest. This
schema is widely used to develop or build a data warehouse and dimensional data marts. It
includes one or more fact tables referencing any number of dimension tables. The star schema is a
special case of the snowflake schema. It is also efficient for handling basic queries.
It is called a star schema because its physical model resembles a star shape, with a fact table at its
center and the dimension tables at its periphery representing the star’s points. Below is an
example to demonstrate the Star Schema:
In the above demonstration, SALES is a fact table having the attributes (Product ID, Order ID,
Customer ID, Employee ID, Total, Quantity, Discount), whose ID attributes reference the dimension tables.
Employee dimension table contains the attributes: Emp ID, Emp Name, Title, Department and
Region. Product dimension table contains the attributes: Product ID, Product Name, Product
Category, Unit Price. Customer dimension table contains the attributes: Customer ID, Customer
Name, Address, City, Zip. Time dimension table contains the attributes: Order ID, Order Date,
Year, Quarter, Month.
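A reduced version of this star schema can be tried out with Python's built-in `sqlite3` module. The sketch keeps only the Product and Customer dimensions and a few columns; the table names, column names and rows are illustrative assumptions, not the full schema described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes.
CREATE TABLE product  (product_id INTEGER PRIMARY KEY, product_name TEXT, unit_price REAL);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT, city TEXT);

-- Fact table at the centre: keys into the dimensions plus numeric measures.
CREATE TABLE sales (
    product_id  INTEGER REFERENCES product(product_id),
    customer_id INTEGER REFERENCES customer(customer_id),
    quantity    INTEGER,
    total       REAL
);

INSERT INTO product  VALUES (1, 'Pen', 2.0), (2, 'Book', 10.0);
INSERT INTO customer VALUES (1, 'Asha', 'Pune'), (2, 'Ravi', 'Delhi');
INSERT INTO sales    VALUES (1, 1, 5, 10.0), (2, 1, 1, 10.0), (2, 2, 3, 30.0);
""")

# A typical star-schema query: join the fact table to one dimension and aggregate.
sales_by_city = conn.execute("""
    SELECT c.city, SUM(s.total)
    FROM sales AS s
    JOIN customer AS c ON c.customer_id = s.customer_id
    GROUP BY c.city
    ORDER BY c.city
""").fetchall()
```

Each query needs only one join per dimension, which is why the star layout handles basic queries efficiently.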
Ans:
ETL provides a well-defined process for extracting data from varied sources and loading it into the
data warehouse in a consolidated format.
Data Extraction
It is the first step in the ETL process. During this phase the required data is first identified and
then extracted from varied sources, like database systems and applications, using as few
resources as possible.
During the extraction stage, more data often gets extracted than is actually required.
Size of extracted data can range from hundreds of kilobytes up to gigabytes.
Depending upon the capabilities of the source system, some transformation might take place
during the extraction process itself.
Designing and creating the extraction process is the most time-consuming part of the ETL process.
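The three ETL stages can be sketched end to end with the standard library. The CSV content, the column names and the `orders` table are hypothetical; a real extraction would read from the organization's actual source systems.

```python
import csv
import io
import sqlite3

# Hypothetical source data; the second row has a non-numeric amount.
SOURCE_CSV = "order_id,amount,status\n1, 10.5 ,Fulfilled\n2,abc,fulfilled\n3,7,pending\n"


def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert types, normalize text, drop rows that fail validation."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount is not a number
        clean.append((int(row["order_id"]), amount, row["status"].strip().lower()))
    return clean


def load(rows, conn):
    """Load: write the consolidated rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, status TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute("SELECT order_id, amount, status FROM orders ORDER BY order_id").fetchall()
```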
Identification of Data Source.
The first stage of data extraction is the identification of all the suitable data sources.
This process not only identifies the data sources but also ensures that each data source and the
extracted data will add value to the data warehouse.
Let us assume that an organization designs a database to provide strategic information on
the orders that are fulfilled.
To do that, it needs the records of previous as well as current fulfilled and pending orders.
Now if orders are fulfilled through multiple channels, then the organization also needs
reports about these channels.
The order fact table contains data related to orders, such as date of delivery, item no., item
codes, discounts and credit limit.
The dimension tables contain the details about products, customers and channels.
The organization also needs to ensure that it has the correct data sources needed for the
database and that each data source is able to supply correct data to each data element.
Identification of data sources is a crucial step in the data extraction process. We need to go
through source identification and ensure that every bit of data entered into the data
warehouse is authenticated.
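One simple way to check that every data element has a source is to keep an explicit mapping and look for gaps. The element names and source system names below are hypothetical, chosen to echo the order example.

```python
# Hypothetical data elements needed by the order fact table and its dimensions.
TARGET_ELEMENTS = ["delivery_date", "item_no", "discount", "customer_name", "channel"]

# Hypothetical mapping from data element to the source system that supplies it.
SOURCE_MAP = {
    "delivery_date": "order_system",
    "item_no": "order_system",
    "discount": "billing_system",
    "customer_name": "crm",
}


def unmapped(elements, source_map):
    """Return the data elements for which no source has been identified yet."""
    return [element for element in elements if element not in source_map]


missing = unmapped(TARGET_ELEMENTS, SOURCE_MAP)
```

Here `missing` lists the `channel` element, signalling that a source for channel data still has to be identified before extraction is designed.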
8. Consider a data warehouse storing sales details of various goods sold and
the time of the sale. Using this example, explain the following OLAP operations:
a) Roll - Up
b) Drill Down
c) Slice & Dice
d) Pivot (Rotate)
Ans:
1. Drill down: In drill-down operation, the less detailed data is converted into highly
detailed data. It can be done by:
o Moving down in the concept hierarchy
o Adding a new dimension
In the cube given in the overview section, the drill-down operation is performed by moving
down in the concept hierarchy of the Time dimension (Quarter -> Month).
2. Roll up: It is just the opposite of the drill-down operation. It performs aggregation on the
OLAP cube. It can be done by:
o Climbing up in the concept hierarchy
o Reducing the dimensions
In the cube given in the overview section, the roll-up operation is performed by climbing
up in the concept hierarchy of Location dimension (City -> Country).
3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions. In
the cube given in the overview section, a sub-cube is selected by selecting following
dimensions with criteria:
o Location = “Delhi” or “Kolkata”
o Time = “Q1” or “Q2”
o Item = “Car” or “Bus”
4. Slice: It selects a single dimension from the OLAP cube which results in a new sub-cube
creation. In the cube given in the overview section, Slice is performed on the dimension
Time = “Q1”.
5. Pivot: It is also known as rotation operation as it rotates the current view to get a new
view of the representation. In the sub-cube obtained after the slice operation, performing
pivot operation gives a new view of it.
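Roll-up by climbing the Location hierarchy (City -> Country) can be sketched as an aggregation. The cube from the overview section is not reproduced here, so the cities, items and unit counts below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical concept hierarchy for the Location dimension.
CITY_TO_COUNTRY = {"Delhi": "India", "Kolkata": "India", "Toronto": "Canada"}

# Hypothetical cube: (city, quarter, item) -> units sold.
sales = {
    ("Delhi", "Q1", "Car"): 5,
    ("Kolkata", "Q1", "Car"): 3,
    ("Toronto", "Q1", "Car"): 4,
    ("Delhi", "Q2", "Bus"): 2,
}


def roll_up(cube):
    """Replace each city by its country and sum the cells that now coincide."""
    rolled = defaultdict(int)
    for (city, quarter, item), units in cube.items():
        rolled[(CITY_TO_COUNTRY[city], quarter, item)] += units
    return dict(rolled)


by_country = roll_up(sales)
```

Drill-down is the reverse direction: it needs the finer-grained cells (here, the per-city figures) to still be available.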
Categories of Metadata
Metadata can be broadly categorized into three categories −
Business Metadata − It has the data ownership information, business definition, and
changing policies.
Technical Metadata − It includes database system names, table and column names and
sizes, data types and allowed values. Technical metadata also includes structural
information such as primary and foreign key attributes and indices.
Operational Metadata − It includes currency of data and data lineage. Currency of data
means whether the data is active, archived, or purged. Lineage of data means the history
of data migrated and transformation applied on it.
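The three categories can be pictured as fields of one catalog entry per column. This is only a sketch: the class name, field names and the sample entry are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ColumnMetadata:
    # Technical metadata
    table: str
    column: str
    data_type: str
    # Business metadata
    owner: str
    definition: str
    # Operational metadata
    currency: str  # "active", "archived" or "purged"
    lineage: str   # where the data came from and how it was transformed


catalog = [
    ColumnMetadata(
        table="sales",
        column="total",
        data_type="REAL",
        owner="Finance",
        definition="Order total after discount",
        currency="active",
        lineage="order_system.amount minus billing_system.discount",
    ),
]


def find_by_definition(entries, term):
    """Use the catalog as a directory: find columns whose definition mentions a term."""
    return [e for e in entries if term.lower() in e.definition.lower()]


matches = find_by_definition(catalog, "discount")
```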
Role of Metadata
Metadata has a very important role in a data warehouse. Its role is different from that of the
warehouse data itself. The various roles of metadata
are explained below.
Metadata acts as a directory.
This directory helps the decision support system to locate the contents of the data
warehouse.
Metadata helps the decision support system map data when it is transformed
from the operational environment to the data warehouse environment.
Metadata helps in summarization between current detailed data and highly summarized
data.
Metadata also helps in summarization between lightly detailed data and highly
summarized data.
Metadata is used for query tools.
Metadata is used in extraction and cleansing tools.
Metadata is used in reporting tools.
Metadata is used in transformation tools.
Metadata plays an important role in loading functions.
The following diagram shows the roles of metadata.
Here, the pink-colored dimension tables are the ones common to both star schemas.
The green-colored tables are the fact tables of their respective star schemas.
Example:
In the above demonstration:
Placement is a fact table having attributes: (Stud_roll, Company_id, TPO_id) with facts:
(Number of students eligible, Number of students placed).
Workshop is a fact table having attributes: (Stud_roll, Institute_id, TPO_id) with facts:
(Number of students selected, Number of students attended the workshop).
Company is a dimension table having attributes: (Company_id, Name, Offer_package).
Student is a dimension table having attributes: (Student_roll, Name, CGPA).
TPO is a dimension table having attributes: (TPO_id, Name, Age).
Training Institute is a dimension table having attributes: (Institute_id, Name,
Full_course_fee).
So, there are two fact tables, Placement and Workshop, which belong to two different
star schemas. The star schema with fact table Placement has the dimension tables Company,
Student and TPO; the star schema with fact table Workshop has the dimension tables
Training Institute, Student and TPO. The two star schemas have two dimension tables in
common and hence form a fact constellation, or galaxy schema.
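The sharing of dimension tables can be tried out with `sqlite3`. This sketch is deliberately reduced: it creates only the two shared dimensions (Student and TPO), leaves out the Company and Training Institute dimensions, keeps one fact per fact table, and uses invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared dimension tables.
CREATE TABLE student (stud_roll INTEGER PRIMARY KEY, name TEXT, cgpa REAL);
CREATE TABLE tpo     (tpo_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);

-- Two fact tables that both reference the shared dimensions.
CREATE TABLE placement (
    stud_roll INTEGER REFERENCES student(stud_roll),
    company_id INTEGER,
    tpo_id INTEGER REFERENCES tpo(tpo_id),
    students_placed INTEGER
);
CREATE TABLE workshop (
    stud_roll INTEGER REFERENCES student(stud_roll),
    institute_id INTEGER,
    tpo_id INTEGER REFERENCES tpo(tpo_id),
    students_attended INTEGER
);

INSERT INTO student   VALUES (1, 'Meera', 8.5), (2, 'Arjun', 7.9);
INSERT INTO tpo       VALUES (1, 'Rao', 45);
INSERT INTO placement VALUES (1, 10, 1, 1);
INSERT INTO workshop  VALUES (1, 20, 1, 1), (2, 20, 1, 1);
""")

# Both fact tables are analysed through the same Student dimension.
placed = conn.execute("""
    SELECT s.name, SUM(p.students_placed)
    FROM placement AS p JOIN student AS s ON s.stud_roll = p.stud_roll
    GROUP BY s.name ORDER BY s.name
""").fetchall()
attended = conn.execute("""
    SELECT s.name, SUM(w.students_attended)
    FROM workshop AS w JOIN student AS s ON s.stud_roll = w.stud_roll
    GROUP BY s.name ORDER BY s.name
""").fetchall()
```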
Advantage: Provides a flexible schema.
Disadvantage: It is much more complex and hence, hard to implement and maintain.
Q.12 Draw the Star Schema and Snowflake Schema.
Consider the PLACEMENT CELL DEPARTMENT
Ans:
Star Schema:
Snowflake Schema: