Chapter 1
Data is a collection of raw, unorganized facts and details, such as text,
observations, figures, symbols, and descriptions of things.
OR
“Data is a collection of facts and figures that can be recorded; it can take the
form of text, numbers, speech, video, or images. A database is a large amount of
inter-related data that is stored, retrieved, and collected in one place; in short,
it is a collection of inter-related data. Management is a collection of programs
for securing, retrieving, and storing the data.”
What is Information?
Data Information
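A Database Management System (DBMS) stores data in the form of tables, is
typically designed using an ER model, and aims to guarantee the ACID properties
(atomicity, consistency, isolation, durability) for its transactions. For
example, the DBMS of a college has tables for students, faculty, etc.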
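The college example can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a prescribed design: the table layout, the function names (build_college_db, enroll_batch, student_count), and the sample names are all assumptions made for the sketch. It shows two tables and one ACID property, atomicity: a batch of inserts either succeeds completely or leaves no trace.

```python
import sqlite3

def build_college_db():
    """Create an in-memory database with hypothetical students and faculty tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute("CREATE TABLE faculty (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.commit()
    return conn

def enroll_batch(conn, names):
    """Insert every name or none of them.

    Using the connection as a context manager commits the transaction on
    success and rolls it back if an exception is raised inside the block.
    """
    try:
        with conn:
            for name in names:
                conn.execute("INSERT INTO students (name) VALUES (?)", (name,))
        return True
    except sqlite3.IntegrityError:
        return False

def student_count(conn):
    return conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]

conn = build_college_db()
ok = enroll_batch(conn, ["Asha", "Ravi"])      # both rows are committed
failed = enroll_batch(conn, ["Meera", None])   # None violates NOT NULL, so the batch is rolled back
print(ok, failed, student_count(conn))
```

The second batch fails on its second row, so the first row of that batch is undone as well and the table still holds only the two students from the first batch.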
A Data Warehouse is separate from the DBMS. It stores a huge amount of data,
which is typically collected from multiple heterogeneous sources such as files,
DBMSs, etc. The goal is to produce statistical results that may help in
decision-making. For example, a college might want to quickly see different
results, such as how the placement of CS students has improved over the last
10 years in terms of salaries, counts, etc.
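The placement question above is a typical analytical query: it aggregates many historical rows instead of reading or updating a single record. A minimal sketch using sqlite3 follows; the placements table, its columns, the function name placement_trend, and the sample figures are all hypothetical and chosen only for illustration.

```python
import sqlite3

def placement_trend(rows):
    """Return (year, number of placements, average salary) for CS students, by year."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE placements (year INTEGER, department TEXT, salary REAL)")
    conn.executemany("INSERT INTO placements VALUES (?, ?, ?)", rows)
    cur = conn.execute(
        "SELECT year, COUNT(*), AVG(salary) FROM placements "
        "WHERE department = 'CS' GROUP BY year ORDER BY year"
    )
    result = cur.fetchall()
    conn.close()
    return result

# Made-up sample data: (year, department, salary)
sample_rows = [
    (2022, "CS", 400000.0),
    (2022, "CS", 600000.0),
    (2022, "EE", 450000.0),
    (2023, "CS", 700000.0),
]
trend = placement_trend(sample_rows)
print(trend)
```

In a real warehouse the same GROUP BY pattern would run over years of accumulated data from several source systems rather than a handful of rows.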
Issues That Occur While Building the Warehouse
● When and how to gather data: In a source-driven architecture for
gathering data, the data sources transmit new information, either continually
(as transaction processing takes place), or periodically (nightly, for
example). In a destination-driven architecture, the data warehouse
periodically sends requests for new data to the sources. Unless updates at the
sources are replicated at the warehouse via two-phase commit, the warehouse
will never be quite up to date with the sources. Two-phase commit is usually
far too expensive to be an option, so data warehouses typically have slightly
out-of-date data. That, however, is usually not a problem for decision-
support systems.
● What schema to use: Data sources that have been constructed
independently are likely to have different schemas. In fact, they may even
use different data models. Part of the task of a warehouse is to perform
schema integration, and to convert data to the integrated schema before they
are stored. As a result, the data stored in the warehouse are not just a copy of
the data at the sources. Instead, they can be thought of as a materialized view
of the data at the sources.
● Data transformation and cleansing: The task of correcting and
preprocessing data is called data cleansing. Data sources often deliver data
with numerous minor inconsistencies, which can be corrected. For example,
names are often misspelled, and addresses may have street, area, or city
names misspelled, or postal codes entered incorrectly. These can be
corrected to a reasonable extent by consulting a database of street names and
postal codes in each city. The approximate matching of data required for this
task is referred to as fuzzy lookup.
● How to propagate updates: Updates on relations at the data sources must be
propagated to the data warehouse. If the relations at the data warehouse are
exactly the same as those at the data source, the propagation is
straightforward. If they are not, the problem of propagating updates is
basically the view-maintenance problem.
● What data to summarize: The raw data generated by a transaction-
processing system may be too large to store online. However, we can answer
many queries by maintaining just summary data obtained by aggregation on
a relation, rather than maintaining the entire relation. For example, instead of
storing data about every sale of clothing, we can store total sales of clothing
by item name and category.
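Two of the issues above, data cleansing by fuzzy lookup and storing summary data, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production cleansing pipeline: the reference list KNOWN_CITIES, the function names clean_city and summarize_sales, the similarity cutoff of 0.8, and the sample sales are all hypothetical. The approximate matching uses difflib.get_close_matches from the standard library.

```python
import difflib
from collections import defaultdict

# Hypothetical reference list; a real warehouse would consult a database
# of street names and postal codes for each city.
KNOWN_CITIES = ["Mumbai", "Chennai", "Kolkata"]

def clean_city(raw, known=KNOWN_CITIES, cutoff=0.8):
    """Fuzzy lookup: return the closest known city, or the input unchanged if none is close."""
    candidate = raw.strip().title()
    matches = difflib.get_close_matches(candidate, known, n=1, cutoff=cutoff)
    return matches[0] if matches else raw

def summarize_sales(sales):
    """Aggregate (item_name, category, amount) rows into totals per item and category."""
    totals = defaultdict(float)
    for item_name, category, amount in sales:
        totals[(item_name, category)] += amount
    return dict(totals)

cleaned = [clean_city(c) for c in ["Mumbay", "chenai", "Delhi"]]
summary = summarize_sales([
    ("shirt", "menswear", 10.0),
    ("shirt", "menswear", 5.0),
    ("skirt", "womenswear", 7.0),
])
print(cleaned)
print(summary)
```

The cutoff controls how aggressive the correction is: a lower value fixes more misspellings but raises the risk of mapping a valid, unlisted name onto the wrong entry, which is why unmatched values are passed through unchanged here.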
Need for Data Warehouse
1. An ordinary database typically stores MBs to GBs of data, and does so for a
specific purpose. For data that reaches TB size, storage shifts to the data
warehouse.
2. A transactional database does not lend itself to analytics.
3. To perform analytics effectively, an organization keeps a central data
warehouse so that it can closely study its business by organizing,
understanding, and using its historical data to make strategic decisions and
analyze trends.
Benefits of Data Warehouse
● Better business analytics: A data warehouse plays an important role in every
business by storing all of the company's past data and records and making
them available for analysis, which deepens the company's understanding of
its data.
● Faster queries: A data warehouse is designed to handle large analytical
queries, which is why it typically runs them faster than an operational
database.
● Improved data quality: The data warehouse stores and analyzes the data
gathered from different sources without altering it or adding data of its
own, so the quality of the data is maintained. If a data quality issue does
arise, the data warehouse team resolves it.
● Historical Insight: The warehouse stores all your historical data which
contains details about the business so that one can analyze it at any time and
extract insights from it.
Data Warehouse vs DBMS
Database Data Warehouse
Subject-Oriented
A data warehouse focuses on the modeling and analysis of data for
decision-makers. Therefore, data warehouses typically provide a concise and
straightforward view around a particular subject, such as customer, product, or
sales, instead of the global organization's ongoing operations. This is done by
excluding data that are not useful concerning the subject and including all data
needed by the users to understand the subject.
Integrated
Time-Variant
Non-Volatile
Data from operational databases and external sources (such as user profile data
provided by external consultants) are extracted using application program
interfaces called gateways. A gateway is provided by the underlying DBMS
and allows client programs to generate SQL code to be executed at a server.
The middle tier consists of an OLAP server for fast querying of the data
warehouse.
Data warehouses require incremental loading of new data on a periodic basis
within narrow time windows; the performance of the load process should be
measured in hundreds of millions of rows and gigabytes per hour, and it must not
artificially constrain the volume of data the business requires.
Load Processing
Many steps must be performed to load new or updated data into the data
warehouse, including data conversion, filtering, reformatting, indexing, and
metadata update.
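The load steps listed above can be sketched as a single function using sqlite3. Everything specific here is an assumption made for illustration: the sales and load_metadata tables, the DD/MM/YYYY input date format, the rule that non-positive amounts are filtered out, and the function name load_batch. Real load tools differ considerably in scale and detail.

```python
import sqlite3
from datetime import datetime, timezone

def load_batch(conn, raw_rows):
    """Load raw (date 'DD/MM/YYYY', item, amount-as-text) rows and return the number loaded."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (sale_date TEXT, item TEXT, amount REAL)")
    conn.execute("CREATE TABLE IF NOT EXISTS load_metadata (loaded_at TEXT, rows_loaded INTEGER)")
    cleaned = []
    for sale_date, item, amount in raw_rows:
        try:
            value = float(amount)                                # conversion: text to number
            parsed = datetime.strptime(sale_date, "%d/%m/%Y")    # conversion: text to date
        except ValueError:
            continue                                             # filtering: drop unparseable rows
        if value <= 0:
            continue                                             # filtering: drop non-positive amounts
        # reformatting: ISO dates and normalized item names
        cleaned.append((parsed.date().isoformat(), item.strip().lower(), value))
    with conn:  # one transaction: commit on success, roll back on error
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)
        # indexing: support date-range queries on the loaded data
        conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_date ON sales (sale_date)")
        # metadata update: record when the load ran and how many rows it added
        conn.execute(
            "INSERT INTO load_metadata VALUES (?, ?)",
            (datetime.now(timezone.utc).isoformat(), len(cleaned)),
        )
    return len(cleaned)

warehouse = sqlite3.connect(":memory:")
loaded = load_batch(warehouse, [
    ("05/01/2024", " Shirt ", "19.5"),
    ("06/01/2024", "Skirt", "n/a"),   # dropped: amount is not a number
    ("07/01/2024", "Hat", "-3"),      # dropped: non-positive amount
])
print(loaded)
```

Running the insert, the index creation, and the metadata update in one transaction keeps the metadata consistent with what was actually loaded.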
Query Performance
Terabyte Scalability
Data warehouse sizes are growing at astonishing rates. Today their sizes range
from a few gigabytes to hundreds of gigabytes, with some reaching terabytes.