DBMS Module 4
Data warehousing
What is Data Warehousing?
Data Warehousing (DW) is the process of collecting and managing data from varied sources to provide
meaningful business insights. A Data warehouse is typically used to connect and analyze business data
from heterogeneous sources. The data warehouse is the core of the BI system, which is built for data
analysis and reporting.
It is a blend of technologies and components that aids the strategic use of data. It is the electronic
storage of a large amount of information by a business, designed for query and analysis rather than
transaction processing. It is the process of transforming data into information and making it available to
users in a timely manner so that it makes a difference.
The concept was introduced by Bill Inmon, who is considered the father of the data warehouse.
The decision support database (Data Warehouse) is maintained separately from the organization's
operational database. However, the data warehouse is not a product but an environment. It is an
architectural construct of an information system that provides users with current and historical
decision support information which is difficult to access or present in the traditional operational data
store.
Figure: Structured, semi-structured, and unstructured data sources are ingested into the Data Warehouse, which in turn serves analytic applications.
The data is processed, transformed, and ingested so that users can access the processed data in the
Data Warehouse through Business Intelligence tools, SQL clients, and spreadsheets. A data warehouse
merges information coming from different sources into one comprehensive database.
By merging all of this information in one place, an organization can analyze its customers more
holistically. This helps to ensure that it has considered all the information available. Data warehousing
makes data mining possible. Data mining is looking for patterns in the data that may lead to higher sales
and profits.
1. Enterprise Data Warehouse (EDW):
An Enterprise Data Warehouse (EDW) is a centralized warehouse. It provides decision support services
across the enterprise. It offers a unified approach for organizing and representing data. It also provides
the ability to classify data according to subject and gives access according to those divisions.
2. Operational Data Store (ODS):
An Operational Data Store, also called an ODS, is a data store required when neither the Data
warehouse nor OLTP systems support an organization's reporting needs. An ODS is
refreshed in real time. Hence, it is widely preferred for routine activities like storing records of the
Employees.
3. Data Mart:
A data mart is a subset of the data warehouse. It is specially designed for a particular line of business,
such as sales or finance. In an independent data mart, data can be collected directly from the sources.
The following are general stages of use of the data warehouse (DWH):
Offline Operational Database: In this stage, data is just copied from an operational system to
another server. In this way, loading, processing, and reporting of the copied data do not impact
the operational system's performance.
Offline Data Warehouse: Data in the Data warehouse is regularly updated from the Operational
Database. The data in the Data warehouse is mapped and transformed to meet the Data warehouse
objectives.
Real-time Data Warehouse: In this stage, Data warehouses are updated whenever any
transaction takes place in the operational database. For example, an airline or railway booking system.
Integrated Data Warehouse: In this stage, Data Warehouses are updated continuously when the
operational system performs a transaction. The Data warehouse then generates transactions
which are passed back to the operational system.
1. Load manager: The load manager is also called the front-end component. It performs all the operations
associated with the extraction and loading of data into the warehouse. These operations include
transformations to prepare the data for entry into the Data warehouse.
2. Warehouse Manager: The warehouse manager performs operations associated with the management of
the data in the warehouse. It performs operations like analysis of data to ensure consistency, creation of
indexes and views, generation of denormalizations and aggregations, transformation and merging of
source data, and archiving and backing up data.
3. Query Manager: The query manager is also known as the back-end component. It performs all the
operations related to the management of user queries. This component directs queries to the
appropriate tables and schedules the execution of queries.
What Is a Data Warehouse Used For?
Here are the most common sectors where a Data warehouse is used:
Airline: In the airline system, it is used for operational purposes such as crew assignment, analysis of route
profitability, frequent flyer program promotions, etc.
Banking: It is widely used in the banking sector to manage the resources available on desk effectively.
A few banks also use it for market research and for performance analysis of products and operations.
Healthcare: The healthcare sector also uses the Data warehouse to strategize and predict outcomes, generate
patients' treatment reports, and share data with tie-in insurance companies, medical aid services, etc.
Public sector: In the public sector, the data warehouse is used for intelligence gathering. It helps government
agencies to maintain and analyze tax records and health policy records for every individual.
Investment and Insurance sector: In this sector, warehouses are primarily used to analyze data
patterns and customer trends, and to track market movements.
Retail chain: In retail chains, the Data warehouse is widely used for distribution and marketing. It also helps
to track items, customer buying patterns, and promotions, and is used for determining pricing policy.
Telecommunication: A data warehouse is used in this sector for product promotions, sales decisions, and
distribution decisions.
Hospitality Industry: This industry utilizes warehouse services to design and estimate advertising and
promotion campaigns that target clients based on their feedback and travel patterns.
Advantages of Data Warehouse
A Data warehouse allows business users to quickly access critical data from multiple sources in
one place.
A Data warehouse provides consistent information on various cross-functional activities. It also
supports ad-hoc reporting and queries.
A Data Warehouse helps to integrate many sources of data to reduce stress on the production
system.
A Data warehouse helps to reduce the total turnaround time for analysis and reporting.
Restructuring and integration make it easier for the user to use for reporting and analysis.
A Data warehouse allows users to access critical data from a number of sources in a single place.
Therefore, it saves the user's time in retrieving data from multiple sources.
A Data warehouse stores a large amount of historical data. This helps users to analyze different
time periods and trends to make future predictions.
Disadvantages of Data Warehouse
It is difficult to make changes in data types and ranges, data source schema, indexes, and queries.
The data warehouse may seem easy, but actually, it is too complex for average users.
Despite best efforts at project management, the scope of a data warehousing project will always
increase.
Organisations need to spend a lot of resources on training and implementation.
In data warehousing, an operational system refers to a system that processes the day-to-day
transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in the system
must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Metadata summarizes necessary information about data, which can make finding and working with
particular instances of data easier. For example, author, date created, date modified, and file
size are examples of very basic document metadata. Metadata is used to direct a query to the most
appropriate data source.
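As a small illustration of the idea, the sketch below reads a few pieces of basic file metadata (name, size, last-modified time) in Python. The file name "report.csv" is hypothetical and used only for illustration.

import os
import datetime

# Minimal sketch: reading basic metadata about a file.
# "report.csv" is a hypothetical file name used only for illustration.
path = "report.csv"

info = os.stat(path)
metadata = {
    "file_name": os.path.basename(path),
    "size_bytes": info.st_size,
    "last_modified": datetime.datetime.fromtimestamp(info.st_mtime).isoformat(),
}
print(metadata)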
This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated)
data generated by the warehouse manager.
The goal of the summarized information is to speed up query performance. The summarized records are
updated continuously as new information is loaded into the warehouse.
The principal purpose of a data warehouse is to provide information to business managers for
strategic decision-making. These users interact with the warehouse using end-client access tools.
1. Separation: Analytical and transactional processing should be kept apart as much as possible.
2. Scalability: Hardware and software architectures should be simple to upgrade as the data volume,
which has to be managed and processed, and the number of users' requirements, which have to be met,
progressively increase.
3. Extensibility: The architecture should be able to accommodate new operations and technologies without
redesigning the whole system.
4. Security: Monitoring accesses is necessary because of the strategic data stored in the data
warehouse.
5. Administerability: Data Warehouse management should not be complicated.
Data Transformation − Involves converting the data from legacy format to warehouse format.
Data Loading − Involves sorting, summarizing, consolidating, checking integrity, and building indices
and partitions.
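The transformation and loading steps described above can be sketched in a few lines of Python. The sketch below is only illustrative and assumes a hypothetical flat file "sales.csv" with "region" and "amount" columns and a SQLite database as the warehouse target; it extracts the rows, cleans and summarizes them, loads the result, and builds an index.

import csv
import sqlite3

# Minimal ETL sketch: extract rows from a flat file, transform them into the
# warehouse format, and load them into a summary table with an index.
# "sales.csv" and the column names are hypothetical, for illustration only.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales_summary (region TEXT, total REAL)")

totals = {}
with open("sales.csv", newline="") as f:
    for row in csv.DictReader(f):                     # Extract
        region = row["region"].strip().upper()        # Transform: clean the key
        totals[region] = totals.get(region, 0.0) + float(row["amount"])  # Summarize

conn.executemany("INSERT INTO sales_summary VALUES (?, ?)", totals.items())  # Load
conn.execute("CREATE INDEX IF NOT EXISTS idx_region ON sales_summary(region)")  # Build index
conn.commit()
conn.close()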
1. Virtual Warehouse
2. Data mart
3. Enterprise Warehouse
Virtual Warehouse
The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual
warehouse. Building a virtual warehouse requires excess capacity on operational database servers.
Data Mart
Data mart contains a subset of organization-wide data. This subset of data is valuable to specific groups
of an organization.
In other words, we can claim that data marts contain data specific to a particular group. For example, the
marketing data mart may contain data related to items, customers, and sales. Data marts are confined to
subjects.
Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented
on low-cost servers.
The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than
months or years.
The life cycle of a data mart may be complex in the long run if its planning and design are not
organization-wide.
Data marts are small in size.
Data marts are customized by department.
The source of a data mart is a departmentally structured data warehouse.
Data marts are flexible.
Enterprise Warehouse
An enterprise warehouse collects all the information and subjects spanning an entire organization.
It provides enterprise-wide data integration. The data is integrated from operational systems and
external information providers.
This information can vary from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
Generally, a data warehouse adopts a three-tier architecture.
Bottom Tier − The bottom tier of the architecture is the data warehouse database server. It is the
relational database system. We use back-end tools and utilities to feed data into the bottom tier.
These back-end tools and utilities perform the Extract, Clean, Load, and Refresh functions.
Middle Tier − In the middle tier, we have the OLAP Server, which can be implemented in either of the
following ways.
By Relational OLAP (ROLAP), which is an extended relational database management system. ROLAP
maps the operations on multidimensional data to standard relational operations (a small sketch of this
mapping appears after the tier descriptions below).
By the Multidimensional OLAP (MOLAP) model, which directly implements multidimensional data and
operations.
Top Tier − This tier is the front-end client layer. This layer holds the query tools, reporting tools,
analysis tools, and data mining tools.
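As noted for the middle tier, ROLAP maps operations on multidimensional data onto standard relational operations. The sketch below illustrates the idea with a hypothetical sales table in SQLite: a roll-up over the region and year dimensions is expressed as an ordinary GROUP BY.

import sqlite3

# Minimal sketch of the ROLAP idea: a multidimensional "roll-up" (total sales per
# region and year) expressed as a standard relational GROUP BY.
# The table and values are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", 2023, 120.0), ("North", 2024, 150.0), ("South", 2024, 90.0)],
)

# Roll-up along the (region, year) dimensions using ordinary SQL.
for row in conn.execute(
    "SELECT region, year, SUM(amount) FROM sales GROUP BY region, year"
):
    print(row)
conn.close()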
The process of extracting information from huge sets of data to identify patterns, trends, and useful
data that allow the business to make data-driven decisions is called Data Mining.
Data mining is the act of automatically searching large stores of information to find trends and
patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical
algorithms to segment the data and evaluate the probability of future events.
This process includes various types of services such as text mining, web mining, audio and video mining,
pictorial data mining, and social media mining. It is done through software that may be simple or highly
specialized.
Types of Data Mining
Data mining can be performed on the following types of data:
Relational Database:
A relational database is a collection of multiple data sets formally organized by tables, records, and
columns, from which data can be accessed in various ways without having to reorganize the database
tables. Tables convey and share information, which facilitates data searchability, reporting, and
organization.
Data warehouses:
A Data Warehouse is the technology that collects the data from various sources within the organization
to provide meaningful business insights. The huge amount of data comes from multiple places such as
Marketing and Finance. The extracted data is utilized for analytical purposes and helps in decision-
making for a business organization. The data warehouse is designed for the analysis of data rather than
transaction processing.
Data Repositories:
The Data Repository generally refers to a destination for data storage. For example, a group of
databases, where an organization has kept various kinds of information.
Object-Relational Database:
One of the primary objectives of the Object-relational data model is to close the gap between the
relational database and the object-oriented model.
Transactional Database:
A transactional database refers to a database management system (DBMS) that has the potential to
undo a database transaction if it is not performed appropriately.
Advantages of Data Mining
Data mining enables organizations to make lucrative modifications in operation and production.
Compared with other statistical data applications, data mining is cost-efficient.
It facilitates the automated discovery of hidden patterns as well as the prediction of trends and
behaviors.
It is a quick process that makes it easy for new users to analyze enormous amounts of data in a
short time.
Disadvantages of Data Mining
Many data mining analytics tools are difficult to operate and need advanced training to work
with.
Different data mining instruments operate in distinct ways due to the different algorithms used
in their design. Therefore, the selection of the right data mining tool is a very challenging task.
Data mining techniques are not always precise, so they may lead to severe consequences in
certain conditions.
Data Mining in Market Basket Analysis: If you buy a specific group of products, then you are more likely
to buy another group of products. This technique may enable the retailer to understand the purchase
behavior of a buyer. This data may assist the retailer in understanding the requirements of the buyer
and altering the store's layout accordingly.
Data Mining in Education: It explores knowledge from the data generated in educational environments.
An institution can use data mining to make precise decisions and also to predict students' results. With
the results, the institution can concentrate on what to teach and how to teach.
Data Mining in Manufacturing Engineering: Data mining tools can be beneficial to find patterns in a
complex manufacturing process. They can also be used to forecast product development periods, costs,
and expectations, among other tasks.
Data Mining in Fraud detection:
Billions of dollars are lost to fraud. Traditional methods of fraud detection are time-consuming and
complex. Data mining provides meaningful patterns and turns data into information. An ideal fraud
detection system should protect the data of all users. Supervised methods use a collection of sample
records classified as fraudulent or non-fraudulent. A model is constructed from this data, and the
technique is applied to identify whether a new record is fraudulent or not.
Data Mining in Criminal Investigation: Apprehending a criminal is not a big deal, but bringing out the
truth from him is a very challenging task. Law enforcement may use data mining techniques to
investigate offenses, monitor suspected terrorist communications, etc. This technique also includes text
mining, and it seeks meaningful patterns in data, which is usually unstructured text. The information
collected from previous investigations is compared, and a model for lie detection is constructed.
Incomplete and noisy data: The process of extracting useful data from large volumes of data is data
mining. Data in the real world is heterogeneous, incomplete, and noisy. Data in huge quantities will
usually be inaccurate or unreliable. These problems may occur due to errors of the measuring
instruments or because of human errors. Suppose a retail chain collects the phone numbers of customers
who spend more than $500, and the accounting employees put the information into their system. A
person may make a digit mistake when entering a phone number, which results in incorrect data. Even
some customers may not be willing to disclose their phone numbers, which results in incomplete data.
The data could get changed due to human or system error. All these consequences (noisy and incomplete
data) make data mining challenging.
Data Distribution:
Real-world data is usually stored on various platforms in a distributed computing environment. It might
be in a database, in individual systems, or even on the internet. Practically, it is quite a tough task to
bring all the data into a centralized data repository, mainly due to organizational and technical concerns.
For example, various regional offices may have their own servers to store their data. It is not feasible to
store all the data from all the offices on a central server. Therefore, data mining requires the
development of tools and algorithms that allow the mining of distributed data.
Complex Data:
Real-world data is heterogeneous, and it could be multimedia data, including audio and video, images,
complex data, spatial data, time series, and so on. Managing these various types of data and extracting
useful information is a tough task. Most of the time, new technologies, new tools, and methodologies
would have to be refined to obtain specific information.
Performance:
The data mining system's performance relies primarily on the efficiency of algorithms and techniques
used. If the designed algorithm and techniques are not up to the mark, then the efficiency of the data
mining process will be affected adversely.
Data Privacy and Security: Data mining usually leads to serious issues in terms of data security,
governance, and privacy. For example, if a retailer analyzes the details of purchased items, it reveals
data about the buying habits and preferences of customers without their permission.
1. Classification:
This technique is used to obtain important and relevant information about data and metadata. It helps
to classify data into different classes.
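A minimal sketch of classification follows, assuming scikit-learn is installed. The toy features and labels (customer age, monthly visits, and a "high"/"low" spender class) are made up purely for illustration.

from sklearn.tree import DecisionTreeClassifier

# Minimal classification sketch (assumes scikit-learn is installed).
# Toy data: classify customers as "high" or "low" spenders from
# [age, monthly_visits]; the features and labels are invented for illustration.
X = [[25, 2], [40, 10], [35, 8], [22, 1], [50, 12], [30, 3]]
y = ["low", "high", "high", "low", "high", "low"]

model = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(model.predict([[45, 9], [23, 2]]))   # expected: ['high', 'low'] on this toy data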
2. Clustering:
Clustering analysis is a data mining technique used to identify data that are similar to each other.
This technique helps to recognize the differences and similarities between the data. Clustering is very
similar to classification, but it involves grouping chunks of data together based on their
similarities.
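A minimal clustering sketch follows, again assuming scikit-learn is available. The toy customer measurements are invented for illustration; k-means is just one of several possible clustering algorithms.

from sklearn.cluster import KMeans

# Minimal clustering sketch (assumes scikit-learn is installed).
# Toy data: group customers by [annual_spend, visits_per_month]; the values are
# invented for illustration.
X = [[200, 1], [220, 2], [1500, 10], [1600, 12], [210, 1], [1550, 11]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each customer
print(kmeans.cluster_centers_)  # centre of each discovered group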
3. Regression:
Regression analysis is a data mining process used to identify and analyze the relationship between
variables in the presence of other factors. It is used to define the probability of a specific
variable. Regression is primarily a form of planning and modeling. For example, we might use it to project
certain costs, depending on other factors such as availability, consumer demand, and competition.
Primarily, it gives the exact relationship between two or more variables in the given data set.
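A minimal regression sketch follows, fitting a straight line that relates consumer demand to cost so a future cost can be projected. The numbers are invented for illustration.

import numpy as np

# Minimal regression sketch: fit a straight line relating consumer demand to cost
# so future costs can be projected. The numbers are invented for illustration.
demand = np.array([100, 150, 200, 250, 300])
cost = np.array([1020, 1480, 2010, 2490, 3050])

slope, intercept = np.polyfit(demand, cost, deg=1)   # least-squares line
print(f"projected cost at demand 400: {slope * 400 + intercept:.2f}")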
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds hidden patterns
in the data set.
Association rules are if-then statements that help to show the probability of interactions between
data items within large data sets in different types of databases. Association rule mining has several
applications and is commonly used to help find sales correlations in transactional data or in medical
data sets.
The way the algorithm works is that you have various data, for example, a list of grocery items that you
have been buying for the last six months. It calculates the percentage of items being purchased together.
There are three major measurement techniques:
Lift:
This measurement technique measures the confidence relative to how often item B is purchased on its
own, i.e., how much more likely B is to be bought when A is bought than in general.
Support:
This measurement technique measures how often multiple items are purchased together, compared to the
overall dataset.
Confidence:
This measurement technique measures how often item B is purchased when item A is purchased as well.
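A minimal sketch of these three measures follows, computed over a toy set of market-basket transactions (the grocery items are invented for illustration) for the rule "bread -> milk".

# Minimal sketch of support, confidence, and lift computed from a toy set of
# market-basket transactions (the grocery items are invented for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]
n = len(transactions)

def freq(*items):
    """Fraction of transactions containing all the given items."""
    return sum(set(items) <= t for t in transactions) / n

# Rule: bread -> milk
support = freq("bread", "milk")        # how often both appear together
confidence = support / freq("bread")   # how often milk appears given bread
lift = confidence / freq("milk")       # confidence relative to milk's baseline

print(support, confidence, lift)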
5. Outlier detection:
This type of data mining technique relates to the observation of data items in the data set that do not
match an expected pattern or expected behavior. This technique may be used in various domains like
intrusion detection, fraud detection, etc. It is also known as Outlier Analysis or Outlier mining. An
outlier is a data point that diverges too much from the rest of the dataset. The majority of real-world
datasets have outliers. Outlier detection plays a significant role in the data mining field. Outlier
detection is valuable in numerous fields like network intrusion identification, credit or debit card
fraud detection, detecting outlying values in wireless sensor network data, etc.
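A minimal outlier-detection sketch follows, flagging values that fall more than two standard deviations from the mean. The transaction amounts are invented for illustration; real systems typically use more robust methods.

import statistics

# Minimal outlier-detection sketch: flag values that lie far from the mean
# (more than 2 standard deviations). The transaction amounts are invented.
amounts = [52, 48, 51, 47, 50, 49, 53, 500]   # 500 diverges from the rest

mean = statistics.mean(amounts)
stdev = statistics.stdev(amounts)
outliers = [x for x in amounts if abs(x - mean) > 2 * stdev]
print(outliers)   # [500] on this toy data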
6. Sequential Patterns:
The sequential pattern is a data mining technique specialized for evaluating sequential data to discover
sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the
importance of a sequence can be measured in terms of different criteria like length, occurrence
frequency, etc. In other words, this technique of data mining helps to discover or recognize similar
patterns in transaction data over a period of time.
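A minimal sketch of the idea follows: it counts how often one item is bought before another across a few hypothetical customer purchase histories, which is a very simplified form of sequential pattern counting.

from collections import Counter
from itertools import combinations

# Minimal sequential-pattern sketch: count how often one item is bought and another
# item is bought later (in that order) across customer purchase histories.
# The purchase sequences are invented for illustration.
sequences = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone", "charger"],
]

pair_counts = Counter()
for seq in sequences:
    # every ordered pair (a before b) that occurs in this customer's history
    seen = set(combinations(seq, 2))
    pair_counts.update(seen)

print(pair_counts.most_common(3))   # ('phone', 'charger') appears in all three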
7. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis, clustering,
classification, etc. It analyzes past events or instances in the right sequence to predict a future event.
Big data
What is Data?
The quantities, characters, or symbols on which operations are performed by a computer, which may be
stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical
recording media.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many
thousand flights per day, the generation of data reaches up to many petabytes.
Following are the types of Big Data:
Structured
Unstructured
Semi-structured
Structured
Any data that can be stored, accessed, and processed in the form of a fixed format is termed
'structured' data. Over time, talent in computer science has achieved greater success in
developing techniques for working with such kinds of data (where the format is well known in advance)
and also deriving value out of it. However, nowadays we are foreseeing issues when the size of such data
grows to a huge extent; typical sizes are in the range of multiple zettabytes.
Unstructured
Any data with an unknown form or structure is classified as unstructured data. In addition to its huge
size, unstructured data poses multiple challenges in terms of its processing for deriving value
out of it. A typical example of unstructured data is a heterogeneous data source containing a
combination of simple text files, images, videos, etc. Nowadays, organizations have a wealth of data
available with them, but unfortunately, they don't know how to derive value out of it since this data is in
its raw form or unstructured format.
Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as
structured in form, but it is actually not defined with, for example, a table definition in a relational
DBMS. An example of semi-structured data is data represented in an XML file.
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
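As a small illustration of working with such semi-structured records, the Python sketch below wraps the records shown above in a root element and extracts the fields from each <rec>.

import xml.etree.ElementTree as ET

# Minimal sketch: pulling structure out of the semi-structured XML records shown
# above by wrapping them in a root element and parsing each <rec>.
xml_data = """<people>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
</people>"""

root = ET.fromstring(xml_data)
for rec in root.findall("rec"):
    print(rec.findtext("name"), rec.findtext("sex"), int(rec.findtext("age")))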
Big Data is generally characterized by the following:
1. Volume
2. Variety
3. Velocity
4. Variability
(i) Volume – The name Big Data itself is related to a size which is enormous. Size of data plays a very
crucial role in determining value out of data. Also, whether a particular data can actually be considered
as a Big Data or not, is dependent upon the volume of data. Hence, 'Volume' is one characteristic which
needs to be considered while dealing with Big Data.
(ii) Variety – Variety refers to heterogeneous sources and the nature of data, both structured and
unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by
most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio,
etc. is also being considered in analysis applications. This variety of unstructured data poses
certain issues for storage, mining, and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is
generated and processed to meet the demands determines the real potential in the data. Big Data Velocity
deals with the speed at which data flows in from sources like business processes, application logs,
networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and
continuous.
(iv) Variability – This refers to the inconsistency which can be shown by the data at times, thus
hampering the process of being able to handle and manage the data effectively.
Access to social data from search engines and sites like Facebook and Twitter is enabling organizations
to fine-tune their business strategies.
Traditional customer feedback systems are getting replaced by new systems designed with Big Data
technologies. In these new systems, Big Data and natural language processing technologies are being
used to read and evaluate consumer responses.
Big Data technologies can be used for creating a staging area or landing zone for new data before
identifying what data should be moved to the data warehouse. In addition, such integration of Big Data
technologies and data warehouse helps an organization to offload infrequently accessed data.