DWDM 2 Unit Notes
(KOE093)
UNIT-II
A data warehouse is a collection of information gathered from multiple sources and stored under a unified schema at a single site. It is built using a variety of techniques, including the following processes:
1. Data Cleaning: Data cleaning is the process of preparing data for analysis by removing or correcting records that are incorrect, incomplete, irrelevant, duplicated, or irregularly formatted. Such data adds nothing to the analysis; left in place, it can disrupt the process or produce false results.
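A minimal cleaning sketch in Python using pandas is shown below; the sample columns (customer_id, email, signup_date) and the specific cleaning rules are illustrative assumptions, not prescribed by these notes.

# Hypothetical example: remove duplicates, drop rows missing a required
# field, and coerce an irregular date column (bad values become NaT).
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "email": ["a@x.com", "a@x.com", None, "c@x.com"],
    "signup_date": ["2023-01-05", "2023-01-05", "2023-01-09", "not a date"],
})

cleaned = raw.drop_duplicates()                 # duplicate records
cleaned = cleaned.dropna(subset=["email"])      # incomplete records
cleaned["signup_date"] = pd.to_datetime(
    cleaned["signup_date"], errors="coerce")    # irregular formatting
print(cleaned)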
2. Data Integration: Data integration is the process of combining data from different sources into a unified view. The process begins with ingestion and includes steps such as cleansing, ETL mapping, and transformation. Data integration ultimately enables analytics tools to produce effective, affordable business intelligence. In a typical data integration procedure, the client sends a request for data to the master server. The master server extracts the needed records from internal and external sources, integrates them into a single data set, and returns it to the client for use.
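As a rough sketch of this flow, the snippet below merges records from two hypothetical sources into one unified data set; the source names and columns are assumptions for illustration.

import pandas as pd

# "Internal" source: order records (illustrative)
orders = pd.DataFrame({"customer_id": [1, 2], "amount": [120.0, 75.5]})
# "External" source: customer master data (illustrative)
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["East", "West"]})

# Integrate both sources into a single data set, much as the master
# server would before returning the result to the client
unified = orders.merge(customers, on="customer_id", how="left")
print(unified)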
3. Data Transformation: The process of converting data from one format or structure into another is referred to as data transformation. Data transformation is essential for activities such as data integration and data management. It serves several purposes: you can change data types to match the needs of your project, and enrich or aggregate the data after removing invalid or duplicate records. Generally, the process consists of two stages.
In the first stage, you should:
Perform data discovery to identify the sources and data types involved.
Map the data to determine how individual fields are mapped, edited, filtered, and stored.
In the second stage, you must:
Extract the data from the original source. Sources can range from a connected device to a structured store such as a database, or streaming sources such as telemetry or log files from clients using your web application.
Load the data into the target, which may be a database or a data warehouse that manages structured and unstructured records.
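A minimal sketch of the two stages, assuming hypothetical source fields (cust, amt) and target fields (customer_id, amount):

# Stage 1: data discovery and mapping -- decide how each source field
# maps to a target field and which type it should carry.
FIELD_MAP = {"cust": ("customer_id", int), "amt": ("amount", float)}

# Stage 2: extract rows from the (illustrative) source and apply the
# mapping, renaming fields and converting record types.
source_rows = [{"cust": "1", "amt": "120.0"}, {"cust": "2", "amt": "75.5"}]

def transform(row):
    return {target: cast(row[field])
            for field, (target, cast) in FIELD_MAP.items()}

transformed = [transform(r) for r in source_rows]
print(transformed)  # [{'customer_id': 1, 'amount': 120.0}, ...]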
4. Data Loading: Data loading is the process of copying data from a file, folder, or application into a database or similar system. It is usually done by copying digital data from the source and loading the records into a data warehouse or processing tool. Data loading is one step of the broader extract-and-load process, and the data is typically loaded in a format different from that of its original source location.
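The sketch below loads previously extracted records into a SQLite table standing in for the warehouse; the table name and columns are illustrative assumptions.

import sqlite3

rows = [(1, 120.0), (2, 75.5)]   # records extracted earlier (illustrative)

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)  # copy the data in
conn.commit()
conn.close()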
5. Data Refreshing: In this process, the data stored in the warehouse is refreshed periodically so that it maintains its integrity. A data warehouse models data as multidimensional structures known as "data cubes," in which each dimension represents an attribute or a set of attributes in the schema and each cell stores a measure value. Data is gathered from various sources, such as hospitals, banks, and other organizations, and passes through a process called ETL (Extract, Transform, Load):
Extract: reads the data from the databases of the various sources.
Transform: converts the data stored in the source databases into data cubes so that it can be loaded into the warehouse.
Load: writes the transformed data into the data warehouse.
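Putting the three steps together, here is a hedged end-to-end sketch in Python; the source data and the roll-up used as a stand-in for a data cube cell are illustrative.

import sqlite3
from collections import defaultdict

def extract():
    # read the data from the (illustrative) source databases
    return [("hospital", 10), ("bank", 20), ("hospital", 5)]

def transform(rows):
    # roll the rows up by the source dimension, the way a cube cell
    # stores an aggregated value per dimension member
    cube = defaultdict(int)
    for source, value in rows:
        cube[source] += value
    return cube

def load(cube):
    # write the transformed data into the warehouse
    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS cube (source TEXT, total INTEGER)")
    conn.executemany("INSERT INTO cube VALUES (?, ?)", list(cube.items()))
    conn.commit()
    conn.close()

load(transform(extract()))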
Challenges of Data Warehousing
Data quality: Ensuring data quality in a data warehouse is a major challenge. Data coming from various sources may contain inconsistencies, duplicates, and inaccuracies, which affect the overall quality of the data in the warehouse.
Data integration: Integrating data from various sources into a data warehouse can be
challenging, especially when dealing with data that is structured differently or has different
formats.
Data consistency: Maintaining data consistency across various data sources and over time
is a challenge. Changes in the source systems can affect the consistency of the data in the
warehouse.
Data governance: Managing the access, use, and security of the data in the warehouse is
another challenge. Ensuring compliance with legal and regulatory requirements can also be
challenging.
Performance: Ensuring that the data warehouse performs efficiently and delivers fast
query response times can be a challenge, particularly as the volume of data increases over
time.
Data modeling: Designing an effective data model that reflects the needs of the
organization and optimizes query performance can be a challenge.
Data security: Ensuring the security of the data in the warehouse is a critical challenge,
particularly as the data warehouse contains sensitive information.
Resource allocation: Building and maintaining a data warehouse requires significant
resources, including skilled personnel, hardware, and software, which can be a challenge to
allocate and manage effectively.
Advantages:
1. Improved decision making: Data warehousing and data mining can help to improve
decision making by providing insights and information that would otherwise be difficult
or impossible to obtain.
2. Increased efficiency: Data warehousing and data mining can help to increase
efficiency by automating the process of extracting, cleaning, and analyzing data.
3. Improved data quality: Data warehousing and data mining can help to improve the
quality of data by identifying and correcting errors, inconsistencies, and missing data.
4. Improved data security: Data warehousing and data mining can help to improve data
security by providing a central repository for storing data and controlling access to that
data.
5. Improved scalability: Data warehousing and data mining can help to improve
scalability by providing a way to manage and analyze large amounts of data.
Hardware Selection
Here are some general guidelines for hardware selection; they are not entirely specific to data warehouse hardware.
Scalability. As your data warehouse grows in the number of users, the number of queries, and the complexity of those queries, ensure that the selected hardware can be scaled up.
Support. Vendor support is crucial for hardware maintenance. Make sure that the support
from the hardware vendor is at the highest possible level.
Vendor Reference. It is important to check vendor references with other sites using
hardware from this vendor. You do not want to be caught with your data warehouse being
down because of hardware malfunctions when the CEO wants some critical analysis to be
completed.
Vendor Stability. Check on the stability and staying power of the vendor.
Client-Server Model
The client-server model is a distributed application structure that partitions tasks or workloads between the providers of a resource or service, called servers, and the service requesters, called clients. In this architecture, when a client computer sends a request for data to the server over the Internet, the server accepts the request, processes it, and delivers the requested data packets back to the client. Clients do not share any of their resources. Examples of the client-server model include email and the World Wide Web.
How the Client-Server Model Works
To understand the client-server model, it helps to look at how the Internet works through web browsers. A solid grasp of this model is a foundation for working with web technologies with ease.
Client: In everyday usage, a client is a person or an organization using a particular service. Similarly, in the digital world a client is a computer (host) capable of receiving information or using a particular service from the service providers (servers).
Servers: Likewise, in everyday usage a server is a person or medium that serves something. In the digital world, a server is a remote computer that provides information (data) or access to particular services.
So, it is basically the client requesting something and the server serving it, as long as it is present in the database.
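A bare-bones sketch of this request/response exchange using Python sockets; the port number and payloads are arbitrary assumptions.

import socket
import threading
import time

def server():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 9000))          # arbitrary local port
    srv.listen(1)
    conn, _ = srv.accept()                 # accept the client's request
    request = conn.recv(1024)
    conn.sendall(b"data for: " + request)  # deliver the requested data
    conn.close()
    srv.close()

threading.Thread(target=server, daemon=True).start()
time.sleep(0.2)                            # give the server time to start

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", 9000))
cli.sendall(b"report.csv")                 # client sends a request for data
print(cli.recv(1024))                      # b'data for: report.csv'
cli.close()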
PARALLEL PROCESSORS
DEFINITION
The processing of large amounts of data is typical of data warehouse environments. Depending on the available hardware resources, sooner or later a point is reached where a job can no longer be processed on a single processor, or can no longer be represented by a single process. This is where parallel processing comes in.
Parallel processing
Parallel processing is a method of computing in which two or more processors (CPUs) handle separate parts of an overall task. Breaking a task into parts and distributing them among multiple processors helps reduce the time needed to run a program. Any system that has more than one CPU can perform parallel processing, including the multi-core processors commonly found in computers today.
Parallel processing is commonly used to perform complex tasks and computations. Data
scientists will commonly make use of parallel processing for compute and data-intensive
tasks.
Typically a computer scientist will divide a complex task into multiple parts with a software
tool and assign each part to a processor, then each processor will solve its part, and the data is
reassembled by a software tool to read the solution or execute the task.
Typically each processor will operate normally and will perform operations in parallel as
instructed, pulling data from the computer’s memory. Processors will also rely on software to
communicate with each other so they can stay in sync concerning changes in data values.
Assuming all the processors remain in sync with one another, at the end of a task, software
will fit all the data pieces together.
Computers without multiple processors can still be used in parallel processing if they are
networked together to form a cluster.
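The divide-compute-reassemble pattern described above can be sketched with Python's multiprocessing module; the four-way split and the summing task are arbitrary choices for illustration.

from multiprocessing import Pool

def solve_part(chunk):
    # each processor solves its assigned part of the task
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]   # divide the task into parts
    with Pool(processes=4) as pool:
        partials = pool.map(solve_part, chunks)
    print(sum(partials))   # software reassembles the partial results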
Clustered Systems
Clustered systems are similar to parallel systems in that both have multiple CPUs. However, a major difference is that clustered systems are created by merging two or more individual computer systems. Basically, they are independent computer systems with common storage, and the systems work together.
Clustered systems are a combination of hardware clusters and software clusters. Hardware clusters allow high-performance disks to be shared between the systems, while software clusters make all the systems work together.
Each node in a clustered system contains the cluster software. This software monitors the cluster and makes sure it is working as required. If any node in the clustered system fails, the remaining nodes take control of its storage and resources and try to restart it.
There are primarily two types of clustered systems, i.e., asymmetric clustering systems and symmetric clustering systems. Details about these are given as follows −
Asymmetric clustering system: In this system, one of the nodes in the clustered system is in hot standby mode while all the others run the required applications. Hot standby mode is a failsafe in which a hot standby node is kept as part of the system. The hot standby node continuously monitors the active server and, if that server fails, takes its place.
Symmetric clustering system: In a symmetric clustering system, two or more nodes all run applications and monitor each other. This is more efficient than an asymmetric system, as it uses all the hardware and does not keep a node merely as a hot standby.
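A toy sketch of the hot-standby idea, in which the standby watches a heartbeat timestamp and takes over when it goes stale; real cluster software uses dedicated heartbeat protocols, so every detail here is an illustrative assumption.

import time

HEARTBEAT_TIMEOUT = 2.0   # seconds without a heartbeat before failover

class ActiveNode:
    def __init__(self):
        self.last_heartbeat = time.time()

    def beat(self):
        # called periodically while the active node is healthy
        self.last_heartbeat = time.time()

def standby_monitor(active):
    # the hot standby continuously monitors the active server
    if time.time() - active.last_heartbeat > HEARTBEAT_TIMEOUT:
        return "standby takes over the failed node's role"
    return "active node healthy"

node = ActiveNode()
node.beat()
print(standby_monitor(node))   # 'active node healthy'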
Clustered systems can be used for many different purposes, such as scientific calculations and web support. The major attributes that clustered systems embody are −
Performance
Clustered systems result in high performance as they contain two or more individual
computer systems merged together. These work as a parallel unit and result in much
better performance for the system.
Fault Tolerance
Clustered systems are quite fault tolerant and the loss of one node does not result in
the loss of the system. They may even contain one or more nodes in hot standby mode
which allows them to take the place of failed nodes.
Scalability
Clustered systems are quite scalable as it is easy to add a new node to the system.
There is no need to take the entire cluster down to add a new node.
Distributed Database System
A distributed database is basically a database that is not limited to one system; it is spread over different sites, i.e., on multiple computers or over a network of computers. A distributed database system is located on various sites that do not share physical components. This may be required when a particular database needs to be accessed by users around the globe, and it must be managed so that, to the users, it looks like one single database.
Types:
1. Homogeneous Database:
In a homogeneous database, all sites store the database identically. The operating system, database management system, and data structures used are the same at all sites. Hence, homogeneous databases are easy to manage.
2. Heterogeneous Database:
In a heterogeneous distributed database, different sites may use different schemas and software, which can lead to problems in query processing and transactions. A particular site might be completely unaware of the other sites. Different computers may use different operating systems and different database applications, and may even use different data models for the database. Hence, translations are required for the sites to communicate.
Distributed Data Storage :
There are 2 ways in which data can be stored on different sites. These are:
1. Replication –
In this approach, an entire relation is stored redundantly at two or more sites. If the entire database is available at all sites, it is a fully redundant database. Hence, in replication, systems maintain copies of the data.
This is advantageous as it increases the availability of data at different sites. Also, now
query requests can be processed in parallel.
However, it has certain disadvantages as well. The data needs to be constantly updated: any change made at one site must be recorded at every site where the relation is stored, or else it may lead to inconsistency. This is a lot of overhead. Concurrency control also becomes far more complex, as concurrent access now needs to be checked across a number of sites.
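A minimal sketch of the replication bookkeeping: an update made anywhere must be applied to the copy at every site. The in-memory dictionaries standing in for sites are purely illustrative.

# Each site holds its own copy of the relation (illustrative stand-ins)
sites = {"site_a": {}, "site_b": {}, "site_c": {}}

def replicated_write(key, value):
    # record the change at every site where the relation is stored,
    # otherwise the copies drift into inconsistency
    for store in sites.values():
        store[key] = value

replicated_write("account:42", 1000)
print(all(s["account:42"] == 1000 for s in sites.values()))   # True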
2. Fragmentation –
In this approach, the relations are fragmented (i.e., divided into smaller parts), and each fragment is stored at the site where it is required. It must be ensured that the fragments can be used to reconstruct the original relation (i.e., there is no loss of data).
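A sketch of horizontal fragmentation under the same caveat: the relation is split by an assumed region attribute, each fragment lives at the site that needs it, and their union reconstructs the original relation without loss.

# Original relation: (region, order_id) tuples (illustrative)
rows = [("east", 1), ("west", 2), ("east", 3)]

# Fragment horizontally by region and place each fragment at its site
fragments = {
    "site_east": [r for r in rows if r[0] == "east"],
    "site_west": [r for r in rows if r[0] == "west"],
}

# Reconstruction: the union of the fragments recovers the relation
reconstructed = fragments["site_east"] + fragments["site_west"]
assert sorted(reconstructed) == sorted(rows)   # no loss of data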
Applications of Distributed Database:
It is used in corporate management information systems.
It is used in multimedia applications.
It is used in military control systems, hotel chains, etc.
It is also used in manufacturing control systems.
In the past, data warehouse schemas were often strictly enforced across an enterprise, but in modern implementations, where storage is increasingly inexpensive, schemas have become less constrained. Despite this loosening, or sometimes total abandonment, of data warehouse schemas, knowledge of the foundational schema designs remains important, both for maintaining legacy resources and for creating modern data warehouse designs that learn from the past.
The basic components of all data warehouse schemas are fact and dimension tables. Different combinations of these two central elements compose almost all data warehouse schema designs.
Fact Table
A fact table aggregates the metrics, measurements, or facts about business processes. Fact tables are connected to dimension tables to form a schema architecture that represents how data relates within the data warehouse. A fact table stores the primary keys of the dimension tables as foreign keys.
Dimension Table
Dimension tables are used to store data attributes, or dimensions. As mentioned above, the primary key of a dimension table is stored as a foreign key in the fact table. Dimension tables are not joined to each other directly; instead, they are associated through the central fact table.
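A small sketch of this relationship in SQL, executed here through Python's sqlite3; the table and column names are illustrative assumptions.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
-- the fact table stores each dimension's primary key as a foreign key
CREATE TABLE fact_sales (
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL
);
""")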
3 Types of Schema Used in Data Warehouses
History presents us with three prominent types of data warehouse schema known as Star
Schema, Snowflake Schema, and Galaxy Schema. Each of these data warehouse schemas
has unique design constraints and describes a different organizational structure for how data
is stored and how it relates to other data within the data warehouse
Star Schema
The star schema is historically one of the most straightforward data warehouse designs. It follows some distinct design parameters, such as permitting only one central fact table and a handful of dimension tables joined to it. Under these constraints, the schema resembles a star, with one central table and the dimension tables radiating around it (which is where the star schema gets its name).
Star data warehouse schemas create a denormalized database that enables quick query responses
The primary key in each dimension table is joined to the fact table by a foreign key
Each dimension in the star schema maps to exactly one dimension table
Dimension tables within a star schema are not connected to each other directly
The star schema creates denormalized dimension tables
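Continuing the illustrative tables sketched earlier, a typical star-schema query joins the central fact table directly to each dimension table; the dimensions never join to one another.

# Hypothetical star-schema query over the tables sketched above
query = """
SELECT d.day, p.name, SUM(f.amount) AS total
FROM fact_sales f
JOIN dim_date    d ON f.date_id    = d.date_id
JOIN dim_product p ON f.product_id = p.product_id
GROUP BY d.day, p.name;
"""
# conn.execute(query).fetchall() returns one aggregated row per day/product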
Snowflake Schema
The Snowflake Schema is a data warehouse schema that encompasses a logical arrangement
of dimension tables. This data warehouse schema builds on the star schema by adding
additional sub-dimension tables that relate to first-order dimension tables joined to the fact
table.
Just like the relationship between a foreign key in the fact table and the primary key in a dimension table, under the snowflake approach a primary key in a sub-dimension table relates to a foreign key within the higher-order dimension table.
Snowflake schema creates normalized dimension tables – a database structuring strategy that
organizes tables to reduce redundancy. The purpose of normalization is to eliminate any
redundant data to reduce overhead.
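A sketch of this normalization step, again with illustrative names: category attributes move out of the product dimension into a sub-dimension table whose primary key the higher-order dimension references.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, label TEXT);
-- the first-order dimension now references the sub-dimension
CREATE TABLE dim_product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
""")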
Galaxy Schema
The galaxy data warehouse schema, also known as a fact constellation schema, acts as the next iteration of the data warehouse schema. Unlike the star and snowflake schemas, the galaxy schema uses multiple fact tables connected to shared, normalized dimension tables. A galaxy schema can be thought of as interlinked star schemas that are completely normalized, avoiding any kind of redundancy or inconsistency of data.
Characteristics of the galaxy schema: it uses multiple fact tables, the fact tables share normalized dimension tables, and data redundancy and inconsistency are avoided.
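As a sketch of these characteristics, two illustrative fact tables can share one normalized dimension table:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT);
-- two fact tables share the same dimension table
CREATE TABLE fact_sales    (date_id INTEGER REFERENCES dim_date(date_id), amount REAL);
CREATE TABLE fact_shipping (date_id INTEGER REFERENCES dim_date(date_id), cost REAL);
""")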