DW Unit III Notes

This document discusses metadata, data marts, and partition strategies, emphasizing the importance of metadata in data management and its various types, roles, and challenges. It outlines the concept of data marts, their types, and the reasons for their creation, along with strategies for designing cost-effective data marts. The document also highlights the significance of metadata repositories and the benefits they provide in managing and improving data quality and accessibility.

Uploaded by

Preethika
Copyright
© All Rights Reserved

UNIT III

META DATA, DATA MART AND PARTITION STRATEGY

Meta Data – Categories of Metadata – Role of Metadata – Metadata Repository –
Challenges for Metadata Management – Data Mart – Need of Data Mart – Cost Effective
Data Mart – Designing Data Marts – Cost of Data Marts – Partitioning Strategy – Vertical
Partition – Normalization – Row Splitting – Horizontal Partition

Meta Data
Metadata is simply defined as data about data. The data that is used to represent other
data is known as metadata. For example, the index of a book serves as metadata for
the contents of the book. In other words, metadata is the summarized data that leads us
to detailed data. In terms of a data warehouse, we can define metadata as follows.
• Metadata is the road-map to a data warehouse.
• Metadata in a data warehouse defines the warehouse objects.
• Metadata acts as a directory. This directory helps the decision support system to
locate the contents of a data warehouse.
Metadata is data that describes and contextualizes other data. It provides information
about the content, format, structure, and other characteristics of data, and can be used
to improve the organization, discoverability, and accessibility of data.
Metadata can be stored in various forms, such as text, XML, or RDF, and can be
organized using metadata standards and schemas. There are many metadata standards
that have been developed to facilitate the creation and management of metadata, such
as Dublin Core, schema.org, and the Metadata Encoding and Transmission Standard
(METS). Metadata schemas define the structure and format of metadata and provide a
consistent framework for organizing and describing data.
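As a rough illustration of storing metadata as XML under a standard, the sketch below serializes a record using Dublin Core element names. The record values and the `metadata` wrapper element are assumptions made for the example; only the Dublin Core namespace and element names come from the standard.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dublin_core_xml(record):
    """Serialize a dict of Dublin Core element names to an XML string."""
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for element, value in record.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical record describing a data extract.
record = {
    "title": "Quarterly sales extract",
    "creator": "Data engineering team",
    "format": "text/csv",
}
dc_xml = to_dublin_core_xml(record)
print(dc_xml)
```

Because every element carries the same namespace, another system that understands Dublin Core can read the record without knowing anything else about the producer.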
Metadata can be used in a variety of contexts, such as libraries, museums, archives, and
online platforms. It can be used to improve the discoverability and ranking of content
in search engines and to provide context and additional information about search results.
Metadata can also support data governance by providing information about the
ownership, use, and access controls of data, and can facilitate interoperability by
providing information about the content, format, and structure of data, and by enabling
the exchange of data between different systems and applications. Metadata can also
support data preservation by providing information about the context, provenance, and
preservation needs of data, and can support data visualization by providing information
about the data’s structure and content, and by enabling the creation of interactive and
customizable visualizations.
Several Examples of Metadata:
Here are a few common examples of metadata:
1. File metadata: This includes information about a file, such as its name, size,
type, and creation date.
2. Image metadata: This includes information about an image, such as its
resolution, color depth, and camera settings.
3. Music metadata: This includes information about a piece of music, such as its
title, artist, album, and genre.
4. Video metadata: This includes information about a video, such as its length,
resolution, and frame rate.
5. Document metadata: This includes information about a document, such as its
author, title, and creation date.
6. Database metadata: This includes information about a database, such as its
structure, tables, and fields.
7. Web metadata: This includes information about a web page, such as its title,
keywords, and description.
Metadata is an important part of many different types of data and can be used to provide
valuable context and information about the data it relates to.
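File metadata is the easiest of these to inspect directly. The following minimal sketch reads a few fields with the Python standard library; the helper name `file_metadata` and the choice of fields are assumptions for illustration, and the modification time is reported because creation time is not available portably.

```python
import os
import tempfile
from datetime import datetime, timezone


def file_metadata(path):
    """Return a few metadata fields for the file at `path`."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "type": os.path.splitext(path)[1].lstrip(".") or "unknown",
        # Modification time is used because creation time is not portable.
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
    }


# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hello")
    sample_path = handle.name

sample_meta = file_metadata(sample_path)
print(sample_meta)
os.remove(sample_path)
```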
Categories of Metadata
There are many types of metadata that can be used to describe different aspects of data,
such as its content, format, structure, and provenance. Some common types of metadata
include:
1. Descriptive metadata: This type of metadata provides information about the
content, structure, and format of data, and may include elements such as title,
author, subject, and keywords. Descriptive metadata helps to identify and
describe the content of data and can be used to improve the discoverability of
data through search engines and other tools.
2. Administrative metadata: This type of metadata provides information about
the management and technical characteristics of data, and may include elements
such as file format, size, and creation date. Administrative metadata helps to
manage and maintain data over time and can be used to support data governance
and preservation.
3. Structural metadata: This type of metadata provides information about the
relationships and organization of data, and may include elements such as links,
tables of contents, and indices. Structural metadata helps to organize and connect
data and can be used to facilitate the navigation and discovery of data.
4. Provenance metadata: This type of metadata provides information about the
history and origin of data, and may include elements such as the creator, date of
creation, and sources of data. Provenance metadata helps to provide context and
credibility to data and can be used to support data governance and preservation.
5. Rights metadata: This type of metadata provides information about the
ownership, licensing, and access controls of data, and may include elements such
as copyright, permissions, and terms of use. Rights metadata helps to manage
and protect the intellectual property rights of data and can be used to support data
governance and compliance.
6. Educational metadata: This type of metadata provides information about the
educational value and learning objectives of data, and may include elements such
as learning outcomes, educational levels, and competencies. Educational
metadata can be used to support the discovery and use of educational resources,
and to support the design and evaluation of learning environments.
Role of Metadata
Metadata has a very important role in a data warehouse. Although metadata is distinct
from the warehouse data itself, it supports almost every warehouse function. The
various roles of metadata are explained below.
• Metadata acts as a directory.
• This directory helps the decision support system to locate the contents of the data
warehouse.
• Metadata helps the decision support system map data when it is transformed from
the operational environment to the data warehouse environment.
• Metadata helps in summarization between current detailed data and highly
summarized data.
• Metadata also helps in summarization between lightly summarized data and highly
summarized data.
• Metadata is used for query tools.
• Metadata is used in extraction and cleansing tools.
• Metadata is used in reporting tools.
• Metadata is used in transformation tools.
• Metadata plays an important role in loading functions.
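The directory and mapping roles listed above can be sketched with a small in-memory catalog. Every table name, source name, and field in `CATALOG` is hypothetical; the point is only that a tool asks the metadata where data lives and how it was derived rather than hard-coding it.

```python
# Hypothetical metadata entries: logical name -> location and derivation.
CATALOG = {
    "daily_sales": {
        "table": "dw.fact_sales_daily",
        "source": "pos.transactions",
        "transformation": "sum(value) grouped by store_id, sales_date",
        "level": "lightly summarized",
    },
    "monthly_sales": {
        "table": "dw.agg_sales_monthly",
        "source": "dw.fact_sales_daily",
        "transformation": "sum(value) grouped by store_id, month",
        "level": "highly summarized",
    },
}


def locate(logical_name):
    """Directory role: return the physical table for a logical name."""
    return CATALOG[logical_name]["table"]


def lineage(logical_name):
    """Mapping role: walk back from a summary to its original source."""
    by_table = {entry["table"]: name for name, entry in CATALOG.items()}
    chain = [logical_name]
    source = CATALOG[logical_name]["source"]
    while source in by_table:
        name = by_table[source]
        chain.append(name)
        source = CATALOG[name]["source"]
    chain.append(source)
    return chain


print(locate("monthly_sales"))
print(lineage("monthly_sales"))
```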
The following diagram shows the roles of metadata.

Metadata Repository
A metadata repository is a database or other storage mechanism that is used to store
metadata about data. A metadata repository can be used to manage, organize, and
maintain metadata in a consistent and structured manner, and can facilitate the
discovery, access, and use of data.
A metadata repository may contain metadata about a variety of types of data, such as
documents, images, audio and video files, and other types of digital content. The
metadata in a metadata repository may include information about the content, format,
structure, and other characteristics of data, and may be organized using metadata
standards and schemas.
There are many types of metadata repositories, ranging from simple file systems or
spreadsheets to complex database systems. The choice of metadata repository will
depend on the needs and requirements of the organization, as well as the size and
complexity of the data that is being managed.
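At the simple end of that range, a metadata repository can be a single database table. The sketch below uses an in-memory SQLite database; the table layout, column names, and sample rows are assumptions chosen for illustration, not a standard repository schema.

```python
import sqlite3

repo = sqlite3.connect(":memory:")
repo.execute(
    """
    CREATE TABLE column_metadata (
        table_name  TEXT NOT NULL,
        column_name TEXT NOT NULL,
        data_type   TEXT NOT NULL,
        description TEXT,
        owner       TEXT,
        PRIMARY KEY (table_name, column_name)
    )
    """
)
# Hypothetical entries describing two warehouse tables.
repo.executemany(
    "INSERT INTO column_metadata VALUES (?, ?, ?, ?, ?)",
    [
        ("fact_sales", "product_id", "INTEGER", "Product sold", "sales_team"),
        ("fact_sales", "value", "REAL", "Sale amount", "sales_team"),
        ("dim_store", "store_id", "INTEGER", "Store identifier", "ops_team"),
    ],
)


def describe(table_name):
    """Return (column, type) pairs recorded for one table."""
    cursor = repo.execute(
        "SELECT column_name, data_type FROM column_metadata "
        "WHERE table_name = ? ORDER BY column_name",
        (table_name,),
    )
    return cursor.fetchall()


print(describe("fact_sales"))
```

The primary key on table and column name prevents two conflicting descriptions of the same column, which is one small way a repository keeps metadata consistent.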
Metadata repositories can be used in a variety of contexts, such as libraries, museums,
archives, and online platforms. They can be used to improve the discoverability and
ranking of content in search engines, and to provide context and additional information
about search results. Metadata repositories can also support data governance by
providing information about the ownership, use, and access controls of data, and can
facilitate interoperability by providing information about the content, format, and
structure of data, and by enabling the exchange of data between different systems and
applications. Metadata repositories can also support data preservation by providing
information about the context, provenance, and preservation needs of data, and can
support data visualization by providing information about the data’s structure and
content, and by enabling the creation of interactive and customizable visualizations.
Benefits of Metadata Repository
A metadata repository is a centralized database or system that is used to store and
manage metadata. Some of the benefits of using a metadata repository include:
1. Improved data quality: A metadata repository can help ensure that metadata is
consistently structured and accurate, which can improve the overall quality of
the data.
2. Increased data accessibility: A metadata repository can make it easier for users
to access and understand the data, by providing context and information about
the data.
3. Enhanced data integration: A metadata repository can facilitate data
integration by providing a common place to store and manage metadata from
multiple sources.
4. Improved data governance: A metadata repository can help enforce metadata
standards and policies, making it easier to ensure that data is being used and
managed appropriately.
5. Enhanced data security: A metadata repository can help protect the privacy and
security of metadata, by providing controls to restrict access to sensitive or
confidential information.
Metadata repositories can provide many benefits in terms of improving the quality,
accessibility, and management of data.
Challenges for Metadata Management
There are several challenges that can arise when managing metadata:
1. Lack of standardization: Different organizations or systems may use different
standards or conventions for metadata, which can make it difficult to effectively
manage metadata across different sources.
2. Data quality: Poorly structured or incorrect metadata can lead to problems with
data quality, making it more difficult to use and understand the data.
3. Data integration: When integrating data from multiple sources, it can be
challenging to ensure that the metadata is consistent and aligned across the
different sources.
4. Data governance: Establishing and enforcing metadata standards and policies
can be difficult, especially in large organizations with multiple stakeholders.
5. Data security: Ensuring the security and privacy of metadata can be a challenge,
especially when working with sensitive or confidential information.
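The standardization and data quality challenges can be partly addressed by checking each record against an agreed set of fields. The sketch below assumes a hypothetical house standard of four required fields; real standards define many more rules than presence checks.

```python
# Assumed house standard: every record must carry these fields.
REQUIRED_FIELDS = {"title", "owner", "format", "created"}


def validate(record):
    """Return a list of problems found in one metadata record."""
    present = set(record)
    problems = [
        f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - present)
    ]
    problems += [
        f"empty field: {name}"
        for name in sorted(REQUIRED_FIELDS & present)
        if not str(record[name]).strip()
    ]
    return problems


# Hypothetical record with one missing and one blank field.
issues = validate({"title": "Sales extract", "owner": " ", "format": "csv"})
print(issues)
```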

Metadata Management Software:


Software for managing metadata makes it easier to assess, curate, collect, and store
metadata. To enable data monitoring and accountability, organizations should
automate data management. Examples of this kind of software include the following:
• SAP PowerDesigner by SAP: This data management system has a good level
of stability. It is recognised for its ability to serve as a platform for model testing.
• SAP Information Steward by SAP: This solution is valued for the data insights it
provides.
• IBM InfoSphere Information Governance Catalog by IBM: The ability to
use Open IGC to build unique assets and data lineages is a key feature of this
system.
• Alation Data Catalog by Alation: This provides a user-friendly, intuitive
interface. It is valued for the queries it can publish in Structured Query Language
(SQL).
• Informatica Enterprise Data Catalog by Informatica: The technology used
by this solution, which can both scan and gather information from diverse
sources, is highly respected.
Effective metadata management requires careful planning and coordination, as well as
robust processes and tools to ensure the quality, consistency, and security of the
metadata.

Data Mart
A data mart is a subset of an organizational data store, generally oriented to a
specific purpose or primary data subject, that may be distributed to support business
needs. Data marts are analytical data stores designed to focus on particular business
functions for a specific community within an organization. Data marts are usually derived
from subsets of data in a data warehouse, though in the bottom-up data warehouse design
methodology, the data warehouse is created from the union of organizational data marts.
The fundamental use of a data mart is in Business Intelligence (BI) applications. BI is
used to gather, store, access, and analyze data. A data mart can be used by smaller
businesses to make use of the data they have accumulated, since it is less expensive than
implementing a data warehouse.

Reasons for creating a data mart


o Creates a collective view of data for a group of users
o Easy access to frequently needed data
o Ease of creation
o Improves end-user response time
o Lower cost than implementing a complete data warehouse
o Potential users are more clearly defined than in a comprehensive data
warehouse
o It contains only essential business data and is less cluttered.
Types of Data Marts
There are mainly two approaches to designing data marts. These approaches are
o Dependent Data Marts
o Independent Data Marts
Dependent Data Marts
A dependent data mart is a logical or physical subset of a larger data warehouse.
In this technique, the data marts are treated as subsets of the data warehouse: a data
warehouse is created first, and the various data marts are then created from it. These
data marts depend on the data warehouse and extract the essential data from it. Because
the data warehouse feeds the data marts, there is no need for data mart integration.
This is also known as a top-down approach.
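A dependent data mart can be pictured as a table populated only from the warehouse. The sketch below uses SQLite and hypothetical table and column names (`warehouse_sales`, `mart_grocery`); a real warehouse would use its own load mechanism rather than a single `CREATE TABLE ... AS SELECT`.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.execute(
    "CREATE TABLE warehouse_sales (product_group TEXT, store_id INTEGER, value REAL)"
)
# Hypothetical warehouse rows covering two product groups.
dw.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [("grocery", 16, 3.67), ("grocery", 64, 2.50), ("apparel", 16, 5.33)],
)

# The mart is derived from the warehouse alone, so it needs no separate integration.
dw.execute(
    "CREATE TABLE mart_grocery AS "
    "SELECT store_id, value FROM warehouse_sales WHERE product_group = 'grocery'"
)

mart_rows = dw.execute(
    "SELECT store_id, value FROM mart_grocery ORDER BY store_id"
).fetchall()
print(mart_rows)
```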

Independent Data Marts


The second approach is independent data marts (IDM). Here, independent data marts are
created first, and then a data warehouse is designed using these multiple independent
data marts. Because all the data marts are designed independently, the integration of
data marts is required. It is also termed a bottom-up approach, as the data marts are
integrated to develop a data warehouse.
Other than these two categories, one more type exists, called "Hybrid Data
Marts."
Hybrid Data Marts
A hybrid data mart allows us to combine input from sources other than a data warehouse.
This can be helpful in many situations, especially when ad hoc integrations are needed,
such as after a new group or product is added to the organization.

Cost Effective Data Mart


Follow the steps given below to make data marting cost-effective −
• Identify the Functional Splits
• Identify User Access Tool Requirements
• Identify Access Control Issues
Identify the Functional Splits
In this step, we determine whether the organization has natural functional splits. We
look for departmental splits, and we determine whether the way in which departments use
information tends to be in isolation from the rest of the organization. Let's have an
example.
Consider a retail organization, where each merchant is accountable for maximizing the
sales of a group of products. For this, the following information is valuable −
• sales transactions on a daily basis
• sales forecast on a weekly basis
• stock position on a daily basis
• stock movements on a daily basis
As the merchant is not interested in products they are not dealing with, the data
mart is a subset of the data dealing with the product group of interest. The following
diagram shows data marting for different users.

Given below are the issues to be taken into account while determining the functional
split −
• The structure of the department may change.
• The products might switch from one department to another.
• The merchant could query the sales trend of other products to analyze what is
happening to the sales.
We need to determine the business benefits and technical feasibility of using a data mart.
Identify User Access Tool Requirements
We need data marts to support user access tools that require internal data structures.
The data in such structures is outside the control of the data warehouse but needs to be
populated and updated on a regular basis.
Some tools can populate directly from the source system, but some cannot. Therefore,
additional requirements outside the scope of the tool need to be identified for the
future.
To ensure consistency of data across all access tools, the data should not be
populated directly from the data warehouse; rather, each tool must have its own data
mart.
Identify Access Control Issues
There should be privacy rules to ensure that data is accessed by authorized users only.
For example, a data warehouse for a retail banking institution ensures that all the accounts
belong to the same legal entity. Privacy laws can force you to totally prevent access to
information that is not owned by the specific bank.
Data marts allow us to build a complete wall by physically separating data segments
within the data warehouse. To avoid possible privacy problems, the detailed data can be
removed from the data warehouse. We can create a data mart for each legal entity and
load it via the data warehouse, with detailed account data.
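The idea of one data mart per legal entity can be sketched as a simple split of account rows. The entity names, account fields, and the `read_mart` access rule are hypothetical; a production system would enforce the wall with database permissions rather than application code.

```python
from collections import defaultdict

# Hypothetical detailed account data held for several legal entities.
accounts = [
    {"account_id": 1, "legal_entity": "Bank A", "balance": 1200.0},
    {"account_id": 2, "legal_entity": "Bank B", "balance": 310.5},
    {"account_id": 3, "legal_entity": "Bank A", "balance": 87.25},
]


def build_entity_marts(rows):
    """Physically separate rows into one mart per legal entity."""
    marts = defaultdict(list)
    for row in rows:
        marts[row["legal_entity"]].append(row)
    return dict(marts)


def read_mart(marts, user_entity):
    """A user can read only the mart of their own legal entity."""
    return marts.get(user_entity, [])


marts = build_entity_marts(accounts)
print([row["account_id"] for row in read_mart(marts, "Bank A")])
```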
Designing Data Marts
Data marts should be designed as a smaller version of the starflake schema within the data
warehouse and should match the database design of the data warehouse. This helps
in maintaining control over database instances.
The summaries are data marted in the same way as they would have been designed
within the data warehouse. Summary tables help to utilize all dimension data in the
starflake schema.
Cost of Data Marting
The cost measures for data marting are as follows −
• Hardware and Software Cost
• Network Access
• Time Window Constraints
Hardware and Software Cost
Although data marts are created on the same hardware, they require some additional
hardware and software. Handling user queries requires additional processing power
and disk storage. If detailed data and the data mart exist within the data warehouse, then
we face additional cost to store and manage replicated data.
Data marting is more expensive than aggregations; therefore, it should be used as an
additional strategy and not as an alternative strategy.
Network Access
A data mart could be at a different location from the data warehouse, so we should
ensure that the LAN or WAN has the capacity to handle the data volumes being
transferred within the data mart load process.
Time Window Constraints
The extent to which a data mart loading process will eat into the available time window
depends on the complexity of the transformations and the data volumes being shipped.
The determination of how many data marts are possible depends on −
• Network capacity.
• Time window available
• Volume of data being transferred
• Mechanisms being used to insert data into a data mart
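These factors can be combined into a rough back-of-the-envelope estimate. The sketch below assumes that marts are loaded one after another, that each mart ships the same volume, and that transformation complexity can be expressed as a simple multiplier on transfer time; none of these simplifications come from the notes, and the figures are invented.

```python
def load_hours(volume_gb, network_gb_per_hour, transform_overhead=1.0):
    """Estimated hours to load one mart: transfer time scaled by an overhead factor."""
    return volume_gb / network_gb_per_hour * transform_overhead


def max_marts(window_hours, volume_gb, network_gb_per_hour, transform_overhead=1.0):
    """How many sequential mart loads fit into the available time window."""
    per_mart = load_hours(volume_gb, network_gb_per_hour, transform_overhead)
    return int(window_hours // per_mart)


# Hypothetical figures: 6-hour window, 40 GB per mart, 20 GB/hour link,
# transformations adding 50% to the raw transfer time.
print(load_hours(40, 20, 1.5))
print(max_marts(6, 40, 20, 1.5))
```

With these numbers each load takes about 3 hours, so only two marts fit in the window; a faster insert mechanism or a smaller volume would raise that count.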

Partitioning Strategy
Partitioning is done to enhance performance and facilitate easy management of data.
Partitioning also helps in balancing the various requirements of the system. It optimizes
hardware performance and simplifies the management of the data warehouse by
partitioning each fact table into multiple separate partitions. In this section, we
discuss different partitioning strategies.
Why is it Necessary to Partition?
Partitioning is important for the following reasons −
• For easy management,
• To assist backup/recovery,
• To enhance performance.
For Easy Management
The fact table in a data warehouse can grow up to hundreds of gigabytes in size. A
fact table of this size is very hard to manage as a single entity. Therefore it needs
partitioning.
To Assist Backup/Recovery
If we do not partition the fact table, then we have to load the complete fact table with
all the data. Partitioning allows us to load only as much data as is required on a regular
basis. It reduces the time to load and also enhances the performance of the system.
To cut down on the backup size, all partitions other than the current partition can be
marked as read-only. We can then put these partitions into a state where they cannot be
modified, and back them up once. After that, only the current partition needs to be
backed up regularly.
To Enhance Performance
By partitioning the fact table into sets of data, the query procedures can be enhanced.
Query performance is enhanced because now the query scans only those partitions that
are relevant. It does not have to scan the whole data.
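The performance benefit can be illustrated with a fact table held as one list of rows per month. This is only a sketch of the idea: the partition keys and the `total_sales` helper are assumptions, and a real database performs this pruning inside its query planner.

```python
from datetime import date

# Hypothetical fact rows, already partitioned by month.
partitions = {
    "2013-08": [
        {"product_id": 30, "sales_date": date(2013, 8, 3), "value": 3.67},
    ],
    "2013-09": [
        {"product_id": 35, "sales_date": date(2013, 9, 3), "value": 5.33},
        {"product_id": 40, "sales_date": date(2013, 9, 3), "value": 2.50},
        {"product_id": 45, "sales_date": date(2013, 9, 3), "value": 5.66},
    ],
}


def total_sales(month_key):
    """Sum sales for one month, scanning only that month's partition."""
    scanned = partitions.get(month_key, [])
    total = round(sum(row["value"] for row in scanned), 2)
    return total, len(scanned)


total, rows_scanned = total_sales("2013-09")
print(total, rows_scanned)
```

The query for September touches three rows instead of all four; on a fact table of hundreds of gigabytes the same effect is what makes partitioning worthwhile.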

Vertical partition
Vertical partitioning splits the data vertically, that is, by columns. The following
image depicts how vertical partitioning is done.
Vertical partitioning can be performed in the following two ways −
• Normalization
• Row Splitting
Normalization
Normalization is the standard relational method of database organization. In this
method, repeated rows are collapsed into a single row, which reduces space. Take a look at
the following tables that show how normalization is performed.
Table before Normalization

Product_id  Qty  Value  sales_date  Store_id  Store_name  Location   Region
30          5    3.67   3-Aug-13    16        sunny       Bangalore  S
35          4    5.33   3-Sep-13    16        sunny       Bangalore  S
40          5    2.50   3-Sep-13    64        san         Mumbai     W
45          7    5.66   3-Sep-13    16        sunny       Bangalore  S

Table after Normalization

Store_id  Store_name  Location   Region
16        sunny       Bangalore  S
64        san         Mumbai     W

Product_id  Qty  Value  sales_date  Store_id
30          5    3.67   3-Aug-13    16
35          4    5.33   3-Sep-13    16
40          5    2.50   3-Sep-13    64
45          7    5.66   3-Sep-13    16
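The same normalization step can be written as a short sketch. It takes the four rows of the table before normalization and separates the repeated store attributes from the sales facts; the function name `normalize` and the use of tuples and a dict are choices made for the example.

```python
# Rows of the table before normalization:
# (product_id, qty, value, sales_date, store_id, store_name, location, region)
flat_rows = [
    (30, 5, 3.67, "3-Aug-13", 16, "sunny", "Bangalore", "S"),
    (35, 4, 5.33, "3-Sep-13", 16, "sunny", "Bangalore", "S"),
    (40, 5, 2.50, "3-Sep-13", 64, "san", "Mumbai", "W"),
    (45, 7, 5.66, "3-Sep-13", 16, "sunny", "Bangalore", "S"),
]


def normalize(rows):
    """Split flat rows into a store table and a sales table keyed by store_id."""
    stores = {}
    sales = []
    for product_id, qty, value, sales_date, store_id, name, location, region in rows:
        # Repeated store details collapse into one entry per store_id.
        stores[store_id] = (name, location, region)
        sales.append((product_id, qty, value, sales_date, store_id))
    return stores, sales


stores, sales = normalize(flat_rows)
print(stores)
print(sales)
```

Store 16 appears three times in the flat rows but only once in `stores`, which is where the space saving comes from.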

Row Splitting
Row splitting tends to leave a one-to-one map between partitions. The motive of row
splitting is to speed up access to a large table by reducing its size.
While using vertical partitioning, make sure that there is no requirement to perform a
major join operation between two partitions.
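Row splitting can be sketched as dividing each row's columns between two partitions that share the same key. The column names and the split between frequently and rarely used columns are assumptions for illustration.

```python
def split_rows(rows, key, hot_columns):
    """Split each row into a frequently used part and a rarely used part."""
    hot, cold = {}, {}
    for row in rows:
        row_key = row[key]
        hot[row_key] = {col: row[col] for col in hot_columns}
        cold[row_key] = {
            col: val
            for col, val in row.items()
            if col != key and col not in hot_columns
        }
    return hot, cold


# Hypothetical wide rows: value is queried often, the note text rarely.
wide_rows = [
    {"transaction_id": 1, "value": 10.0, "note": "long free text"},
    {"transaction_id": 2, "value": 25.5, "note": "another long text"},
]
hot, cold = split_rows(wide_rows, "transaction_id", ["value"])
print(hot)
print(cold)
```

Both partitions hold exactly one entry per key, which is the one-to-one map described above; rebuilding a full row requires joining them on `transaction_id`, so this split pays off only if such joins are rare.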
Identify Key to Partition
It is crucial to choose the right partition key. Choosing the wrong partition key will
lead to reorganizing the fact table. Let's have an example. Suppose we want to partition
the following table.
Account_Txn_Table
transaction_id
account_id
transaction_type
value
transaction_date
region
branch_name
We can choose to partition on any key. Two possible keys are
• region
• transaction_date
Suppose the business is organized in 30 geographical regions and each region has a
different number of branches. That will give us 30 partitions, which is reasonable. This
partitioning is good enough because our requirements capture has shown that the vast
majority of queries are restricted to the user's own business region.
If we partition by transaction_date instead of region, then the latest transactions from
every region will be in one partition. Now a user who wants to look at data within their
own region has to query across multiple partitions.
Hence it is worth determining the right partitioning key.
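The effect of the key choice can be checked by counting how many partitions a region-restricted query has to touch. The sample rows are invented, and transaction_date is simplified to a month string so that each month forms one partition.

```python
from collections import defaultdict

# Hypothetical transactions; transaction_date is reduced to a month for brevity.
transactions = [
    {"transaction_id": 1, "region": "north", "transaction_date": "2024-01"},
    {"transaction_id": 2, "region": "south", "transaction_date": "2024-01"},
    {"transaction_id": 3, "region": "north", "transaction_date": "2024-02"},
    {"transaction_id": 4, "region": "south", "transaction_date": "2024-02"},
    {"transaction_id": 5, "region": "north", "transaction_date": "2024-03"},
]


def partition_by(rows, key):
    """Group rows into partitions using the value of `key`."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return dict(parts)


def partitions_touched(parts, region):
    """Count partitions holding at least one row for the given region."""
    return sum(
        1 for rows in parts.values() if any(r["region"] == region for r in rows)
    )


by_region = partition_by(transactions, "region")
by_date = partition_by(transactions, "transaction_date")
print(partitions_touched(by_region, "north"))
print(partitions_touched(by_date, "north"))
```

A query for the north region reads one partition when the table is partitioned by region, but three when it is partitioned by date.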

Horizontal Partition
There are various ways in which a fact table can be partitioned. In horizontal
partitioning, we have to keep in mind the requirements for manageability of the data
warehouse.

Partitioning by Time into Equal Segments


In this partitioning strategy, the fact table is partitioned on the basis of time period. Here
each time period represents a significant retention period within the business. For
example, if users query for month-to-date data, then it is appropriate to partition
the data into monthly segments. We can reuse the partitioned tables by removing the
data in them.
Partition by Time into Different-sized Segments
This kind of partitioning is done where aged data is accessed infrequently. It is
implemented as a set of small partitions for relatively current data and larger partitions
for inactive data.
• The detailed information remains available online.
• The number of physical tables is kept relatively small, which reduces the
operating cost.
• This technique is suitable where a mix of data dipping into recent history and data
mining through the entire history is required.
• This technique is not useful where the partitioning profile changes on a regular
basis, because repartitioning will increase the operating cost of the data warehouse.
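One way to picture different-sized segments is a rule that routes recent rows to monthly partitions and older rows to yearly ones. The three-month cutoff and the `sales_YYYY_MM` naming scheme below are assumptions; the notes do not prescribe either.

```python
from datetime import date


def partition_name(sales_date, today, recent_months=3):
    """Monthly partitions for recent data, yearly partitions for older data."""
    months_old = (today.year - sales_date.year) * 12 + (
        today.month - sales_date.month
    )
    if months_old < recent_months:
        return f"sales_{sales_date:%Y_%m}"
    return f"sales_{sales_date:%Y}"


today = date(2013, 10, 1)
print(partition_name(date(2013, 9, 3), today))
print(partition_name(date(2012, 5, 1), today))
```

As time passes the cutoff moves, so rows in a monthly partition eventually have to be merged into a yearly one; that is the repartitioning cost mentioned in the last point above.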
Partition on a Different Dimension
The fact table can also be partitioned on the basis of dimensions other than time, such
as product group, region, supplier, or any other dimension. Let's have an example.
Suppose a market function has been structured into distinct regional departments, for
example on a state-by-state basis. If each region wants to query information captured
within its region, it is more effective to partition the fact table into regional
partitions. This speeds up the queries because they do not need to scan
information that is not relevant.
• The query does not have to scan irrelevant data, which speeds up the query
process.
• This technique is not appropriate where the dimension is likely to change in
the future. So, it is worth confirming that the dimension will not change.
• If the dimension changes, then the entire fact table would have to be
repartitioned.
We recommend partitioning only on the basis of the time dimension, unless you
are certain that the suggested dimension grouping will not change within the life of the
data warehouse.
Partition by Size of Table
When there is no clear basis for partitioning the fact table on any dimension, then we
should partition the fact table on the basis of its size. We can set a predetermined
size as a critical point. When the table exceeds the predetermined size, a new table
partition is created.
• This partitioning is complex to manage.
• It requires metadata to identify what data is stored in each partition.
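A minimal sketch of size-based partitioning follows. It assumes row ids arrive in increasing order and measures size as a row count; the class name and the metadata layout (first and last id per partition) are hypothetical.

```python
class SizePartitionedTable:
    """Start a new partition whenever the current one reaches max_rows."""

    def __init__(self, max_rows):
        self.max_rows = max_rows
        self.partitions = [[]]
        self.metadata = []  # one entry per partition: id range it holds

    def insert(self, row_id, row):
        if len(self.partitions[-1]) >= self.max_rows:
            self.partitions.append([])
        self.partitions[-1].append((row_id, row))
        index = len(self.partitions) - 1
        if index == len(self.metadata):
            # First row of a new partition: record where its range starts.
            self.metadata.append(
                {"partition": index, "first_id": row_id, "last_id": row_id}
            )
        else:
            self.metadata[index]["last_id"] = row_id

    def find_partition(self, row_id):
        """Use the metadata to locate the partition holding row_id."""
        for entry in self.metadata:
            if entry["first_id"] <= row_id <= entry["last_id"]:
                return entry["partition"]
        return None


table = SizePartitionedTable(max_rows=2)
for row_id in range(1, 6):
    table.insert(row_id, {"value": row_id * 1.5})
print(table.metadata)
print(table.find_partition(4))
```

Because nothing in the data itself says which partition a row belongs to, every lookup goes through the metadata, which is why this scheme is harder to manage than partitioning by time.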
Partitioning Dimensions
If a dimension contains a large number of entries, then it may be necessary to partition the
dimension. Here we have to check the size of the dimension.
Consider a large dimension that changes over time. If we need to store all the variations in
order to make comparisons, that dimension may become very large. This would definitely
affect the response time.
Round Robin Partitions
In the round robin technique, when a new partition is needed, the old one is archived. It
uses metadata to allow user access tool to refer to the correct table partition.
This technique makes it easy to automate table management facilities within the data
warehouse.
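The round robin technique can be sketched with a fixed number of online partitions, where adding a new one archives the oldest. The class, the `sales_<period>` naming, and the limit of three online partitions are assumptions for the example.

```python
from collections import deque


class RoundRobinPartitions:
    """Keep a fixed number of partitions online; archive the oldest on overflow."""

    def __init__(self, max_online):
        self.max_online = max_online
        self.online = deque()
        self.archived = []

    def add_partition(self, period):
        if len(self.online) == self.max_online:
            # The oldest online partition is archived to make room.
            self.archived.append(self.online.popleft())
        self.online.append(f"sales_{period}")

    def table_for(self, period):
        """Metadata lookup: the online table for a period, or None if archived."""
        name = f"sales_{period}"
        return name if name in self.online else None


rotation = RoundRobinPartitions(max_online=3)
for period in ["2013_06", "2013_07", "2013_08", "2013_09"]:
    rotation.add_partition(period)

print(list(rotation.online))
print(rotation.archived)
```

An access tool that calls `table_for` never needs to know the rotation schedule, which is what makes the table management easy to automate.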
