
Data Warehouse: Subject-Oriented Integrated Time Variant Non Volatile

The document discusses data warehouses, their key characteristics and components. It describes how a data warehouse integrates data from multiple sources, provides a single consistent view of data over time, and is optimized for analysis rather than transactions. It also outlines the typical architecture of a data warehouse including the extraction, transformation and loading of data, operational data stores, data marts and dimensional modeling.
Copyright
© Attribution Non-Commercial (BY-NC)

Data Warehouse

Subject-Oriented
-Focuses on a specific subject area of an organization.

Integrated
-Built from various data sources (databases or others) such as ERP, CRM, etc.

Time-Variant
-Contains records of a specific customer or item that vary with time.

Non-Volatile
-Data once entered into the DW cannot change. This means a new record is added rather than an old record being modified/updated.
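
The time-variant and non-volatile properties can be illustrated with a minimal sketch. The table layout and the names `warehouse_rows`, `load_snapshot` and `history` are assumptions made for this example, not part of any product: each change arrives as a new dated row, and existing rows are never modified.

```python
from datetime import date

# Hypothetical in-memory "warehouse" table: a list of (customer_id, balance, as_of) rows.
warehouse_rows = []

def load_snapshot(customer_id, balance, as_of):
    """Non-volatile: append a new dated row; existing rows are never modified."""
    warehouse_rows.append((customer_id, balance, as_of))

def history(customer_id):
    """Time-variant: return every stored version for one customer, oldest first."""
    rows = [r for r in warehouse_rows if r[0] == customer_id]
    return sorted(rows, key=lambda r: r[2])

load_snapshot("C001", 100.0, date(2024, 1, 31))
load_snapshot("C001", 250.0, date(2024, 2, 29))  # the change arrives as a NEW row
```

Calling `history("C001")` returns both rows, so the earlier balance is still available for analysis.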

Benefits of DW
Saves Time and Money
-To get some portion of the data, we need to query/search only a single database rather than multiple data sources.

High ROI
-Through better planning and decision making.

Data Consistency and Quality
-Data in the DW comes from different departments (sources) such as Sales, Manufacturing, Finance and Accounting, and is standardized to a common format. So the data is accurate and consistent.

Enhanced Business Intelligence
-Through OLAP, the business processes/strategy of an organization are enhanced, so the DW provides BI.

Disadvantages of DW
-Initial implementation cost is high.
-Adding a new data source is costly and difficult.
-Cannot actively monitor the changes.
-Data owners lose control over the data.

OLTP
-Real-time system.
-Stores day-to-day business operations.
-Involves fast inserts and updates.
-Limited storage capacity and data volume.
-Operations are generally initiated by end users.

OLAP
-Consolidated system.
-Stores historical data.
-Involves only long-running inserts (loads).
-Huge storage capacity and data volume.
-Operations are generally initiated by scheduled batch jobs/programs.

DW Architecture
It includes the following

Different Data Sources


-For Example ERP, CRM, SCM, E-Commerce, Legacy, External or Other (Flat files, Excel).

ETL
-It involves three jobs:
1. Extract -Taking data from the different sources.
2. Transform -Changing the data into a common format.
3. Load -Inserting/saving the data into the ODS (Operational Data Store).
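
The three ETL jobs can be sketched roughly as below. This is a minimal illustration only: the two source record layouts, the field names and the `ods_sales` table are hypothetical, and Python's built-in `sqlite3` module stands in for the target database.

```python
import sqlite3

# Hypothetical source extracts; real sources would be ERP/CRM tables or flat files.
crm_rows = [{"cust": "Acme", "amount": "1,200.50", "date": "31/01/2024"}]
erp_rows = [{"customer_name": "ACME", "amt": 300.0, "order_date": "2024-02-15"}]

def extract():
    """Extract: take the raw records from each source, tagged with the source name."""
    return [("crm", r) for r in crm_rows] + [("erp", r) for r in erp_rows]

def transform(tagged_rows):
    """Transform: convert every record to one common (customer, amount, ISO date) format."""
    out = []
    for source, r in tagged_rows:
        if source == "crm":
            day, month, year = r["date"].split("/")
            amount = float(r["amount"].replace(",", ""))
            out.append((r["cust"].upper(), amount, f"{year}-{month}-{day}"))
        else:
            out.append((r["customer_name"].upper(), float(r["amt"]), r["order_date"]))
    return out

def load(conn, rows):
    """Load: insert the standardized rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ods_sales (customer TEXT, amount REAL, sale_date TEXT)"
    )
    conn.executemany("INSERT INTO ods_sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
loaded = conn.execute(
    "SELECT customer, amount, sale_date FROM ods_sales ORDER BY sale_date"
).fetchall()
```

After the run, `loaded` holds both sales in the same format even though the sources used different field names, number formats and date formats.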

ODS
-Temporary, small database.
-Supports tactical decision making.
-Optional.
-Can be updated daily, hourly, or even immediately after transactions on operational data.
-Subject-oriented.

DW
-Permanent, huge amount of data.
-Supports strategic decision making.
-Can be updated based on the needs of the organization.

Data Mart
-Subset of the DW.
-Provides a collective view for a group of users.
-Lower cost.

Dimensional Data Model

-It contains:
1) Fact Table
-Has measures/facts.
-Has foreign keys that reference the dimension tables.
2) Dimension Table (Lookup Table)
-Has attributes.

Schemas
-A schema is a logical grouping of tables or data.
-There are 3 types:

Star Schema
-Contains one fact table and many dimension tables.
-Dimension tables are de-normalized.
-All measures in the fact table have the same level of granularity.
-Simple.
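
A star schema of this shape can be sketched with Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_date`), columns and sample rows are invented for illustration; the point is one fact table holding measures and foreign keys, joined to de-normalized dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension (lookup) tables hold descriptive attributes.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER);

-- The fact table holds measures plus one foreign key per dimension.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    quantity INTEGER,
    revenue REAL
);

INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Manual', 'Books');
INSERT INTO dim_date VALUES (20240131, '2024-01-31', 2024), (20250131, '2025-01-31', 2025);
INSERT INTO fact_sales VALUES (1, 20240131, 2, 20.0), (1, 20250131, 1, 10.0), (2, 20240131, 3, 45.0);
""")

# A typical analytical query: join the fact table to its dimensions, then aggregate.
revenue_by_category_year = conn.execute("""
    SELECT p.category, d.year, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()
```

Each dimension is reached with a single join from the fact table, which is what keeps queries on a star schema simple.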

Snowflake Schema
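-Complex.
-Contains one fact table and many dimension tables.
-Dimension tables are normalized.
-Reduces redundancy in the dimension tables, but queries need more joins than in a star schema, so query performance is generally not improved.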

Galaxy Schema
-Complex.
-Contains multiple fact tables, multiple dimension tables and conformed dimension tables.

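Types of OLAP
Multidimensional OLAP (MOLAP)
-Pre-aggregated.
-Queries are processed by the OLAP engine itself.

Relational OLAP (ROLAP)
-Not pre-aggregated.
-Queries are processed by the DW.

Hybrid OLAP (HOLAP)
-Pre-aggregated.
-Queries are processed by the OLAP engine itself as far as the pre-aggregated attributes allow; the rest are processed by the DW.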
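
The difference between pre-aggregated and not pre-aggregated can be sketched as follows. This is a simplified illustration with invented data and function names, not how any particular OLAP product works: the ROLAP-style function scans the detail rows at query time, while the MOLAP-style cube is built once and then answered by lookup.

```python
from collections import defaultdict

# Hypothetical detail rows: (region, year, sales amount).
facts = [("East", "2024", 100.0), ("East", "2025", 150.0), ("West", "2024", 80.0)]

def rolap_total(region):
    """ROLAP-style: not pre-aggregated, so the detail rows are scanned on every query."""
    return sum(amount for r, _, amount in facts if r == region)

def build_cube(rows):
    """MOLAP-style: pre-aggregate once, per (region, year) and per (region, all years)."""
    cube = defaultdict(float)
    for region, year, amount in rows:
        cube[(region, year)] += amount
        cube[(region, "ALL")] += amount
    return cube

cube = build_cube(facts)

def molap_total(region):
    """MOLAP-style: the query is answered from the pre-built cube by a single lookup."""
    return cube[(region, "ALL")]
```

Both functions return the same total for a region; the cube trades build time and storage for faster queries. A hybrid approach would answer from the cube when the requested combination was pre-aggregated and fall back to the detail rows otherwise.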

DW IMPLEMENTATION LIFE CYCLE

Project Planning
-Addresses the definition and scoping of the data warehouse project.
-Then focuses on resource and skill-level staffing requirements, coupled with project task assignments, duration and sequencing:
-Scope definition
-Task identification
-Scheduling
-Resource planning
-Workload assignment
-The end document represents a blueprint of the project.

Business Requirements Definition
-Understanding of the business end users and their analytical requirements.
-Success of the project depends on a solid understanding of the business requirements!
-Understanding the key factors driving the business is crucial for successful translation of the business requirements into design considerations.
-Business requirements definition is followed by 3 concurrent tracks focusing on:
1) Technology
-Overall architectural framework and vision.
-Considerations: the business requirements, the current technical environment, and the planned strategic technical directions.
-Based on the designed technical architecture, evaluation and selection of products that will deliver the needed capabilities.
-The hardware platform, database management system, extract-transformation-load (ETL) tools, data access query tools and reporting tools must be evaluated.
-Installation of the selected products/components/tools.
-Testing of the installed products to ensure appropriate end-to-end integration within the data warehouse environment.
2) Data
-Design of the dimensional model.
-The physical design of the model.
-Extraction, transformation, and loading (ETL) of source data into the target models.
3) Business intelligence applications
-Arrows in the diagram indicate the activity workflow along each of the parallel tracks.
-Dependencies between the tasks are illustrated by the vertical alignment of the task boxes.

Program/Project Management
-Enforces the project plan.
-Activities:
-Status monitoring
-Issue tracking
-Development of a comprehensive communication plan that addresses both the business and IT units

Dimensional Modeling
-Determines the data needed to address the business users' analytical requirements.
A. Conceptual Data Model
-High-level relationships are specified.
B. Logical Data Model
-Entities + relationships + attributes + foreign keys.
C. Physical Data Model
-Shows table structures.

Data Staging Design & Development


-Initial Load and Incremental Loads -ETL Tool is used.
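
The difference between an initial load and an incremental load can be sketched as below. The "last updated" watermark used here is one common approach and is an assumption of this example, as are the row layout and the `load` function; an ETL tool would normally manage this bookkeeping itself.

```python
# Hypothetical source rows with an ISO-formatted last-updated timestamp.
source_rows = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-05"},
]
target_rows = []

def load(since=None):
    """Copy rows changed after `since` (every row when since is None).

    Returns the new watermark: the latest updated_at value that was loaded,
    or the old watermark when nothing new was found.
    """
    batch = [r for r in source_rows if since is None or r["updated_at"] > since]
    target_rows.extend(batch)
    return max((r["updated_at"] for r in batch), default=since)

watermark = load()                 # initial load: all existing rows are copied
source_rows.append({"id": 3, "updated_at": "2024-01-09"})
watermark = load(watermark)        # incremental load: only the new row is copied
```

ISO dates compare correctly as strings, which is why a plain `>` comparison is enough in this sketch.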

Product Selection and Installation
-Hardware, DBMS and data staging software are selected and installed.

End User Application Specification

Physical Design
-Defining the physical structures.
-Setting up the database environment.
-Setting up appropriate security.
-Preliminary performance tuning strategies, from indexing to partitioning and aggregations.
-If appropriate, OLAP databases are also designed during this process.

ETL Design and Development
-The MOST important stage: 70% of the risk and effort in the DW project is attributed to this stage.
-ETL system capabilities: extraction, cleansing and conforming, delivery and management.
-Raw data is extracted from the operational source systems and transformed into meaningful information for the business.
-ETL processes must be architected long before any data is extracted from the source.
-The ETL system strives to deliver high throughput as well as high-quality output.
-Incoming data is checked for reasonable quality.
-Data quality conditions are continuously monitored.
-Kimball calls ETL the data warehouse "back room".

Deployment
-It is crucial that adequate planning has been performed to make sure that the results of the technology, data, and BI application tracks are tested and fit together properly, and that appropriate education and support infrastructure is in place.
-It is critical that deployment be well orchestrated.
-Deployment should be deferred if all the pieces, such as training, documentation, and validated data, are not ready for production release.

Maintenance
-Occurs when the system is in production.
-Includes technical operational tasks that are necessary to keep the system performing optimally: usage monitoring, performance tuning, index maintenance, system backup.
-Ongoing support, education, and communication with business users.

Growth
-DW systems tend to expand (if they were successful); this is considered a sign of success.
-New requests need to be prioritized.
-Starting the cycle again: building upon the foundation that has already been established and focusing on the new requirements.

Slowly changing dimension
The usual changes to dimension tables are classified into three types: Type 1, Type 2 and Type 3.

Type 1
The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all. This is most appropriate when correcting certain types of data errors, such as the spelling of a name. (Assuming you won't ever need to know how it used to be misspelled in the past.) Here is an example of a database table that keeps supplier information:

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State
123           ABC            Acme Supply Co  CA

In this example, Supplier_Code is the natural key and Supplier_Key is a surrogate key. Technically, the surrogate key is not necessary, since the table will be unique by the natural key (Supplier_Code). However, the joins will perform better on an integer than on a character string. Now imagine that this supplier moves their headquarters to Illinois. The updated table would simply overwrite this record:

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State
123           ABC            Acme Supply Co  IL
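
The Type 1 overwrite can be sketched in a few lines. The dictionary layout and the function name `scd_type1_update` are assumptions made for this illustration; the column names and values are taken from the supplier example.

```python
# The supplier dimension, keyed by the natural key (Supplier_Code).
suppliers = {
    "ABC": {
        "Supplier_Key": 123,
        "Supplier_Code": "ABC",
        "Supplier_Name": "Acme Supply Co",
        "Supplier_State": "CA",
    },
}

def scd_type1_update(table, supplier_code, **changes):
    """Type 1: overwrite the attributes in place; the old values are not kept."""
    table[supplier_code].update(changes)

scd_type1_update(suppliers, "ABC", Supplier_State="IL")
```

After the call the row still has surrogate key 123, but nothing in the table records that the state was ever "CA".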

The obvious disadvantage to this method of managing SCDs is that there is no historical record kept in the data warehouse. You can't tell if your suppliers are tending to move to the Midwest, for example. But an advantage to Type 1 SCDs is that they are very easy to maintain. If you have calculated an aggregate table summarizing facts by state, it will need to be recalculated when the Supplier_State is changed.[1]

Type 2
The Type 2 method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made. In the same example, if the supplier moves to Illinois, the table could look like this, with incremented version numbers to indicate the sequence of changes:

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State  Version
123           ABC            Acme Supply Co  CA              0
124           ABC            Acme Supply Co  IL              1

Another popular method for tuple versioning is to add 'effective date' columns:

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State  Start_Date   End_Date
123           ABC            Acme Supply Co  CA              01-Jan-2000  21-Dec-2004
124           ABC            Acme Supply Co  IL              22-Dec-2004

The null End_Date in row two indicates the current tuple version. In some cases, a standardized surrogate high date (e.g. 9999-12-31) may be used as an end date, so that the field can be included in an index, and so that null-value substitution is not required when querying. Transactions that reference a particular surrogate key (Supplier_Key) are then permanently bound to the time slices defined by that row of the slowly changing dimension table. An aggregate table summarizing facts by state continues to reflect the historical state, i.e. the state the supplier was in at the time of the transaction; no update is needed.
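
The effective-date variant of Type 2 can be sketched as follows. The list-of-dicts layout and the function names are assumptions for illustration; `None` plays the role of the null End_Date that marks the current version.

```python
from datetime import date, timedelta

supplier_dim = [
    {"Supplier_Key": 123, "Supplier_Code": "ABC", "Supplier_Name": "Acme Supply Co",
     "Supplier_State": "CA", "Start_Date": date(2000, 1, 1), "End_Date": None},
]

def scd_type2_update(dim, supplier_code, new_state, change_date):
    """Type 2: close the current row and append a new version with a new surrogate key."""
    current = next(
        r for r in dim if r["Supplier_Code"] == supplier_code and r["End_Date"] is None
    )
    current["End_Date"] = change_date - timedelta(days=1)
    new_row = dict(
        current,
        Supplier_Key=max(r["Supplier_Key"] for r in dim) + 1,
        Supplier_State=new_state,
        Start_Date=change_date,
        End_Date=None,  # null End_Date marks the current version
    )
    dim.append(new_row)
    return new_row

def state_as_of(dim, supplier_code, when):
    """Return the state that was in effect for the supplier on the given date."""
    for r in dim:
        if r["Supplier_Code"] != supplier_code:
            continue
        if r["Start_Date"] <= when and (r["End_Date"] is None or when <= r["End_Date"]):
            return r["Supplier_State"]
    return None

scd_type2_update(supplier_dim, "ABC", "IL", date(2004, 12, 22))
```

Facts that reference key 123 stay tied to the California time slice, and facts that reference key 124 to the Illinois one, so `state_as_of` can still answer questions about either period.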

If there are retrospective changes made to the contents of the dimension, or if new attributes are added to the dimension (for example a Sales_Rep column) which have different effective dates from those already defined, then this can result in the existing transactions needing to be updated to reflect the new situation. This can be an expensive database operation, so Type 2 SCDs are not a good choice if the dimensional model is subject to change.

Type 3
The Type 3 method tracks changes using separate columns. Whereas Type 2 had unlimited history preservation, Type 3 has limited history preservation, as it's limited to the number of columns designated for storing historical data. Where the original table structure in Type 1 and Type 2 was very similar, Type 3 adds additional columns to the tables. In the following example, an additional column has been added to the table so as to record the supplier's original state (only the previous history is stored):

Supplier_Key  Supplier_Code  Supplier_Name   Original_Supplier_State  Effective_Date  Current_Supplier_State
123           ABC            Acme Supply Co  CA                       22-Dec-2004     IL

Note that this record, having only a column for the original state and a column for the current state, cannot track all historical changes, such as when a supplier moves a second time. One variation of this type is to create the field Previous_Supplier_State instead of Original_Supplier_State, which would then track only the most recent historical change.
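
The limited history of Type 3 can be seen in a small sketch. The dictionary layout, the function name and the second move to "NY" are hypothetical; the column names follow the supplier example.

```python
from datetime import date

supplier = {
    "Supplier_Key": 123,
    "Supplier_Code": "ABC",
    "Supplier_Name": "Acme Supply Co",
    "Original_Supplier_State": "CA",
    "Effective_Date": None,          # assumption: no change recorded yet
    "Current_Supplier_State": "CA",
}

def scd_type3_update(row, new_state, effective_date):
    """Type 3: overwrite the current-state column; the original-state column is untouched."""
    row["Current_Supplier_State"] = new_state
    row["Effective_Date"] = effective_date

scd_type3_update(supplier, "IL", date(2004, 12, 22))
scd_type3_update(supplier, "NY", date(2010, 3, 1))  # hypothetical second move
```

After the second move the row holds only "CA" (original) and "NY" (current); the intermediate state "IL" is lost, which is the limitation described above.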

Types of Dimensions
Conformed dimension
A conformed dimension is a set of data attributes that have been physically implemented in multiple database tables using the same structure, attributes, domain values, definitions and concepts in each implementation. A conformed dimension cuts across many facts.

Junk Dimension
Junk dimensions are dimensions that contain miscellaneous data (like flags and indicators) that do not fit in the base dimension table.

Degenerate dimension
A degenerate dimension is data that is dimensional in nature but stored in a fact table. For example, if you have a dimension that only has Order Number and Order Line Number, you would have a 1:1 relationship with the fact table. Do you want to have two tables with a billion rows, or one table with a billion rows? Therefore, this would be a degenerate dimension, and Order Number and Order Line Number would be stored in the fact table.
