Data Warehouse: Subject-Oriented, Integrated, Time-Variant, Non-Volatile
Subject-Oriented
-Focuses on a specific subject area of an organization.
Integrated
-Built from various data sources (databases or others) such as ERP, CRM, etc.
Time-Variant
-Contains records that vary over time for a specific customer or item.
Non-Volatile
-Data once entered into the DW cannot change. A change is stored as a new record rather than as a modification/update of an old record.
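The time-variant and non-volatile properties can be illustrated with a minimal sketch. The names warehouse_rows, record_price and price_history are hypothetical, and a plain Python list stands in for a warehouse table; a real DW would use a database.

```python
from datetime import date

# Hypothetical in-memory "warehouse table": rows are only ever appended.
warehouse_rows = []

def record_price(item, price, as_of):
    """Non-volatile: store a change as a new row, never as an update."""
    warehouse_rows.append((item, price, as_of))

def price_history(item):
    """Time-variant: every value the item has had, oldest first."""
    return sorted((r for r in warehouse_rows if r[0] == item), key=lambda r: r[2])

record_price("widget", 10.0, date(2024, 1, 1))
record_price("widget", 12.5, date(2024, 6, 1))  # the change arrives as a second row
```

Both rows remain available afterwards, so the old price can still be reported.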
Benefits of DW
Saves Time and Money
-If we want some portion of the data, we only need to query/search a single database rather than multiple data sources.
High ROI
-Through optimal planning and decision making.
Disadvantages of DW
-Initial implementation cost is high.
-Adding a new data source is costly and difficult.
-Cannot actively monitor changes.
-Data owners lose control over the data.
OLTP
-Real-time system.
-Stores day-to-day business operations.
-Involves fast inserts and updates.
-Limited storage capacity and data volume.
-Operations are generally initiated by end users.
OLAP
-Consolidated system.
-Stores historical data.
-Involves mainly long-running inserts.
-Huge storage capacity and data volume.
-Operations are generally initiated by scheduled batch jobs/programs.
DW Architecture
It includes the following components:
ETL
-It involves three jobs:
1. Extract: taking data from different sources.
2. Transform: converting the data into a common format.
3. Load: inserting/saving the data into the ODS (Operational Data Store).
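The three jobs can be sketched as three small functions. This is only an illustration under stated assumptions: the two source layouts (erp_rows, crm_rows), their field names, and the list named ods are all hypothetical stand-ins for real source systems and a real ODS.

```python
from datetime import datetime

# Two hypothetical sources that describe the same kind of sale differently.
erp_rows = [{"cust_id": "17", "amount": "120.50", "date": "2024-03-01"}]
crm_rows = [{"CustomerID": 17, "Total": 80, "When": "01/03/2024"}]  # DD/MM/YYYY

def extract():
    """Extract: pull rows from every source, tagging each with its origin."""
    return [("erp", r) for r in erp_rows] + [("crm", r) for r in crm_rows]

def transform(source, row):
    """Transform: map each source's layout onto one common format."""
    if source == "erp":
        return {
            "customer_id": int(row["cust_id"]),
            "amount": float(row["amount"]),
            "sale_date": datetime.strptime(row["date"], "%Y-%m-%d").date(),
        }
    return {
        "customer_id": int(row["CustomerID"]),
        "amount": float(row["Total"]),
        "sale_date": datetime.strptime(row["When"], "%d/%m/%Y").date(),
    }

def load(rows, target):
    """Load: append the conformed rows to the target store."""
    target.extend(rows)

ods = []  # stand-in for the ODS
load([transform(source, row) for source, row in extract()], ods)
```

After the run, both rows share the same keys and data types, which is the point of the transform step.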
ODS
-Temporary, small database.
-Supports tactical decision making.
-Optional.
-Can be updated daily, hourly, or even immediately after transactions on operational data.
-Subject-oriented.
DW
-Permanent, with a huge amount of data.
-Supports strategic decision making.
-Can be updated based on the needs of the organization.
Data Mart
-Subset of the DW.
-Provides a collective view for a group of users.
-Lower cost.
-It contains:
1) Fact table
-Holds measures/facts.
-Holds foreign keys that reference the dimension tables.
2) Dimension table (lookup table)
-Holds attributes.
Schemas
-A schema is a logical grouping of tables or data.
-There are three types:
Star Schema
-Simple.
-Contains one fact table and many dimension tables.
-Dimension tables are denormalized.
-All measures in the fact table have the same granularity level.
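A star schema can be sketched with Python's built-in sqlite3 module. The table and column names (dim_product, dim_date, fact_sales and so on) and the sample rows are hypothetical; the sketch only shows one fact table holding measures plus foreign keys into denormalized dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension (lookup) tables: descriptive attributes, denormalized.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER);

-- Fact table: measures plus one foreign key per dimension.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    quantity    INTEGER,
    revenue     REAL
);

INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
INSERT INTO dim_date VALUES (20240101, '2024-01-01', 2024);
INSERT INTO fact_sales VALUES (1, 20240101, 3, 30.0), (2, 20240101, 1, 25.0);
""")

# A typical query: join the fact table to a dimension and aggregate a measure.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category
""").fetchall()
```

Because category is stored directly in dim_product, the query needs only one join per dimension.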
Snowflake Schema
-Complex.
-Contains one fact table and many dimension tables.
-Dimension tables are normalized.
-Reduces redundancy in the dimension tables, but queries need more joins than in a star schema.
Galaxy Schema
-Complex.
-Contains multiple fact tables, multiple dimension tables, and conformed dimension tables.
Types of OLAP
Multi Dimensional OLAP
-Pre-aggregated.
-Queries are processed by the OLAP engine itself.
Relational OLAP
-Not pre-aggregated.
-Queries are processed by the DW.
Hybrid OLAP
-Pre-aggregated.
-Queries are processed by the OLAP engine itself as far as the pre-aggregated attributes allow.
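The difference between the three types can be sketched in a few lines. This is a toy model, not how any OLAP product is implemented: facts stands in for the detail rows held in the DW, cube for a pre-aggregated store, and all function names are hypothetical.

```python
# Hypothetical detail rows in the DW: (year, state, amount).
facts = [("2024", "CA", 100), ("2024", "IL", 50), ("2023", "CA", 70)]

def rolap_total(year):
    """ROLAP-style: nothing is pre-aggregated; scan the detail rows per query."""
    return sum(amount for y, _, amount in facts if y == year)

# MOLAP-style: aggregate once, ahead of any query (here: totals by year only).
cube = {}
for y, _, amount in facts:
    cube[y] = cube.get(y, 0) + amount

def molap_total(year):
    """MOLAP-style: answer directly from the pre-aggregated cube."""
    return cube.get(year, 0)

def holap_total(year, state=None):
    """HOLAP-style: use the cube when it holds the requested attributes,
    otherwise fall back to the detail rows."""
    if state is None:
        return molap_total(year)
    return sum(a for y, s, a in facts if y == year and s == state)
```

For example, holap_total("2024") is served from the cube, while holap_total("2024", "CA") falls back to the detail rows because the cube was not aggregated by state.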
Project Planning
-It addresses the definition and scoping of the data warehouse project.
-Project planning then focuses on resource and skill-level staffing requirements, coupled with project task assignments, duration, and sequencing:
-Scope definition
-Task identification
-Scheduling
-Resource planning
-Workload assignment
-The end document represents a blueprint of the project.
The following must be evaluated:
-Hardware platform
-Database management system
-Extract-transformation-load (ETL) tools
-Data access query tools
-Reporting tools
-Installation of the selected products/components/tools.
-Testing of the installed products to ensure appropriate end-to-end integration within the data warehouse environment.
Data
-Design of the dimensional model.
-The physical design of the model.
-Extraction, transformation, and loading (ETL) of source data into the target models.
Business intelligence applications
Arrows in the diagram indicate the activity workflow along each of the parallel tracks. Dependencies between the tasks are illustrated by the vertical alignment of the task boxes.
Program/Project Management
-Enforces the project plan.
-Activities:
-Status monitoring
-Issue tracking
-Development of a comprehensive communication plan that addresses both the business and IT units
Dimensional Modeling
-Determines the data needed to address business users' analytical requirements.
A. Conceptual data model
-High-level relationships are specified.
B. Logical data model
-Entities + relationships + attributes + foreign keys.
C. Physical data model
-Shows table structures.
-Hardware, DBMS, and data staging software are selected and installed.
End User Application Specification
Physical Design
-Defining the physical structures.
-Setting up the database environment.
-Setting up appropriate security.
-Preliminary performance tuning strategies, from indexing to partitioning and aggregations.
-If appropriate, OLAP databases are also designed during this process.
ETL
-The MOST important stage.
-70% of the risk and effort in the DW project is attributed to this stage.
-ETL system capabilities:
-Extraction
-Cleansing and conforming
-Delivery and management
-Raw data is extracted from the operational source systems and transformed into meaningful information for the business.
-ETL processes must be architected long before any data is extracted from the source.
-The ETL system strives to deliver high throughput as well as high-quality output.
-Incoming data is checked for reasonable quality.
-Data quality conditions are continuously monitored.
-Kimball calls ETL the data warehouse "back room".
Deployment
-It is crucial that adequate planning is performed to make sure that:
-the results of the technology, data, and BI application tracks are tested and fit together properly;
-appropriate education and support infrastructure is in place.
-It is critical that deployment be well orchestrated.
-Deployment should be deferred if all the pieces, such as training, documentation, and validated data, are not ready for production release.
Maintenance
-Occurs when the system is in production.
-Includes technical operational tasks that are necessary to keep the system performing optimally:
-usage monitoring
-performance tuning
-index maintenance
-system backup
-Ongoing support, education, and communication with business users.
Growth
-DW systems tend to expand (if they are successful).
-Growth is considered a sign of success.
-New requests need to be prioritized.
-Starting the cycle again:
-Building upon the foundation that has already been established.
-Focusing on the new requirements.
Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State
123           ABC            Acme Supply Co  CA
In this example, Supplier_Code is the natural key and Supplier_Key is a surrogate key. Technically, the surrogate key is not necessary, since the table will be unique by the natural key (Supplier_Code). However, the joins will perform better on an integer than on a character string.
Now imagine that this supplier moves their headquarters to Illinois. The updated table would simply overwrite this record:
Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State
123           ABC            Acme Supply Co  IL
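The overwrite can be sketched as follows. The dictionary layout and the function name apply_type1 are hypothetical; the dimension is keyed by the natural key (Supplier_Code) purely for brevity.

```python
# Hypothetical dimension table, keyed by the natural key Supplier_Code.
supplier_dim = {
    "ABC": {
        "Supplier_Key": 123,
        "Supplier_Name": "Acme Supply Co",
        "Supplier_State": "CA",
    },
}

def apply_type1(dim, supplier_code, **changes):
    """Type 1: overwrite the attributes in place; the old values are lost."""
    dim[supplier_code].update(changes)

apply_type1(supplier_dim, "ABC", Supplier_State="IL")
```

After the call there is still exactly one row for ABC, with the same surrogate key, and nothing records that the state was ever CA.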
The obvious disadvantage to this method of managing SCDs is that there is no historical record kept in the data warehouse. You can't tell if your suppliers are tending to move to the Midwest, for example. But an advantage to Type 1 SCDs is that they are very easy to maintain. If you have calculated an aggregate table summarizing facts by state, it will need to be recalculated when the Supplier_State is changed.[1]
The Type 2 method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. With Type 2, we have unlimited history preservation, as a new record is inserted each time a change is made. In the same example, if the supplier moves to Illinois, the table could look like this, with incremented version numbers to indicate the sequence of changes:
Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State  Version
123           ABC            Acme Supply Co  CA
124           ABC            Acme Supply Co  IL
Another popular method for tuple versioning is to add 'effective date' columns.
Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State  Start_Date   End_Date
123           ABC            Acme Supply Co  CA              01-Jan-2000  21-Dec-2004
124           ABC            Acme Supply Co  IL              22-Dec-2004
The null End_Date in row two indicates the current tuple version. In some cases, a standardized surrogate high date (e.g. 9999-12-31) may be used as an end date, so that the field can be included in an index, and so that null-value substitution is not required when querying. Transactions that reference a particular surrogate key (Supplier_Key) are then permanently bound to the time slices defined by that row of the slowly changing dimension table. An aggregate table summarizing facts by state continues to reflect the historical state, i.e. the state the supplier was in at the time of the transaction; no update is needed.
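The effective-date variant of Type 2 can be sketched as below. The function names apply_type2 and key_as_of are hypothetical, None plays the role of the null End_Date, and generating the next surrogate key as "highest key + 1" is an assumption made only for this illustration.

```python
from datetime import date, timedelta

supplier_dim = [
    {"Supplier_Key": 123, "Supplier_Code": "ABC", "Supplier_Name": "Acme Supply Co",
     "Supplier_State": "CA", "Start_Date": date(2000, 1, 1), "End_Date": None},
]

def apply_type2(dim, supplier_code, new_state, change_date):
    """Type 2: close the current row and insert a new row with a new surrogate key."""
    current = next(r for r in dim
                   if r["Supplier_Code"] == supplier_code and r["End_Date"] is None)
    current["End_Date"] = change_date - timedelta(days=1)
    new_row = dict(
        current,
        Supplier_Key=max(r["Supplier_Key"] for r in dim) + 1,  # assumed key generator
        Supplier_State=new_state,
        Start_Date=change_date,
        End_Date=None,  # null End_Date marks the current version
    )
    dim.append(new_row)
    return new_row

def key_as_of(dim, supplier_code, on):
    """Return the surrogate key whose time slice contains the given date."""
    for r in dim:
        if (r["Supplier_Code"] == supplier_code and r["Start_Date"] <= on
                and (r["End_Date"] is None or on <= r["End_Date"])):
            return r["Supplier_Key"]
    return None

apply_type2(supplier_dim, "ABC", "IL", date(2004, 12, 22))
```

A transaction dated in 2003 resolves to key 123 (CA) and one dated in 2010 resolves to key 124 (IL), which is why aggregates by state keep reflecting the state at the time of the transaction.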
If there are retrospective changes made to the contents of the dimension, or if new attributes are added to the dimension (for example a Sales_Rep column) which have different effective dates from those already defined, then this can result in the existing transactions needing to be updated to reflect the new situation. This can be an expensive database operation, so Type 2 SCDs are not a good choice if the dimensional model is subject to change.
The Type 3 method tracks changes using separate columns. Whereas Type 2 had unlimited history preservation, Type 3 has limited history preservation, as it's limited to the number of columns designated for storing historical data. Where the original table structure in Type 1 and Type 2 was very similar, Type 3 adds additional columns to the tables. In the following example, an additional column has been added to the table so as to record the supplier's original state (only the previous history is stored):
Supplier_Key  Supplier_Code  Supplier_Name   Original_Supplier_State  Effective_Date  Current_Supplier_State
123           ABC            Acme Supply Co  CA                       22-Dec-2004     IL
Note that this record, having only a column for the original state and a column for the current state, cannot track all historical changes, such as when a supplier moves a second time. One variation of this type is to create the field Previous_Supplier_State instead of Original_Supplier_State, which would then track only the most recent historical change.
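The limitation is easy to see in a small sketch. The function name apply_type3 is hypothetical, the row's starting values before the first move are assumed, and the second move (to TX in 2010) is invented purely to show what gets lost.

```python
from datetime import date

# One row per supplier; history lives in extra columns, not extra rows.
supplier = {
    "Supplier_Key": 123,
    "Supplier_Code": "ABC",
    "Supplier_Name": "Acme Supply Co",
    "Original_Supplier_State": "CA",
    "Effective_Date": None,          # assumed empty until the first change
    "Current_Supplier_State": "CA",
}

def apply_type3(row, new_state, effective):
    """Type 3: only the 'current' column and its effective date are updated."""
    row["Current_Supplier_State"] = new_state
    row["Effective_Date"] = effective

apply_type3(supplier, "IL", date(2004, 12, 22))
apply_type3(supplier, "TX", date(2010, 1, 1))  # hypothetical second move
```

After the second call the row holds CA (original) and TX (current); the intermediate state IL is no longer recorded anywhere.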
Types of Dimensions
Conformed dimension
A conformed dimension is a set of data attributes that have been physically implemented in multiple database tables using the same structure, attributes, domain values, definitions and concepts in each implementation. A conformed dimension cuts across many facts.
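A conformed dimension cutting across facts can be sketched as two fact tables that share one date dimension. All names here (dim_date, fact_sales, fact_returns, by_quarter) and the sample figures are hypothetical.

```python
# One shared (conformed) date dimension, keyed by date_key.
dim_date = {
    20240101: {"year": 2024, "quarter": "Q1"},
    20240401: {"year": 2024, "quarter": "Q2"},
}

# Two hypothetical fact tables; both reference the same dim_date keys.
fact_sales = [(20240101, 100.0), (20240401, 40.0)]
fact_returns = [(20240101, 10.0)]

def by_quarter(facts):
    """Roll a fact table up to the quarter attribute of the shared dimension."""
    totals = {}
    for date_key, amount in facts:
        quarter = dim_date[date_key]["quarter"]
        totals[quarter] = totals.get(quarter, 0.0) + amount
    return totals

sales = by_quarter(fact_sales)
returns = by_quarter(fact_returns)

# Because both rollups use identical quarter values, they can be combined directly.
net = {q: sales[q] - returns.get(q, 0.0) for q in sales}
```

The combination in the last line only works because both fact tables use the same dimension keys and attribute definitions.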
Junk Dimension
Junk dimensions are dimensions that contain miscellaneous data (like flags and indicators) that do not fit in the base dimension table.
Degenerate dimension
A degenerate dimension is data that is dimensional in nature but stored in a fact table. For example, if you have a dimension that only has Order Number and Order Line Number, you would have a 1:1 relationship with the fact table. Do you want to have two tables with a billion rows, or one table with a billion rows? Therefore, this would be a degenerate dimension, and Order Number and Order Line Number would be stored in the fact table.