5 Data Warehouse
DATA WAREHOUSE & DATA MINING
WHAT IS AGGREGATION
The term "aggregation" refers to the process of collecting and combining multiple individual
items, data points, or elements into a single summarized or consolidated whole.
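As a small illustration of the definition above, the sketch below combines individual sales records into one total per region. The records and the name aggregate_by_region are made up for this example.

```python
from collections import defaultdict

# Hypothetical individual data points: (region, sale amount).
sales = [("North", 120.0), ("South", 80.0), ("North", 50.0), ("South", 20.0)]

def aggregate_by_region(records):
    """Combine individual sale amounts into one total per region."""
    totals = defaultdict(float)
    for region, amount in records:
        totals[region] += amount
    return dict(totals)

totals = aggregate_by_region(sales)
```

Four individual records are reduced to two summary values, one per region.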
WHAT IS OLAP
OLAP (Online Analytical Processing) is a category of software tools that provide analysis of
data stored in a database. OLAP tools enable users to interactively analyze multidimensional
data from multiple perspectives, providing the ability to perform complex calculations, trend
analysis, and data modeling.
STUDY4SUB
OLAP Operations
1. Roll-up (Consolidation): Aggregating data along a dimension. For example, rolling up daily sales data to a monthly or yearly level.
2. Drill-down: Breaking down data into finer granularity. For example, drilling down from yearly sales data to quarterly, monthly, or daily data.
3. Slice: Taking a single layer of data from the cube. For example, analyzing sales data for a particular year.
4. Dice: Creating a sub-cube by selecting specific values for multiple dimensions. For example, analyzing sales data for a specific product and region for a given time period.
5. Pivot (Rotate): Reorienting the multidimensional view of data. For example, switching rows and columns to provide a different perspective on the data.
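The five operations can be sketched on a tiny in-memory fact table. This is only an illustration in plain Python, not how an OLAP engine is implemented; the rows, the dimension order, and all function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, month, region, product, sales amount).
facts = [
    (2023, 1, "East", "Pen", 10),
    (2023, 1, "West", "Pen", 20),
    (2023, 2, "East", "Ink", 30),
    (2024, 1, "East", "Pen", 40),
    (2024, 2, "West", "Ink", 50),
]

def roll_up(rows):
    """Roll-up: aggregate monthly rows to yearly totals."""
    out = defaultdict(int)
    for year, _month, _region, _product, amount in rows:
        out[year] += amount
    return dict(out)

def drill_down(rows, year):
    """Drill-down: break one year's total into monthly totals."""
    out = defaultdict(int)
    for y, month, _region, _product, amount in rows:
        if y == year:
            out[month] += amount
    return dict(out)

def slice_(rows, year):
    """Slice: fix a single value on one dimension (here, the year)."""
    return [r for r in rows if r[0] == year]

def dice(rows, years, regions, products):
    """Dice: select specific values on several dimensions at once."""
    return [r for r in rows
            if r[0] in years and r[2] in regions and r[3] in products]

def pivot(rows):
    """Pivot: build region-by-year totals, then rotate to year-by-region."""
    by_region = defaultdict(dict)
    for year, _month, region, _product, amount in rows:
        by_region[region][year] = by_region[region].get(year, 0) + amount
    by_year = defaultdict(dict)
    for region, years in by_region.items():
        for year, total in years.items():
            by_year[year][region] = total
    return dict(by_region), dict(by_year)
```

For instance, roll_up(facts) gives one total per year, while drill_down(facts, 2023) splits the 2023 total back into its months.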
Types of OLAP Systems
1. MOLAP (Multidimensional OLAP): Stores data in a multidimensional cube format. It is optimized for fast data retrieval and is highly efficient at performing complex calculations.
2. ROLAP (Relational OLAP): Stores data in relational databases. It generates complex SQL queries to calculate data dynamically when needed.
3. HOLAP (Hybrid OLAP): Combines features of both MOLAP and ROLAP. It can store large volumes of detailed data in a relational database while aggregating data in multidimensional cubes. Example: Microsoft Analysis Services (supports both MOLAP and ROLAP modes).
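The difference between computing aggregates at query time and precomputing them can be sketched with Python's built-in sqlite3 module. This is a toy model only: the sales table, the DIMENSIONS tuple, and the functions rolap_query and build_cube are hypothetical names chosen for the example, and real OLAP servers are far more elaborate.

```python
import sqlite3
from itertools import combinations

# Hypothetical detail rows: (year, region, amount).
rows = [("2024", "East", 10), ("2024", "West", 20), ("2025", "East", 30)]
DIMENSIONS = ("year", "region")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year TEXT, region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

def rolap_query(group_by):
    """Relational style: generate SQL and aggregate when the query runs."""
    for dim in group_by:
        if dim not in DIMENSIONS:  # only known column names reach the SQL text
            raise ValueError(f"unknown dimension: {dim}")
    cols = ", ".join(group_by)
    sql = f"SELECT {cols}, SUM(amount) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()

def build_cube():
    """Multidimensional style: precompute totals for every dimension combination."""
    cube = {}
    for n in range(1, len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, n):
            for *keys, total in rolap_query(dims):
                cube[(dims, tuple(keys))] = total
    return cube

cube = build_cube()
```

Looking up cube[(("year",), ("2024",))] is a plain dictionary access, whereas rolap_query(("year",)) scans the detail table each time. Keeping the detail rows in the relational table while serving aggregates from the cube mirrors the hybrid arrangement described above.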
TYPES OF OLAP SERVERS
1. MOLAP (Multidimensional OLAP): Description: Stores data in multidimensional cubes.
2. ROLAP (Relational OLAP): Description: Stores data in relational databases and performs dynamic SQL queries to calculate needed data.
Advantages: Handles large volumes of data; leverages existing relational database infrastructure.
3. HOLAP (Hybrid OLAP): Description: Combines features of both MOLAP and ROLAP.
Disadvantages: Complexity in implementation; may still have limitations in extremely large data environments.
4. DOLAP (Desktop OLAP): Description: Allows analysis to be performed on a desktop environment, often with data extracted from central
OLAP servers.
Disadvantages: Limited to the processing power and storage of the desktop machine.
5. WOLAP (Web OLAP):
Advantages: Accessibility from anywhere with an internet connection; no need for client software installation.
Disadvantages: Performance can be affected by internet speed; security concerns with web access.
6. Real-Time OLAP:
Advantages: Immediate insights and analysis on current data.
E.F. CODD'S 12 GUIDELINES FOR OLAP (WHAT OLAP SHOULD PROVIDE)
E.F. Codd, a pioneer in the field of relational databases, proposed a set of guidelines for Online Analytical Processing (OLAP)
systems to ensure they provide robust and effective multidimensional data analysis. Here are Codd's 12 guidelines for OLAP:
8. Multi-User Support: Allows multiple concurrent users with robust security.
10. Intuitive Manipulation: Easy data manipulation (slice, dice, drill-down).
DATA MINING INTERFACES
Data mining interfaces serve as the gateway for users to interact with and extract insights from large datasets. Here's a brief overview:
Purpose: Data mining interfaces facilitate the exploration and analysis of vast datasets to uncover patterns, trends, and associations that may
not be immediately apparent through traditional data analysis methods.
Key Features:
Interactive exploration for deeper insights.
Challenges: Data quality, complexity, privacy concerns, interpretability, and scalability.
Backup and Recovery:
Backup and recovery are essential components of data management, ensuring the preservation and availability of data in the event of data loss, corruption, or
system failures.
Backup:
• Types of backups include full backups (complete copies of all data), incremental backups (copies of only the changes since the last
backup of any type), and differential backups (copies of all changes since the last full backup).
• Backup strategies involve determining the frequency of backups, the retention period for backup copies, and the storage locations
for backup data.
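The three backup types differ only in which reference point they compare against. The sketch below models files by their last-modified times instead of touching a real filesystem; the function name files_to_back_up and the sample times are assumptions for illustration.

```python
def files_to_back_up(mtimes, kind, last_full, last_backup):
    """Return the files that a backup of the given kind would copy.

    mtimes:      mapping of file name -> last-modified time (any comparable unit)
    last_full:   time of the most recent full backup
    last_backup: time of the most recent backup of any kind
    """
    if kind == "full":
        return sorted(mtimes)  # everything, regardless of modification time
    if kind == "incremental":
        return sorted(f for f, t in mtimes.items() if t > last_backup)
    if kind == "differential":
        return sorted(f for f, t in mtimes.items() if t > last_full)
    raise ValueError(f"unknown backup kind: {kind}")

# Hypothetical timeline: full backup at t=3, incremental backup at t=7.
mtimes = {"a.csv": 1, "b.csv": 5, "c.csv": 9}
full = files_to_back_up(mtimes, "full", last_full=3, last_backup=7)
incremental = files_to_back_up(mtimes, "incremental", last_full=3, last_backup=7)
differential = files_to_back_up(mtimes, "differential", last_full=3, last_backup=7)
```

Here the incremental backup copies only c.csv, while the differential backup copies both b.csv and c.csv because both changed after the last full backup.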
Recovery:
• The process of restoring data from backups after a data loss event.
• Recovery methods depend on the type of backup and the extent of data loss, ranging from restoring individual files or folders to
entire systems.
• Recovery procedures should be documented and tested regularly to ensure effectiveness in case of emergencies.
Effective backup and recovery strategies are crucial for business continuity, disaster recovery, and compliance with data protection regulations. They help
minimize downtime, mitigate risks, and ensure data integrity and availability in various scenarios, including hardware failures, human errors, cyber attacks,
and natural disasters.
HOW DATA BACKUP AND DATA RECOVERY ARE MANAGED IN A DATA WAREHOUSE
In a data warehouse, data backup and recovery are critical processes to ensure the integrity, availability, and reliability of the
stored information. Here's how they are typically managed:
Data Backup:
2. Full and Incremental: Full backups of the entire database and incremental backups of changes are common.
3. Backup Storage: Copies are stored securely in systems like NAS, tape libraries, or cloud storage.
4. Validation: Backup integrity is ensured through validation checks and periodic restoration tests.
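One common form of validation check is comparing a checksum recorded at backup time with a checksum of the stored copy. The following is a minimal sketch of that idea using Python's hashlib; the dictionary layout and the function names make_backup and is_valid are hypothetical, and real backup tools have their own verification mechanisms.

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

def make_backup(data: bytes) -> dict:
    """Store a copy of the data together with its checksum."""
    return {"payload": bytes(data), "checksum": sha256_of(data)}

def is_valid(backup: dict) -> bool:
    """Validation check: the stored payload must still match its checksum."""
    return sha256_of(backup["payload"]) == backup["checksum"]

original = b"2024-01-15,East,Pen,10\n"
good = make_backup(original)

# Simulate corruption of the stored copy (one byte changed).
corrupted = {"payload": b"2024-01-15,East,Pen,99\n", "checksum": good["checksum"]}
```

A checksum match shows that the copy is intact, but only a periodic restoration test shows that the data can actually be brought back into service.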
Data Recovery:
2. Disaster Recovery Planning: Plans address hardware failures, natural disasters, and cyber attacks.
3. Backup Catalog Management: Maintains records of backup sets for tracking and managing recovery.
4. Automation: Uses automation tools for backup scheduling, monitoring, and recovery procedures.
TUNING IN DATA WAREHOUSE
Tuning in data warehouses refers to the process of optimizing the performance and efficiency of the data warehouse system to
enhance query response times, improve data loading speeds, and maximize overall system throughput.
3. Partitioning: Divide large tables into smaller segments based on criteria like date ranges.
4. Compression Techniques: Reduce storage requirements and improve query performance by compressing data.
5. Data Distribution: Distribute data across multiple nodes to balance query loads.
7. Query Workload Management: Prioritize and manage query workloads efficiently.
8. Data Loading Optimization: Optimize data loading processes to minimize downtime and maximize throughput.
9. Cache Management: Use caching mechanisms to store frequently accessed data for faster retrieval.
10. Monitoring and Tuning: Continuously monitor system performance and apply tuning adjustments as needed.
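Two of the techniques listed above, partitioning by date range and caching, can be sketched in a few lines. This is a simplified in-memory model, not a warehouse feature: the rows, the partitions dictionary, the scanned list, and monthly_total are all assumptions made for the example.

```python
from collections import defaultdict
from functools import lru_cache

# Hypothetical fact rows: (ISO date, amount).
rows = [("2024-01-15", 10), ("2024-01-20", 5), ("2024-02-03", 7), ("2024-03-09", 4)]

# Partitioning: split the table into one segment per month (a date range).
partitions = defaultdict(list)
for day, amount in rows:
    partitions[day[:7]].append((day, amount))

scanned = []  # records which partitions a query actually reads

@lru_cache(maxsize=128)
def monthly_total(month):
    """Read only the requested month's partition; repeated calls hit the cache."""
    scanned.append(month)
    return sum(amount for _day, amount in partitions.get(month, []))

first = monthly_total("2024-01")   # scans one partition, not the whole table
second = monthly_total("2024-01")  # served from the cache, no scan
```

The cache here assumes the partition contents do not change between calls; after a data load, the cached results would have to be cleared, for example with monthly_total.cache_clear().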
TESTING DATA WAREHOUSE
Testing data warehouses ensures the accuracy, reliability, and performance of the stored data and analytical processes. Here's a brief overview:
2. ETL Testing: Verify Extract, Transform, Load processes and data mappings.
3. Data Consistency Testing: Ensure coherence across data sources and dimensions.
8. Metadata Testing: Validate accuracy and completeness of metadata.
9. Backup and Recovery Testing: Test backup procedures and data recoverability.
10. User Acceptance Testing (UAT): Involve end-users to ensure alignment with business needs.
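ETL testing usually comes down to reconciling the loaded data against the source. The sketch below shows three such checks (row counts, a total, and one data mapping) on an in-memory SQLite database; the tables src_orders and dw_orders, the toy transform in run_etl, and the helper scalar are hypothetical and stand in for a real pipeline.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src_orders (id INTEGER, customer TEXT, amount_cents INTEGER);
CREATE TABLE dw_orders  (id INTEGER, customer TEXT, amount REAL);
INSERT INTO src_orders VALUES (1, ' alice ', 1250), (2, 'Bob', 300);
""")

def run_etl(conn):
    """Toy transform: trim and upper-case names, convert cents to currency units."""
    conn.execute("""
        INSERT INTO dw_orders
        SELECT id, UPPER(TRIM(customer)), amount_cents / 100.0 FROM src_orders
    """)

def scalar(conn, sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

run_etl(conn)

# Completeness: every source row arrived in the warehouse table.
src_count = scalar(conn, "SELECT COUNT(*) FROM src_orders")
dw_count = scalar(conn, "SELECT COUNT(*) FROM dw_orders")

# Reconciliation: totals agree after undoing the unit conversion.
src_total_cents = scalar(conn, "SELECT SUM(amount_cents) FROM src_orders")
dw_total_cents = round(scalar(conn, "SELECT SUM(amount) FROM dw_orders") * 100)

# Mapping: the name transformation was applied as specified.
mapped_name = scalar(conn, "SELECT customer FROM dw_orders WHERE id = 1")
```

In a test suite, each of these comparisons would become an assertion that fails the load when source and target disagree.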
APPLICATIONS OF DATA WAREHOUSE
Customer Relationship Management (CRM): Unified customer data for personalized services.
Government and Public Sector: Improved policy planning and citizen services.
Banking Sector: Risk management and customer analytics.
WEB MINING
Web mining is the process of extracting useful information and knowledge from web data. It involves analyzing the content, structure, and
usage patterns of websites to discover valuable insights. There are three main types of web mining: web content mining, web structure mining,
and web usage mining.
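The three types can be illustrated on toy inputs with the standard library: word counts from page text (content), outgoing links (structure), and request counts from an access log (usage). The HTML snippet, the log format, and the class name PageMiner are invented for this sketch; real web mining works on crawled pages and server logs at much larger scale.

```python
from collections import Counter
from html.parser import HTMLParser

class PageMiner(HTMLParser):
    """Collects visible words (content) and outgoing links (structure)."""

    def __init__(self):
        super().__init__()
        self.words = Counter()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Structure mining: record the target of every hyperlink.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Content mining: count the words that appear in the page text.
        self.words.update(word.lower() for word in data.split())

page = '<p>Data mining finds patterns in data</p><a href="/olap">OLAP</a>'
miner = PageMiner()
miner.feed(page)

# Usage mining: count page requests in a hypothetical "user path" access log.
log_lines = ["u1 /home", "u2 /olap", "u1 /olap"]
hits = Counter(line.split()[1] for line in log_lines)
```

After running this, miner.words holds the word frequencies, miner.links the link targets, and hits the number of requests per page.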
STUDY4SUB TEAM
THANK YOU