DWDM202

Data warehousing is the process of collecting, storing, and managing large volumes of data from various sources in a centralized repository. It is designed to facilitate efficient querying and analysis, enabling organizations to make informed decisions based on comprehensive data insights, and it gives businesses access to historical data for better decision-making.

Database System: Primarily designed for transactional processing (OLTP - Online Transaction Processing). It is optimized for fast, efficient real-time operations such as inserting, updating, deleting, and retrieving small sets of data. It uses normalized data structures to eliminate redundancy and ensure data integrity, which keeps transactional operations efficient. It typically stores operational data from a single application or system and processes real-time operational data that constantly changes. Used in applications like banking systems, e-commerce, healthcare records, and CRM, where real-time transactions are essential. Example technologies: MySQL, PostgreSQL, Oracle DB.
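As a rough illustration of this OLTP style of work, the sketch below uses Python's built-in sqlite3 module as a stand-in for a system like MySQL or PostgreSQL; the accounts table and values are invented for the example.

```python
import sqlite3

# In-memory database stands in for an OLTP system such as MySQL or PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)")
conn.execute("INSERT INTO accounts (owner, balance) VALUES (?, ?)", ("Alice", 500.0))
conn.execute("INSERT INTO accounts (owner, balance) VALUES (?, ?)", ("Bob", 300.0))

# A typical transactional operation: move money between two accounts atomically.
with conn:  # commits on success, rolls back if an exception is raised
    conn.execute("UPDATE accounts SET balance = balance - 100 WHERE owner = 'Alice'")
    conn.execute("UPDATE accounts SET balance = balance + 100 WHERE owner = 'Bob'")

print(conn.execute("SELECT owner, balance FROM accounts").fetchall())
```

The point of the sketch is the small, frequent, atomic updates against normalized operational data, which is exactly the workload an OLTP system is tuned for.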

Data Warehouse: Designed for analytical processing (OLAP - Online Analytical Processing). It is optimized for complex queries, reporting, and data analysis over large datasets. It uses denormalized data structures (star schema, snowflake schema) to optimize read-heavy operations and speed up complex queries, and it integrates data from multiple sources (databases, external files, APIs) into a unified repository for analysis. It processes historical data for long-term storage and analysis, often in batch mode. Used in business intelligence, reporting, trend analysis, and predictive analytics, where historical and large-scale data analysis is required. Example technologies: Amazon Redshift, Google BigQuery, Snowflake.
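A minimal sketch of the star-schema idea, assuming pandas is available (table names and values are invented): a central fact table of sales measures references small dimension tables, and analytical queries join and aggregate across them.

```python
import pandas as pd

# Dimension tables (descriptive attributes).
dim_product = pd.DataFrame({"product_id": [1, 2], "category": ["Bread", "Butter"]})
dim_date = pd.DataFrame({"date_id": [10, 11], "quarter": ["Q1", "Q2"]})

# Fact table (measures plus foreign keys to the dimensions).
fact_sales = pd.DataFrame({
    "product_id": [1, 1, 2, 2],
    "date_id":    [10, 11, 10, 11],
    "revenue":    [120.0, 90.0, 60.0, 75.0],
})

# An OLAP-style query: total revenue per product category per quarter.
report = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_date, on="date_id")
          .groupby(["category", "quarter"])["revenue"].sum())
print(report)
```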

Components of data warehousing- 1) Data sources: Data sources provide raw data to the warehouse. These can include operational databases (OLTP systems), external data sources, spreadsheets, cloud services, and log files. The data may arrive in different formats, such as SQL databases, APIs, or flat files, and usually requires transformation before storage. 2) ETL (Extract, Transform, Load) process: The ETL process is responsible for extracting data from various sources, transforming it into a suitable format, and loading it into the data warehouse. The transformation step includes cleaning, aggregating, and normalizing data to ensure consistency and accuracy. 3) Data staging area: A temporary storage area where raw data is held before transformation. It allows for data validation, cleansing, and preprocessing so that erroneous or redundant data is not loaded into the warehouse (a small validation sketch follows this paragraph). 4) Business Intelligence (BI) tools: BI tools provide a user-friendly interface for querying, reporting, data visualization, and dashboard creation. They help business users generate insights from data warehouse queries without needing technical expertise.
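The staging-area step mentioned above can be pictured with a short sketch, assuming pandas; the field names and validation rules are hypothetical, chosen only to show the idea of checking raw extracts before loading.

```python
import pandas as pd

# Raw extract held in the staging area before transformation (invented records).
staging = pd.DataFrame({
    "customer_id": [101, 102, 102, None],
    "amount": [250.0, -40.0, -40.0, 75.0],
})

# Validation in the staging area: reject rows with missing keys,
# duplicate records, or implausible values before loading into the warehouse.
valid = (staging
         .dropna(subset=["customer_id"])      # missing key -> reject
         .drop_duplicates()                   # redundant row -> reject
         .query("amount >= 0"))               # erroneous value -> reject

print(f"{len(valid)} of {len(staging)} staged rows passed validation")
```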
NEED- 1) Business users: Business users require a data warehouse to view summarized data from the past. Since these users are typically non-technical, the data should be presented to them in a simple, summarized form. 2) Store historical data: A data warehouse is required to store time-variant data from the past, which is then used for many different analyses. 3) Make strategic decisions: Many strategic decisions depend on the data held in the warehouse, so the data warehouse contributes directly to strategic decision-making. 4) Data consistency and quality: By bringing data from different sources to a common place, an organization can enforce uniformity and consistency in its data. 5) Fast response: A data warehouse has to be ready for somewhat unexpected loads and types of queries, which demands a significant degree of flexibility and quick response times.

Data gathering- Interviews: Conduct one-on-one or group discussions with stakeholders; helps gain in-depth insight into user needs. Surveys and questionnaires: Used to collect structured responses from a large audience; effective for identifying common requirements. Workshops: Facilitate real-time collaboration among stakeholders; useful for brainstorming and resolving conflicting needs. Observation (job shadowing): Watching users perform tasks in their actual work environment; helps uncover unstated or implicit requirements. Prototyping: Developing mockups or wireframes to visualize requirements; helps stakeholders validate expectations early. Document analysis: Reviewing existing documents, reports, or system artifacts; useful for legacy system upgrades or process automation.
Characteristics- Subject-Oriented: Data warehouses are organized around key subjects or
business areas (e.g., sales, finance, customer data) rather than individual transactions.
Integrated: Data from different sources (such as databases, CRM systems, ERP systems, etc.)
is cleaned, transformed, and integrated into a consistent format, allowing for
comprehensive analysis. Time-Variant: Data warehouses store historical data, enabling
users to analyze trends over time. This is crucial for forecasting and decision-making. Non-
Volatile: Once data is entered into a data warehouse, it is not typically changed or deleted.
This stability allows for consistent reporting and analysis.

Data Mart: A subset of a data warehouse focused on a specific business unit, department,
or function (e.g., sales, marketing, finance). Usually derived from a single source or a subset
of the data warehouse. Uses a bottom-up approach—individual data marts are built first,
and they may later contribute to a data warehouse. Data marts focus on a specific subject
area, making it easier for users to access and analyze relevant data without having to
navigate the entire data warehouse. Processes more straightforward queries related to
departmental needs, e.g., a sales data mart storing only sales-related data for analysis. Features: faster access, subject-oriented, improved performance.
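As a tiny illustration of carving a subject-oriented mart out of the wider warehouse, the sketch below assumes pandas; the table and column names are invented.

```python
import pandas as pd

# A warehouse-wide fact table covering several departments (invented data).
warehouse = pd.DataFrame({
    "department": ["sales", "finance", "sales", "marketing"],
    "amount": [120.0, 300.0, 80.0, 45.0],
})

# The sales data mart keeps only the subject area that sales analysts need.
sales_mart = warehouse[warehouse["department"] == "sales"].copy()
print(sales_mart)
```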
Bottom Tier (Data Source Layer) – Data Extraction and Storage: The bottom tier of the architecture is the data warehouse database server, typically a relational database system. Back-end tools and utilities feed data into this tier and perform the extract, clean, load, and refresh functions. Examples: relational databases (e.g., Oracle, SQL Server), data warehouses (e.g., Amazon Redshift, Google BigQuery).
Middle Tier (Data Processing Layer) – OLAP and Business Logic: The middle tier holds the OLAP server, which can be implemented in either of two ways: Relational OLAP (ROLAP), an extended relational database management system that maps operations on multidimensional data to standard relational operations, or Multidimensional OLAP (MOLAP), which directly implements multidimensional data and operations. Examples: application servers, ETL tools (e.g., Apache NiFi, Talend).
Top Tier (Presentation Layer) – Reporting and Visualization: This tier is the front-end client layer. It holds the query, reporting, analysis, and data mining tools (e.g., Tableau, Power BI, Looker).
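To make the middle tier concrete, the sketch below (pandas assumed; data invented) shows the kind of multidimensional roll-up an OLAP server performs. In ROLAP this would be translated into relational GROUP BY queries; in MOLAP it would run against a pre-built cube.

```python
import pandas as pd

# Detail-level facts with two dimensions (region, quarter) and one measure.
facts = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales":   [100, 150, 80, 120],
})

# Roll-up across both dimensions, analogous to a small OLAP cube view.
cube = facts.pivot_table(index="region", columns="quarter",
                         values="sales", aggfunc="sum", margins=True)
print(cube)
```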

Metadata is data about the data, or documentation about the information, that is required by the users. In data warehousing, metadata is one of the essential aspects. Metadata includes the following: • the location and descriptions of warehouse systems and components; • names, definitions, structures, and content of data warehouse and end-user views; • identification of authoritative data sources; • integration and transformation rules used to populate the data. It is used for building, maintaining, managing, and using the data warehouse, and it allows users to understand the content and find the data they need. Necessary- First, it acts as the glue that links all parts of the data warehouse. Next, it provides information about the contents and structures to the developers. Finally, it opens the doors to the end-users and makes the contents recognizable in their terms. TYPES- • Operational metadata: contains all of the information about the operational data sources. • Extraction and transformation metadata: contains information about all the data transformations that take place in the data staging area. • End-user metadata: allows end-users to use their business terminology and look for information in the ways in which they usually think of the business.
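One way to picture transformation metadata is as a small structured record; the dictionary below is a hypothetical sketch, not a standard format, and every field name and value is invented for illustration.

```python
# Hypothetical extraction-and-transformation metadata for one warehouse column.
column_metadata = {
    "target_table": "fact_sales",
    "target_column": "revenue_usd",
    "source_system": "orders_oltp",            # authoritative data source
    "source_field": "order_total",
    "transformation_rule": "convert EUR to USD, round to 2 decimals",
    "refresh_schedule": "daily batch at 02:00",
    "business_definition": "Gross revenue per order line in US dollars",
}

for key, value in column_metadata.items():
    print(f"{key}: {value}")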

ETL stands for Extract, Transform, Load, and it is a crucial process in data warehousing and data integration. ETL is used to collect data from various sources, transform it into a suitable format, and load it into a target database or data warehouse, which is essential for organizations that need to consolidate data from multiple sources for analysis and reporting. Extract: The extraction phase involves retrieving data from various source systems; these sources can be databases, flat files, APIs, cloud services, or even web scraping. Transform: The transformation phase involves cleaning, enriching, and converting the extracted data into a format suitable for analysis (data cleaning, enrichment, formatting); this step is crucial for ensuring data quality and consistency. Load: The loading phase involves writing the transformed data into the target database or data warehouse, where the data becomes available for analysis and reporting.
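A compact end-to-end sketch of the three phases follows, assuming pandas is available and using a local SQLite file as a stand-in target warehouse; the file name, columns, and cleaning rules are illustrative.

```python
import sqlite3
import pandas as pd

# Extract: pull raw records from a source (a CSV, API, or OLTP database in practice).
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount":   ["100", "250", "250", None],
    "country":  ["in", "IN", "IN", "us"],
})

# Transform: deduplicate, drop incomplete rows, standardize types and codes.
clean = (raw
         .drop_duplicates()
         .dropna(subset=["amount"])
         .assign(amount=lambda d: d["amount"].astype(float),
                 country=lambda d: d["country"].str.upper()))

# Load: write the transformed data into the target warehouse table.
warehouse = sqlite3.connect("warehouse.db")   # illustrative target
clean.to_sql("fact_orders", warehouse, if_exists="replace", index=False)
print(pd.read_sql("SELECT * FROM fact_orders", warehouse))
```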
Data mining is the process of extracting meaningful patterns, trends, and insights from large
datasets using computational techniques. It transforms raw data into valuable insights by
leveraging algorithms, tools, and domain expertise. It works across structured,
unstructured, and semi-structured data to uncover patterns like associations, classifications,
and anomalies. Data mining automates the discovery of non-obvious patterns that are too
complex for manual detection. For example, retailers use data mining to identify purchasing
trends, while healthcare providers use it to predict disease outbreaks.

Techniques- Classification patterns involve predicting categorical labels based on input features. This process is used in supervised learning, where a model is trained on a labeled dataset. Common algorithms: Decision Trees, Naïve Bayes, Support Vector Machines (SVM). Clustering patterns group similar data points into clusters based on their characteristics. Unlike classification, clustering is an unsupervised learning technique, meaning it does not rely on labeled data. Algorithms: K-Means, Hierarchical Clustering. Regression patterns involve predicting continuous numerical values based on input features. This type of analysis is used to model relationships between variables and forecast future outcomes, for example, predicting house prices based on features like size and location. Algorithms: Linear Regression (Logistic Regression, despite its name, is used for classification). Anomaly Detection (Outlier Detection) identifies unusual patterns or deviations in data, e.g., fraud detection in banking. Algorithms: Isolation Forest, Local Outlier Factor (LOF).
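A brief sketch of two of these techniques with scikit-learn (assumed to be installed), using its bundled Iris dataset so the example runs as-is: a decision tree for supervised classification and K-Means for unsupervised clustering.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Classification (supervised): learn labels from a labeled training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Clustering (unsupervised): group the same points without using the labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(3)])
```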

Association Rules: Association patterns identify relationships between variables in large datasets. They are commonly used in market basket analysis to find items that frequently
co-occur in transactions. For example, if customers who buy bread often also buy butter,
this relationship can be expressed as an association rule: "If bread is purchased, then butter
is likely to be purchased." The most famous algorithm for mining association rules is the
Apriori algorithm, which uses support and confidence metrics to evaluate the strength of
the rules. Concepts: Support: Measures how frequently an itemset appears in the dataset.
Confidence: Measures the likelihood that item B is bought when item A is bought. Lift:
Measures how much more likely A and B occur together compared to random chance.
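The three metrics can be computed directly from transaction counts; the small sketch below uses made-up transactions for the bread-and-butter example.

```python
# Made-up market-basket transactions for the bread/butter example.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "jam"},
    {"milk"},
    {"butter", "milk"},
]
n = len(transactions)

count_bread = sum("bread" in t for t in transactions)
count_butter = sum("butter" in t for t in transactions)
count_both = sum({"bread", "butter"} <= t for t in transactions)

support = count_both / n                   # P(bread and butter)
confidence = count_both / count_bread      # P(butter | bread)
lift = confidence / (count_butter / n)     # confidence / P(butter)

print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```

A lift above 1 means bread and butter appear together more often than chance would suggest, which is what makes the rule interesting.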
Data preprocessing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the
specific data mining task. For example, a dataset with customer records might have missing ages, duplicate entries, or mismatched units (e.g., weight in pounds vs. kilograms); preprocessing resolves these issues to ensure accurate analysis. Data Cleaning: This involves identifying
and correcting errors or inconsistencies in the data, such as missing values, outliers, and
duplicates. Various techniques can be used for data cleaning, such as imputation, removal,
and transformation. Handling Missing Values – Removing or filling missing data using
methods like mean, median, or interpolation. Removing Duplicates – Eliminating redundant
records to ensure data integrity. Data Integration: This involves combining data from
multiple sources to create a unified dataset. Data integration can be challenging as it
requires handling data with different formats, structures, and semantics. Techniques such
as record linkage and data fusion can be used for data integration. Combining Data from
Multiple Sources – Merging data from different databases, files, or APIs. Handling Schema
Conflicts – Resolving differences in naming conventions, data types, or units.
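A short pandas sketch of the cleaning and integration steps just described; the column names, values, and the mean-imputation choice are illustrative.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "age": [34, None, None, 29],
})

# Data cleaning: fill missing ages with the mean and drop duplicate records.
customers["age"] = customers["age"].fillna(customers["age"].mean())
customers = customers.drop_duplicates()

# Data integration: merge with a second source, resolving a naming conflict.
purchases = pd.DataFrame({"cust_id": [1, 2, 3], "total_spent": [120.0, 80.0, 45.0]})
purchases = purchases.rename(columns={"cust_id": "customer_id"})  # schema conflict fix
combined = customers.merge(purchases, on="customer_id", how="left")
print(combined)
```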

Data Reduction: This involves reducing the size of the dataset while preserving the
important information. It can be achieved through techniques such as feature selection and
feature extraction. Feature Selection – Identifying and keeping only the most relevant
features. Sampling – Selecting a representative subset of data to reduce computational
complexity. Data Transformation: This involves converting the data into a suitable format
for analysis. Common techniques used in data transformation include normalization,
standardization, and discretization. Normalization – Scaling numerical data to a common
range (e.g., 0-1) to prevent bias. Standardization – Transforming data to have zero mean and
unit variance. Data discretization is the process of converting continuous data into discrete
categories or intervals. This technique is particularly useful in data mining and machine learning, as
many algorithms perform better with categorical data. Discretization helps simplify the analysis,
improve interpretability, and reduce the impact of noise in the data. Converting Continuous Data to
Categorical – Grouping numerical values into bins (e.g., age groups: 0–18, 19–35, 36–60).
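The transformation and discretization techniques above can be sketched in a few lines, assuming pandas and scikit-learn are available; the values are illustrative and the bins mirror the age groups mentioned in the text.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = pd.DataFrame({"age": [5, 17, 25, 40, 58, 72]})

# Normalization: rescale to the 0-1 range.
ages["age_norm"] = MinMaxScaler().fit_transform(ages[["age"]]).ravel()

# Standardization: zero mean and unit variance.
ages["age_std"] = StandardScaler().fit_transform(ages[["age"]]).ravel()

# Discretization: convert the continuous ages into categorical bins.
ages["age_group"] = pd.cut(ages["age"], bins=[0, 18, 35, 60, 120],
                           labels=["0-18", "19-35", "36-60", "60+"])
print(ages)
```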
Issues- Privacy and Ethics: Data mining often involves analyzing sensitive information, such
as personal health records, financial data, or user behavior. This raises significant privacy
and ethical concerns. Scalability: As the volume of data generated continues to grow
exponentially, data mining faces scalability challenges. Processing massive datasets, such as
those generated by social media platforms, IoT devices, or e-commerce transactions,
requires substantial computational resources. Scalability and performance concerns include handling large datasets (processing petabytes of data requires high computational power) and real-time data mining (extracting insights from streaming data efficiently). Overfitting: Overfitting occurs when a model learns the training data too well, capturing noise and outliers rather than the underlying patterns. As a result, while the model may perform exceptionally well on the training dataset, it fails to generalize to new, unseen data, leading to poor performance in real-world applications (see the sketch after this paragraph). Interpretability: As data mining models become more complex, particularly with the rise of deep learning, interpretability becomes a significant challenge.
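Overfitting is easy to demonstrate. The sketch below (scikit-learn assumed; the synthetic dataset parameters are arbitrary) compares an unconstrained decision tree with a depth-limited one on noisy data: the deep tree scores near-perfectly on the training set but worse on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data, so memorizing the training set does not generalize.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("unconstrained tree", DecisionTreeClassifier(random_state=0)),
                    ("depth-limited tree", DecisionTreeClassifier(max_depth=3, random_state=0))]:
    model.fit(X_train, y_train)
    print(f"{name}: train={model.score(X_train, y_train):.2f}, "
          f"test={model.score(X_test, y_test):.2f}")
```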

Data Mining applications- Market Basket Analysis: Helps businesses understand which products are frequently bought together. Ex: A supermarket finds that customers who buy diapers also buy baby wipes. Disease Prediction and Diagnosis: Machine learning models analyze patient data to predict diseases like cancer or diabetes. Ex: AI-assisted radiology for detecting tumors in X-rays. Risk Management: Predicts the likelihood of loan defaults or stock market crashes. Ex: Banks use credit scoring models to assess loan applicants. Social Media and Internet (Sentiment Analysis): Analyzes user opinions from social media posts, reviews, and comments. Ex: Twitter sentiment analysis to gauge public opinion on political issues. Transportation: A diversified transportation company with a large direct sales force can apply data mining to identify the best prospects for its services, and a large consumer merchandise organization can apply data mining to improve its sales process to retailers.
