DWM Exp1
Theory:
A] Data Warehouse
i. Definition
A data warehouse is an informational environment that provides an integrated and total view of
the enterprise, making the enterprise's current and historical information easily available for
decision making. It makes decision-support transactions possible without hindering operational
systems, presents a flexible and interactive source of strategic information, and renders the
organization's information consistent.
ii. Need of Data Warehouse
Today's industries need to run simple queries and reports against current and historical data.
They need the ability to analyze the data in many different ways: to query, step back, analyze,
and then continue the process to any desired depth in order to understand the industry's progress
over time.
They need to spot historical trends and apply them to future results for the development of the
industry.
iii. Defining features of Data Warehouse
a) Subject Oriented
Operational systems store data by specific applications (like order processing or banking services), each
with tailored datasets for their functions. In contrast, data warehouses organize data by "business
subjects" critical to the enterprise (like sales, inventory, or customer accounts), enabling
comprehensive analysis and strategic decision-making across different areas of the business.
viii. Metadata
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database
management system.
Types of Metadata:
a) Operational Metadata:
Data for the data warehouse comes from diverse operational systems with varying data structures,
field lengths, and data types. Integration involves splitting, combining, and managing multiple coding
schemes. Operational metadata tracks these details to link delivered information back to the original
source data sets.
b) Extraction and Transformation Metadata:
Extraction and transformation metadata contain data about the extraction of data from the source
systems, namely, the extraction frequencies, extraction methods, and business rules for the data
extraction. Also, this category of metadata contains information about all the data transformations
that take place in the data staging area.
c) End-User Metadata:
The end-user metadata is the navigational map of the data warehouse. It enables the end-users to find
information from the data warehouse. The end-user metadata allows end-users to use their own
business terminology and to look for information in the ways in which they normally think of the
business.
B] Data Mining
i. Definition
Data mining is often used as a synonym for another popular term, knowledge discovery from data
(KDD), although it can also be viewed as an essential step in the knowledge discovery process.
Other terms with a similar meaning include knowledge mining from data, knowledge extraction,
data/pattern analysis, data archaeology, and data dredging.
ii. Steps in Data Mining
Commercial tools can assist in the data transformation step. Data migration tools allow simple
transformations to be specified, such as replacing the string "gender" by "sex." ETL
(extraction/transformation/loading) tools allow users to specify transforms through a graphical user
interface (GUI). These tools typically support only a restricted set of transforms, so we may often
also choose to write custom scripts for this step of the data cleaning process.
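As a minimal sketch of such a custom script, assuming a pandas DataFrame loaded from a
hypothetical source file (the file and column names are illustrative, not from any specific system):

import pandas as pd

# Hypothetical operational extract; file name and columns are illustrative.
df = pd.read_csv("customers_raw.csv")

# Simple transformation: rename the "gender" column to "sex"
# and standardize its coded values before loading.
df = df.rename(columns={"gender": "sex"})
df["sex"] = df["sex"].str.strip().str.upper().map({"M": "male", "F": "female"})

df.to_csv("customers_clean.csv", index=False)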
3. Data selection (where data relevant to the analysis task are retrieved from the database):
When selecting data for data mining, the goal is to choose data that will effectively support the
objectives of your analysis.
1. Based on Objectives
Predictive Modeling: For tasks such as forecasting or classification, select data that includes
both input features (independent variables) and target variables (dependent variables). For
instance, if predicting customer churn, you might include customer demographics,
transaction history, and previous interaction data.
Descriptive Analytics: To understand historical patterns and relationships, select
comprehensive datasets that capture relevant attributes. For example, to analyze customer
behavior, you might need purchase history, website interactions, and customer feedback.
Anomaly Detection: For identifying outliers or anomalies, focus on data that captures normal
historical behavior patterns. Ensure you have a good representation of typical cases so that
deviations can be spotted effectively.
Clustering: Choose data that includes features relevant to the segmentation or grouping you
aim to achieve. For instance, in market segmentation, include demographic, behavioral, and
transactional data.
2. Data Granularity
Detail Level: Depending on the analysis, select data with the appropriate level of detail.
For high-level trends, aggregate data might suffice, while detailed, granular data is
necessary for in-depth analysis.
Aggregation: Sometimes, data needs to be aggregated to provide a summary or higher-
level view. For example, daily sales data might be aggregated into monthly or yearly totals
for trend analysis, as shown in the sketch after this list.
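As a minimal sketch of the aggregation point above, assuming daily sales records in a pandas
DataFrame (the column names and values are illustrative):

import pandas as pd

# Hypothetical daily sales data; values are illustrative.
daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "sales": range(90),
})

# Aggregate daily sales into monthly totals for trend analysis.
monthly = daily.set_index("date").resample("MS")["sales"].sum()
print(monthly)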
4. Data transformation (where data are transformed and consolidated into forms appropriate for mining
by performing summary or aggregation operations):
In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Strategies for data transformation include the following:
1. Smoothing, which works to remove noise from the data. Techniques include binning, regression, and
clustering.
2. Attribute construction (or feature construction), where new attributes are constructed and added
from the given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data. For example, the
daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is
typically used in constructing a data cube for data analysis at multiple abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as −1.0
to 1.0, or 0.0 to 1.0 (see the sketch after this list).
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels
(e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be
recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric
attribute. More than one concept hierarchy can be defined for the same attribute to accommodate the
needs of various users.
6. Concept hierarchy generation for nominal data, where attributes such as street can be generalized
to higher-level concepts, like city or country. Many hierarchies for nominal attributes are implicit within
the database schema and can be automatically defined at the schema definition level.
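As a minimal sketch of two of these strategies, min-max normalization and discretization of a
numeric attribute such as age (the values and bin labels are illustrative):

import pandas as pd

ages = pd.Series([5, 17, 23, 34, 45, 52, 61, 70])  # illustrative values

# Normalization: min-max scaling into the range 0.0 to 1.0.
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Discretization: replace raw ages with conceptual interval labels.
labels = pd.cut(ages, bins=[0, 18, 60, 120],
                labels=["youth", "adult", "senior"])

print(pd.DataFrame({"age": ages, "normalized": normalized, "label": labels}))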
5. Data mining (an essential process where intelligent methods are applied to extract data patterns):
Data mining supports knowledge discovery by finding hidden patterns and associations, constructing
analytical models, performing classification and prediction, and presenting the mining results using
visualization tools.
Information processing, based on queries, can find useful information. However, answers to such
queries reflect the information directly stored in databases or computable by aggregate functions. They
do not reflect sophisticated patterns or regularities buried in the database. Therefore, information
processing is not data mining.
Online analytical processing comes a step closer to data mining because it can derive information
summarized at multiple granularities from user-specified subsets of a data warehouse.
The functionalities of OLAP and data mining can be viewed as disjoint: OLAP is a data
summarization/aggregation tool that helps simplify data analysis, while data mining allows the
automated discovery of implicit patterns and interesting knowledge hidden in large amounts of data.
OLAP tools are targeted toward simplifying and supporting interactive data analysis, whereas the goal
of data mining tools is to automate as much of the process as possible, while still allowing users to
guide the process. In this sense, data mining goes one step beyond traditional online analytical
processing.
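To make the idea of summarization at multiple granularities concrete, here is a minimal OLAP-style
roll-up sketch in pandas (the fact table is illustrative):

import pandas as pd

# Illustrative sales facts with two dimensions: region and quarter.
sales = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "amount": [100, 150, 200, 250],
})

# The same measure summarized at three granularities.
print(sales.groupby(["region", "quarter"])["amount"].sum())  # finest grain
print(sales.groupby("region")["amount"].sum())               # rolled up by region
print(sales["amount"].sum())                                 # grand total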
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on
interestingness measures):
The use of only support and confidence measures to mine associations may generate a large
number of rules, many of which can be uninteresting to users. Instead, we can augment the
support–confidence framework with a pattern interestingness measure, which helps focus the
mining toward rules with strong pattern relationships. The added measure substantially reduces
the number of rules generated and leads to the discovery of more meaningful rules. Besides those
introduced in this section, many other interestingness measures have been studied in the
literature. Unfortunately, most of them do not have the null-invariance property. Because large
data sets typically have many null-transactions, it is important to consider the null-invariance
property when selecting appropriate interestingness measures for pattern evaluation. Four
null-invariant measures commonly used for this purpose are all_confidence, max_confidence,
Kulczynski (Kulc), and cosine.
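As a minimal sketch, here is how Kulc and cosine can be computed for a rule A => B from raw
support counts, using the standard definitions Kulc(A, B) = (P(A|B) + P(B|A)) / 2 and
cosine(A, B) = P(A and B) / sqrt(P(A) * P(B)) (the counts are illustrative):

from math import sqrt

# Illustrative transaction counts for itemsets A and B.
n_total = 10_000   # all transactions
n_a = 1_000        # transactions containing A
n_b = 1_200        # transactions containing B
n_ab = 400         # transactions containing both A and B

p_a, p_b, p_ab = n_a / n_total, n_b / n_total, n_ab / n_total

# Kulczynski: average of the two conditional probabilities.
kulc = 0.5 * (p_ab / p_a + p_ab / p_b)

# Cosine: like Kulc, it does not depend on the number of
# null-transactions, which makes it null-invariant.
cosine = p_ab / sqrt(p_a * p_b)

print(f"Kulc = {kulc:.3f}, cosine = {cosine:.3f}")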
Figure 1.7: Data mining as a step in the process of knowledge discovery.
v. Technologies Used
a) Statistics:
Statistics studies the collection, analysis, interpretation or explanation, and presentation of data.
Data mining has an inherent connection with statistics. A statistical model is a set of mathematical
functions that describe the behavior of the objects in a target class in terms of random variables
and their associated probability distributions.
b) Machine Learning:
Machine learning investigates how computers can learn (or improve their performance) based on data.
A main research area is for computer programs to automatically learn to recognize complex patterns
and make intelligent decisions based on data. Types of Learning include:
1. Supervised Learning
2. Unsupervised Learning
3. Active Learning
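As a minimal sketch contrasting the first two types, assuming scikit-learn is available (the toy
data and feature names are illustrative):

from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Toy feature matrix: [annual_income_k, purchases_per_month].
X = [[30, 2], [35, 3], [80, 10], [90, 12], [32, 2], [85, 11]]
y = [0, 0, 1, 1, 0, 1]  # class labels are known -> supervised learning

# Supervised: learn a mapping from features to the given labels.
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[40, 4]]))

# Unsupervised: no labels; discover grouping structure in X alone.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)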
c) Database Systems and warehouses:
Database systems research focuses on creating, maintaining, and using databases with principles in
data models, query languages, optimization, storage, indexing, and scalability. Data mining benefits
from scalable database technologies for efficient handling of large datasets, supporting advanced data
analysis needs. Modern database systems integrate data warehousing and data mining capabilities,
using multidimensional data cubes for OLAP and multidimensional data mining.
d) Information Retrieval:
Information retrieval (IR) searches unstructured text or multimedia data. Unlike database systems,
IR typically relies on keyword-based queries and probabilistic models, such as the bag-of-words
model and topic modeling, to analyze documents and data. Integrating IR with data mining aids
effective search and analysis amid growing volumes of online data.
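As a minimal sketch of the bag-of-words model mentioned above, each document is reduced to an
unordered collection of term counts (the documents are illustrative):

from collections import Counter

docs = [
    "data mining finds patterns in data",
    "information retrieval searches text data",
]

# Bag of words: ignore word order, keep only term frequencies per document.
bags = [Counter(doc.split()) for doc in docs]
for bag in bags:
    print(bag)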
Conclusion:
The need for data warehousing was studied.
Features of data warehousing and mining, such as subject orientation and integrated data, were studied.
The difference between a data warehouse and a data mart was studied.
The architecture of a data warehouse and its components were studied.
The strategic information provided by a data warehouse was examined.
Applications of data warehousing in banking, education, and other fields were studied.
A practical approach to designing a data warehouse was studied, one that takes advantage of the
top-down and bottom-up approaches while eliminating their limitations.
Data mining is used for knowledge generation.