DWM Exp1

Aim: To study Data Warehouse and Data Mining concepts and their applications.

Theory:
A] Data Warehouse
i. Definition
The data warehouse is an informational environment that provides an integrated and total view of
the enterprise. It makes the enterprise's current and historical information easily available for
decision making, allows decision-support transactions without hindering operational systems,
presents a flexible and interactive source of strategic information, and renders the organization's
information consistent.
ii. Need of Data Warehouse
 Running simple queries and reports against current and historical data is needed by today's
industries.
 The ability to analyze the data in many different ways: to query, step back, analyze, and then
continue the process to any desired length, in order to understand the industry's progress from
time to time.
 The need to spot historical trends and apply them to future results for the development of the
industry.
iii. Defining features of Data Warehouse
a) Subject Oriented
Operational systems store data by specific applications (like order processing or banking services), each
with tailored datasets for their functions. In contrast, data warehouses organize data by "business
subjects" critical to the enterprise (like sales, inventory, or customer accounts), enabling
comprehensive analysis and strategic decision-making across different areas of the business.

Figure 1.1: The data warehouse is subject oriented.


b) Integrated Data
The data warehouse integrates data from multiple operational systems, which may vary in databases,
file formats, character codes, and field naming conventions. Additionally, external data sources like
Metro Mail, A. C. Nielsen, and IRI provide crucial information that must be incorporated. This results
in a diverse mix of source data for the data warehouse.

Figure 1.2: The data warehouse is integrated.


c) Time-variant Data
Operational systems store current data to support daily activities, like the current balance in
accounts receivable or the status of an order. Although some past transactions are stored, these
systems mainly reflect current information. In contrast, a data warehouse is designed for analysis and
decision-making, requiring historical data. Users need to analyze past and present data, such as
customer purchase patterns or sales trends over time. Therefore, data warehouses store historical
snapshots, integrating data from various operational systems and removing inconsistencies. This
historical aspect is crucial for designing and implementing a data warehouse.
The time-variant nature of the data in a data warehouse:
1) Allows for analysis of the past
2) Relates information to the present

Figure 1.3: The data warehouse is time variant.


d) Nonvolatile Data
Data from operational systems and external sources is transformed, integrated, and stored in the data
warehouse for analysis, not for daily operations. For example, you wouldn't check the warehouse for
current stock status when processing orders. Instead, the warehouse stores historical snapshots. Data
is moved into the warehouse at specific intervals, such as daily, weekly, or monthly, depending on
business needs. These movements are scheduled based on user requirements, with different data sets
updated at varying frequencies.

Figure 1.4: The data warehouse is nonvolatile.


e) Granularity
In operational systems, data is stored at the lowest level of detail, such as individual sales at a grocery
store. Summary data is calculated by aggregating individual transactions as needed. In contrast, data
warehouse queries typically start with summary data and drill down to detailed levels. To facilitate
this, data warehouses store data summarized at different levels, known as data granularity. Lower
granularity means finer detail, which requires more storage. Deciding on granularity levels depends on
the data types and expected query performance.

Figure 1.5: The data warehouse is granular.
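
As a minimal sketch of data granularity (table and column names here are purely illustrative, not from the text), the following Python snippet keeps the same sales data at three levels of detail, from individual line items up to monthly summaries:

    import pandas as pd

    # Illustrative line-item sales: the lowest (finest) level of granularity.
    sales = pd.DataFrame({
        "date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-02-03"]),
        "store": ["S1", "S1", "S2"],
        "product": ["milk", "bread", "milk"],
        "amount": [2.50, 1.75, 2.50],
    })

    # Coarser granularity: one summary row per store per day.
    daily = sales.groupby(["store", "date"], as_index=False)["amount"].sum()

    # Coarsest granularity: one summary row per store per month.
    monthly = (sales.assign(month=sales["date"].dt.to_period("M"))
                    .groupby(["store", "month"], as_index=False)["amount"].sum())

The finer the granularity stored, the more storage is required, but queries can then drill down from the monthly summary all the way to individual transactions.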


iv. Data Warehouse and Data Mart

Sr no | Data Warehouse                        | Data Mart
1     | Corporate/enterprise-wide             | Departmental
2     | Union of all data marts               | A single business process
3     | Data received from staging area       | Star-join (facts & dimensions)
4     | Queries on presentation resource      | Technology optimal for data access and analysis
5     | Structure for corporate view of data  | Structure to suit the departmental view of data

v. Top-Down vs Bottom-up approach

Top-Down approach

Advantages:
a) A truly corporate effort, an enterprise view of data
b) Inherently architected, not a union of disparate data marts
c) Single, central storage of data about the content
d) Centralized rules and control
e) May see quick results if implemented with iterations

Disadvantages:
a) Takes longer to build, even with an iterative method
b) High exposure/risk to failure
c) Needs a high level of cross-functional skills
d) High outlay without proof of concept

Bottom-up approach

Advantages:
a) Faster and easier implementation of manageable pieces
b) Favorable return on investment and proof of concept
c) Less risk of failure
d) Inherently incremental; can schedule important data marts first
e) Allows the project team to learn and grow

Disadvantages:
a) Each data mart has its own narrow view of data
b) Permeates redundant data in every data mart
c) Perpetuates inconsistent and irreconcilable data
d) Proliferates unmanageable interfaces

vi. Practical Approach


The following steps are followed in the practical approach:
 Plan and define requirements at the overall corporate level
 Create a surrounding architecture for a complete warehouse
 Conform and standardize the data content
 Implement the data warehouse as a series of supermarts, one at a time
vii. Architecture

Figure 1.6: Architecture of Data Warehouse.

 Source Data Component


Source data coming into the data warehouses may be grouped into four broad categories:
Production Data: This type of data comes from the various operational systems of the enterprise. Based
on the data requirements in the data warehouse, we choose segments of the data from the various
operational systems.
Internal Data: In each organization, the client keeps their "private" spreadsheets, reports, customer
profiles, and sometimes even department databases. This is the internal data, part of which could be
useful in a data warehouse.
Archived Data: Operational systems are mainly intended to run the current business. In every
operational system, we periodically take the old data and store it in archived files.
External Data: Most executives depend on information from external sources for a large percentage of
the information they use. They use statistics relating to their industry produced by external
agencies.
 Data Staging Component
After extracting data from various operational and external sources, it needs to be prepared for storage
in the data warehouse. This involves three key functions: extraction, transformation, and loading (ETL).
These processes occur in a staging area, which acts as a workbench for cleaning, changing, combining,
converting, deduplicating, and preparing the data. A separate staging area is necessary because data
in a warehouse is subject-oriented and integrates data from multiple sources, unlike operational
systems that handle data from a single source. This separation ensures that the data is properly
prepared before being stored for querying and analysis in the data warehouse.
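
A toy Python sketch of the three staging functions appears below; all source records, field names, and code mappings are invented for illustration, not taken from any real system:

    # Toy ETL sketch: extract from two dissimilar sources, transform to a
    # common format in the "staging area", then load into the warehouse store.
    source_a = [{"cust_id": 1, "gender": "M", "sales": "120.5"}]   # e.g., order system
    source_b = [{"customer": 1, "sex": "male", "amount": 99.0}]    # e.g., billing system

    def extract():
        return list(source_a), list(source_b)

    def transform(recs_a, recs_b):
        staged = []
        for r in recs_a:
            staged.append({"customer_id": r["cust_id"],
                           "gender": {"M": "male", "F": "female"}[r["gender"]],
                           "amount": float(r["sales"])})
        for r in recs_b:
            staged.append({"customer_id": r["customer"],
                           "gender": r["sex"],
                           "amount": float(r["amount"])})
        # Deduplicate on (customer_id, amount) as a stand-in for real matching rules.
        seen, clean = set(), []
        for r in staged:
            key = (r["customer_id"], r["amount"])
            if key not in seen:
                seen.add(key)
                clean.append(r)
        return clean

    def load(staged, warehouse):
        # A real warehouse would bulk-load at scheduled intervals.
        warehouse.extend(staged)

    warehouse = []
    load(transform(*extract()), warehouse)
    print(warehouse)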
 Data Storage Components
The data storage for a data warehouse is separate from operational systems to accommodate large
volumes of historical data for analysis. Unlike operational databases, which prioritize current data and
efficient processing through normalized formats, data warehouses store data in structures optimized
for analysis rather than quick retrieval. Data warehouses are "read-only" to ensure stability for analysis,
with data represented as snapshots at specified periods. They typically use relational database
management systems (RDBMS) and sometimes multidimensional database management systems
(MDDBs) for aggregating and storing summary data in proprietary formats.
 Information Delivery Component
The information delivery component enables users to subscribe to data warehouse data and have it
transferred to one or more destinations according to a customer-specified scheduling
algorithm.
 Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database
management system. In the data dictionary, we keep the data about the logical data structures, the
data about the records and addresses, the information about the indexes, and so on.

viii. Metadata
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database
management system.
 Types of Metadata:
a) Operational Metadata:
Data for the data warehouse comes from diverse operational systems with varying data structures,
field lengths, and data types. Integration involves splitting, combining, and managing multiple coding
schemes. Operational metadata tracks these details to link delivered information back to the original
source data sets.
b) Extraction and Transformation Metadata:
Extraction and transformation metadata contain data about the extraction of data from the source
systems, namely, the extraction frequencies, extraction methods, and business rules for the data
extraction. Also, this category of metadata contains information about all the data transformations
that take place in the data staging area.
c) End-User Metadata:
The end-user metadata is the navigational map of the data warehouse. It enables the end-users to find
information from the data warehouse. The end-user metadata allows the end-users to use their own
business terminology and look for information in those ways in which they normally think of the
business.

ix. Applications of Data Warehouse


1) Social media websites
Social media is a great example of data warehousing: the social media industry is growing, and so is
the need to implement DW in it.
2) Construction (material-based industries)
Data warehouse approach in construction industry seems to be efficient in decision making as it
provides construction managers the complete internal and external knowledge about available data
so that they can measure and monitor the construction performance.
3) Manufacturing Industry
DW plays a vital role in things ranging from daily household goods to industrial products. The
manufacturing industry includes product and process design, scheduling, planning, production,
maintenance, and huge investments in equipment, manpower, and heavy machinery.
a) Trend Analysis: It is a technique that is used to predict future outcomes from historical results or
information.
b) Market Segmentation: Market segmentation is the identification of customers' behavior and the
common characteristics related to purchases of the company's products.
Many organizations are focusing on integrating data warehouses to get the best behavior analysis.
4) Banking
Bank intelligence is the ability to gather, manage, and analyze a large amount of data on bank
customers, products, operations, services, suppliers, partners and all the transactions. Many data
warehouse flavors are designed for the support of banking industry.
5) Education
DW in education field is becoming popular day by day. Use of DW in educational field presents several
potential benefits in making appropriate decisions and for evaluating data in time which is the basic
target of DW process. On a large scale, a DW can integrate the information of different institutes into
a single central repository for analysis and strategic decision making.

B] Data Mining
i. Definition
Data mining is often used as a synonym for another popularly used term, knowledge discovery from
data, or KDD; it is also viewed as an essential step in the process of knowledge discovery. Other terms
similar to data mining are knowledge mining from data, knowledge extraction, data/pattern analysis,
data archaeology, and data dredging.
ii. Steps in Data Mining

1. Data cleaning (to remove noise and inconsistent data):


Missing values, noise, and inconsistencies contribute to inaccurate data. So far, we have looked at
techniques for handling missing data and for smoothing data. “But data cleaning is a big job. What
about data cleaning as a process? How exactly does one proceed in tackling this task? Are there any
tools out there to help?” The first step in data cleaning as a process is discrepancy detection.
Discrepancies can be caused by several factors, including poorly designed data entry forms that have
many optional fields, human error in data entry, deliberate errors (e.g., respondents not wanting to
divulge information about themselves), and data decay (e.g., outdated addresses). Discrepancies may
also arise from inconsistent data representations and inconsistent use of codes. Other sources of
discrepancies include errors in instrumentation devices that record data and system errors. Errors can
also occur when the data are (inadequately) used for purposes other than originally intended. There
may also be inconsistencies due to data integration (e.g., where a given attribute can have different
names in different databases).

Commercial tools can assist in the data transformation step. Data migration tools allow simple
transformations to be specified such as to replace the string “gender” by “sex.” ETL
(extraction/transformation/loading) tools allow users to specify transforms through a graphical user
interface (GUI). These tools typically support only a restricted set of transforms so that, often, we may
also choose to write custom scripts for this step of the data cleaning process.
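
As a small illustration of discrepancy detection and a simple transformation, the following Python/pandas sketch (records and domain rules are made up) flags missing values, inconsistent codes, and out-of-range entries, then maps the codes onto one scheme:

    import pandas as pd

    # Made-up records showing typical discrepancies: a missing value,
    # inconsistent use of codes, and an out-of-range (data-entry error) value.
    df = pd.DataFrame({
        "gender": ["M", "female", None, "F"],
        "age": [34, 29, 41, 230],
    })

    # Discrepancy detection: missing values and values outside known domains.
    print(df.isna().sum())                                  # missing per column
    print(df[~df["gender"].isin(["M", "F"]) & df["gender"].notna()])
    print(df[(df["age"] < 0) | (df["age"] > 120)])          # simple range rule

    # One simple transformation: map inconsistent codes onto a single scheme.
    df["gender"] = df["gender"].replace({"female": "F", "male": "M"})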

2. Data integration (where multiple data sources may be combined):


Data mining often requires data integration—the merging of data from multiple data stores. Careful
integration can help reduce and avoid redundancies and inconsistencies in the resulting data set. This
can help improve the accuracy and speed of the subsequent data mining process.
It is likely that your data analysis task will involve data integration, which combines data from multiple
sources into a coherent data store, as in data warehousing. These sources may include multiple
databases, data cubes, or flat files. There are a number of issues to consider during data integration.
Schema integration and object matching can be tricky. How can equivalent real-world entities from
multiple data sources be matched up? This is referred to as the entity identification problem.
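
A minimal sketch of the entity identification problem, using hypothetical source tables: two sources describe the same customers, but the matching key is named differently in each.

    import pandas as pd

    # Two sources describe the same real-world customers; the key attribute
    # is named differently in each (hypothetical names).
    crm = pd.DataFrame({"cust_id": [1, 2], "name": ["Ana", "Ben"]})
    billing = pd.DataFrame({"customer_number": [1, 2], "total": [250.0, 80.0]})

    # Schema integration: declare that cust_id and customer_number denote
    # the same real-world entity, then merge into one coherent store.
    merged = (crm.merge(billing, left_on="cust_id", right_on="customer_number")
                 .drop(columns="customer_number"))
    print(merged)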

3. Data selection (where data relevant to the analysis task are retrieved from the database):

When selecting data for data mining, the goal is to choose data that will effectively support the
objectives of your analysis.
1. Based on Objectives

 Predictive Modeling: For tasks such as forecasting or classification, select data that includes
both input features (independent variables) and target variables (dependent variables). For
instance, if predicting customer churn, you might include customer demographics,
transaction history, and previous interaction data.
 Descriptive Analytics: To understand historical patterns and relationships, select
comprehensive datasets that capture relevant attributes. For example, to analyze customer
behavior, you might need purchase history, website interactions, and customer feedback.
 Anomaly Detection: For identifying outliers or anomalies, focus on data with normal and
historical behavior patterns. Ensure you have a good representation of typical cases to
effectively spot deviations.
 Clustering: Choose data that includes features relevant to the segmentation or grouping you
aim to achieve. For instance, in market segmentation, include demographic, behavioral, and
transactional data.

2. Data Granularity

 Detail Level: Depending on the analysis, select data with the appropriate level of detail.
For high-level trends, aggregate data might suffice, while detailed, granular data is
necessary for in-depth analysis.
 Aggregation: Sometimes, data needs to be aggregated to provide a summary or higher-level
view. For example, daily sales data might be aggregated into monthly or yearly totals
for trend analysis.
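
A brief Python/pandas sketch of objective-driven data selection, using a hypothetical customer table (all column names are invented for illustration):

    import pandas as pd

    # Hypothetical customer table.
    customers = pd.DataFrame({
        "age": [25, 40, 31],
        "monthly_spend": [120.0, 310.5, 95.0],
        "support_calls": [1, 0, 4],
        "churned": [0, 0, 1],        # known outcome: the target variable
    })

    # Predictive modeling: select the input features plus the target variable.
    X = customers[["age", "monthly_spend", "support_calls"]]
    y = customers["churned"]

    # Clustering/descriptive analytics: no target is needed, so select only
    # the behavioral attributes relevant to the grouping.
    behavior = customers[["monthly_spend", "support_calls"]]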

4. Data transformation (where data are transformed and consolidated into forms appropriate for mining
by performing summary or aggregation operations):

In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Strategies for data transformation include the following:
1. Smoothing, which works to remove noise from the data. Techniques include binning, regression, and
clustering.
2. Attribute construction (or feature construction), where new attributes are constructed and added
from the given set of attributes to help the mining process.
3. Aggregation, where summary or aggregation operations are applied to the data. For example, the
daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is
typically used in constructing a data cube for data analysis at multiple abstraction levels.
4. Normalization, where the attribute data are scaled so as to fall within a smaller range, such as −1.0
to 1.0, or 0.0 to 1.0.
5. Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by interval labels
(e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels, in turn, can be
recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric
attribute. More than one concept hierarchy can be defined for the same attribute to accommodate the
needs of various users.
6. Concept hierarchy generation for nominal data, where attributes such as street can be generalized
to higher-level concepts, like city or country. Many hierarchies for nominal attributes are implicit within
the database schema and can be automatically defined at the schema definition level.
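
The following Python/pandas sketch (toy values only) illustrates three of these strategies: aggregation, normalization, and discretization into interval and concept labels:

    import pandas as pd

    ages = pd.Series([5, 17, 33, 64], name="age")

    # Normalization (strategy 4): min-max scaling into the range [0.0, 1.0].
    normalized = (ages - ages.min()) / (ages.max() - ages.min())

    # Discretization (strategy 5): replace raw ages with interval labels
    # or with higher-level concept labels.
    intervals = pd.cut(ages, bins=[0, 10, 20, 60, 120])
    concepts = pd.cut(ages, bins=[0, 20, 60, 120],
                      labels=["youth", "adult", "senior"])

    # Aggregation (strategy 3): daily amounts rolled up to one total.
    daily_sales = pd.Series([100.0, 250.0, 80.0])
    monthly_total = daily_sales.sum()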

5. Data mining (an essential process where intelligent methods are applied to extract data patterns):
Data mining supports knowledge discovery by finding hidden patterns and associations, constructing
analytical models, performing classification and prediction, and presenting the mining results using
visualization tools.
Information processing, based on queries, can find useful information. However, answers to such
queries reflect the information directly stored in databases or computable by aggregate functions. They
do not reflect sophisticated patterns or regularities buried in the database. Therefore, information
processing is not data mining.
Online analytical processing comes a step closer to data mining because it can derive information
summarized at multiple granularities from user-specified subsets of a data warehouse.
The functionalities of OLAP and data mining can be viewed as disjoint: OLAP is a data
summarization/aggregation tool that helps simplify data analysis, while data mining allows the
automated discovery of implicit patterns and interesting knowledge hidden in large amounts of data.
OLAP tools are targeted toward simplifying and supporting interactive data analysis, whereas the goal
of data mining tools is to automate as much of the process as possible, while still allowing users to
guide the process. In this sense, data mining goes one step beyond traditional online analytical
processing.

6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on
interestingness measures):
The use of only support and confidence measures to mine associations may generate a large
number of rules, many of which can be uninteresting to users. Instead, we can augment the
support–confidence framework with a pattern interestingness measure, which helps focus the
mining toward rules with strong pattern relationships. The added measure substantially reduces
the number of rules generated and leads to the discovery of more meaningful rules. Besides those
introduced in this section, many other interestingness measures have been studied in the
literature. Unfortunately, most of them do not have the null-invariance property. Because large
data sets typically have many null-transactions, it is important to consider the null-invariance
property when selecting appropriate interestingness measures for pattern evaluation. Four commonly
studied null-invariant measures are all_confidence, max_confidence, Kulczynski (Kulc), and cosine.
Figure 1.7: Data mining as a step in the process of knowledge discovery.
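
As a quick worked example, the following Python snippet computes the four null-invariant measures from illustrative support counts, using their standard textbook definitions:

    # Null-invariant measures for itemsets A and B, computed from support
    # counts. The counts below are illustrative, not from the text.
    n_a, n_b, n_ab = 1000, 800, 600     # transactions containing A, B, both

    p_a_given_b = n_ab / n_b            # confidence of the rule B => A
    p_b_given_a = n_ab / n_a            # confidence of the rule A => B

    all_confidence = n_ab / max(n_a, n_b)
    max_confidence = max(p_a_given_b, p_b_given_a)
    kulc = 0.5 * (p_a_given_b + p_b_given_a)
    cosine = n_ab / (n_a * n_b) ** 0.5

    print(all_confidence, max_confidence, kulc, cosine)

With these counts, Kulc averages the two conditional probabilities (0.5 × (0.75 + 0.6) = 0.675), so the A-B relationship is judged without being distorted by the many transactions containing neither A nor B.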

iii. Types of Data that can be mined


1. Database Data:
A database management system (DBMS) organizes and manages interrelated data through software
programs. It defines structures, manages storage, and ensures secure, consistent data access despite
system crashes or unauthorized attempts. A relational database consists of tables with columns
(attributes) and rows (tuples) storing objects identified by unique keys. It often employs semantic
models like entity-relationship (ER) models to represent entities and relationships within the database.
2. Data Warehouses:
A data warehouse is a centralized repository of information from multiple sources, organized under a
unified schema and typically located at a single site. It facilitates easy analysis of historical data across
major subjects like customers, items, suppliers, and activities. Data in a data warehouse is summarized
and structured for decision-making, often using multidimensional data cubes where each dimension
represents attributes from the schema and stores aggregate measures like counts or sums.
3. Transactional Data:
A transactional database records transactions such as customer purchases, flight bookings, or user
interactions on a web page. Each transaction includes a unique transaction ID and a list of items
involved. Additional tables in the database may store details like item descriptions, salesperson
information, branch details, and more related to the transactions.

iv. Kinds of Patterns that can be mined


a) Class/Concept Characterization and Discrimination:
Data characterization is a summarization of the general characteristics or features of a target class
of data. The data corresponding to the user-specified class are typically collected by a query. Data
discrimination is a comparison of the general features of the target class data objects against the
general features of objects from one or multiple contrasting classes. The target and contrasting
classes can be specified by a user, and the corresponding data objects can be retrieved through
database queries. The output of data characterization can be presented in various forms, including pie
charts, bar charts, curves, multidimensional data cubes, and multidimensional tables (crosstabs). The
forms of output presentation for discrimination are similar, although discrimination descriptions
should include comparative measures that help to distinguish between the target and contrasting
classes.
b) Mining Frequent Patterns, Associations, and Correlations:
Frequent patterns, as the name suggests, are patterns that occur frequently in data. There are many
kinds of frequent patterns, including frequent itemsets, frequent subsequences (also known as
sequential patterns), and frequent substructures. A substructure can refer to different structural
forms (e.g., graphs, trees, or lattices) that may be combined with itemsets or subsequences. If a
substructure occurs frequently, it is called a (frequent) structured pattern. Mining frequent patterns
leads to the discovery of interesting associations and correlations within data.
c) Classification and regression:
Classification is the process of finding a model (or function) that describes and distinguishes data
classes or concepts. The model is derived based on the analysis of a set of training data (i.e., data
objects for which the class labels are known) and is then used to predict the class label of objects
for which the class label is unknown.
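
A minimal classification sketch with scikit-learn's decision tree (toy data and labels; a real task would require proper training/testing evaluation):

    from sklearn.tree import DecisionTreeClassifier

    # Toy training data: objects whose class labels are known.
    X_train = [[25, 120.0], [40, 310.5], [31, 95.0], [52, 400.0]]
    y_train = ["churn", "stay", "churn", "stay"]

    # Derive the model from the training data, then use it to predict the
    # class label of an object whose label is unknown.
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print(model.predict([[29, 110.0]]))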
d) Cluster Analysis:
Unlike classification and regression, which analyze class-labeled (training) data sets, clustering analyzes
data objects without consulting class labels. In many cases, class labeled data may simply not exist at
the beginning.
Clustering can be used to generate class labels for a group of data.
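
A minimal clustering sketch using scikit-learn's k-means on toy unlabeled points:

    from sklearn.cluster import KMeans

    # Data objects with no class labels; clustering groups them by similarity.
    X = [[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]]

    # k-means generates cluster labels for the objects, e.g. [0, 0, 1, 1].
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)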
e) Outlier Analysis:
A data set may contain objects that do not comply with the general behavior or model of the data.
These data objects are outliers. Many data mining methods discard outliers as noise or exceptions.
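
A tiny sketch of one common outlier test, flagging values more than two standard deviations from the mean (the threshold is a common convention, not something specified in the text):

    from statistics import mean, stdev

    # Mostly similar measurements plus one value that does not comply with
    # the general behavior of the data (toy numbers).
    values = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7, 10.4, 42.0]

    mu, sigma = mean(values), stdev(values)
    outliers = [v for v in values if abs(v - mu) > 2 * sigma]
    print(outliers)    # -> [42.0]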

v. Technologies Used
a) Statistics:
Statistics studies the collection, analysis, interpretation or explanation, and presentation of data.
Data mining has an inherent connection with statistics. A statistical model is a set of mathematical
functions that describe the behavior of the objects in a target class in terms of random variables and
their associated probability distributions.
b) Machine Learning:
Machine learning investigates how computers can learn (or improve their performance) based on data.
A main research area is for computer programs to automatically learn to recognize complex patterns
and make intelligent decisions based on data. Types of Learning include:
1. Supervised Learning
2. Unsupervised Learning
3. Active Learning
c) Database Systems and warehouses:
Database systems research focuses on creating, maintaining, and using databases with principles in
data models, query languages, optimization, storage, indexing, and scalability. Data mining benefits
from scalable database technologies for efficient handling of large datasets, supporting advanced data
analysis needs. Modern database systems integrate data warehousing and data mining capabilities,
using multidimensional data cubes for OLAP and multidimensional data mining.
d) Information Retrieval:
Information retrieval (IR) searches unstructured text or multimedia documents and, unlike database
systems, relies on keyword-based queries and probabilistic models, such as bag-of-words and topic
modeling, to analyze documents and data. Integrating IR with data mining aids effective search and
analysis amid growing online data.

vi. Applications Targeted


a) Business Intelligence:
Business intelligence (BI) provides historical, current, and predictive insights into business operations,
including reporting, online analytical processing, and predictive analytics. Data mining is crucial in BI
for market analysis, customer feedback comparison, competitor analysis, customer retention, and
strategic decision-making. BI relies on data warehousing, multidimensional data mining, classification,
prediction, and clustering techniques to enhance understanding of customer behavior and optimize
business strategies.
b) Web Search Engines:
Web search engines are complex data mining applications that use algorithms to crawl, index, and rank
web pages and other data sources. They handle vast amounts of data with computer clouds and face
challenges in real-time query processing and context-aware recommendations. Maintaining models
for evolving data streams is crucial, despite the skewed distribution of queries.
c) Scientific Analysis:
Scientific simulations generate huge volumes of data every day, including data collected from nuclear
laboratories, data about human psychology, and more. Data mining techniques are capable of analyzing
these data. We can now capture and store new data faster than we can analyze the data already
accumulated.
d) Intrusion Detection:
A network intrusion refers to any unauthorized activity on a digital network. Network intrusions often
involve stealing valuable network resources. Data mining techniques play a vital role in intrusion
detection, searching for network attacks and anomalies. They help in selecting and refining useful and
relevant information from large data sets, and help classify relevant data for an intrusion detection
system. The intrusion detection system then generates alarms about foreign invasions in the network
traffic.
e) Research:
Data mining techniques can perform predictions, classification, clustering, associations, and grouping
of data with precision in the research area, and the rules they generate help produce distinctive
findings. In most technical research in data mining, we create a training model and a testing model.
The train/test strategy is used to measure the precision of the proposed model: the data set is split
into two sets, a training data set used to build the model and a testing data set used to evaluate it.
f) Financial/Banking Sector:
A credit card company can leverage its vast warehouse of customer transaction data to identify
customers most likely to be interested in a new credit product.
 Credit card fraud detection.
 Identify ‘Loyal’ customers.
 Extraction of information related to customers.
 Determine credit card spending by customer groups.

vii. Issues in Data Mining


a) Mining Methodology:
Researchers are advancing data mining methodologies to explore diverse types of knowledge, navigate
multidimensional space, and integrate disciplines like natural language processing. They address
challenges such as data uncertainty and noise, enhancing pattern evaluation through user-defined
measures and constraints for more meaningful discoveries in interconnected environments.
b) User Interaction:
The user plays an important role in the data mining process. Interesting areas of research include how
to interact with a data mining system, how to incorporate a user’s background knowledge in mining,
and how to visualize and comprehend data mining results.
c) Efficiency and scalability:
Efficiency and scalability are crucial in data mining algorithms, especially with increasing data volumes.
Algorithms must be optimized for predictable, short running times to handle large and dynamic data
sets effectively. Parallel, distributed, and incremental approaches are essential for managing data
complexity and optimizing performance, leveraging cloud and cluster computing for scalable solutions.
d) Diversity of Database Types:
The diversity of database types poses challenges to data mining, including handling various data
formats like structured, semi-structured, and unstructured data. Specialized data mining systems are
needed for in-depth analysis of specific data types. Mining dynamic, networked, and global data
repositories, interconnected by the Internet and diverse networks, is crucial but challenging due to the
heterogeneous nature and semantics of the data. Fields like Web mining, multisource data mining, and
information network mining are actively evolving to address these complexities.
e) Data mining and Society:
Data mining impacts society by enabling scientific discovery, business efficiency, and security
enhancements, but it raises concerns about privacy and data protection. Techniques like privacy-
preserving data mining aim to balance data utility with individual privacy rights. Many everyday
systems use data mining to improve functionality, such as intelligent search engines and online stores,
often without users being aware of it. These applications integrate data mining to enhance user
experience and service delivery seamlessly.

Conclusion:
 Studied the need for data warehousing.
 Features of data warehousing and mining, such as subject orientation and integrated data, were studied.
 The difference between a data warehouse and a data mart was studied.
 The architecture of a data warehouse and its components were studied.
 The strategic information provided by a data warehouse was studied.
 Applications of data warehousing in banking, education, and other fields were studied.
 The practical approach to designing a data warehouse, which takes the advantages of the top-down and
bottom-up approaches while eliminating their limitations, was studied.
 Data mining is used for knowledge generation.
