Data Warehousing, Data Mining, OLAP and OLTP Technologies Are Indispensable Elements To Support Decision-Making Process in Industrial World
ISSN 2250-3153
Abstract- This paper provides an overview of data warehousing, data mining, OLAP and OLTP technologies, exploring their features, new applications and the architecture of data warehousing and data mining. The data warehouse supports on-line analytical processing (OLAP), the functional and performance requirements of which are quite different from those of the on-line transaction processing (OLTP) applications traditionally supported by operational databases. Data warehouses provide OLAP tools for the interactive analysis of multidimensional data at varied granularities, which facilitates effective data mining. Data warehousing and OLAP are essential elements of decision support, which has increasingly become a focus of the database industry. OLTP is customer-oriented and is used for transaction and query processing by clerks, clients and information technology professionals. An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives and analysts. Data warehousing and OLAP have emerged as leading technologies that facilitate data storage, organization and efficient retrieval. Decision support places rather different requirements on database technology compared to traditional on-line transaction processing applications.

Index Terms- Data Warehousing, OLAP, OLTP, Data Mining, Decision Making and Decision Support, Data Marts, Metadata, ETL (Extraction, Transformation and Loading), Server, Data Warehouse Architecture.

I. INTRODUCTION

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months or 12 months ago, or even older data, from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold only the most recent address of a customer, whereas a data warehouse can hold all addresses ever associated with a customer.

Non-volatile: Once data is in the data warehouse, it will not change; historical data in a data warehouse should never be altered.

Ralph Kimball provided a more concise definition of a data warehouse: a data warehouse is a copy of transaction data specifically structured for query and analysis. This is a functional view of a data warehouse; Kimball did not address how the data warehouse is built, as Inmon did, but rather focused on the functionality of a data warehouse.

Data warehousing is a collection of decision support technologies aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions. Data warehousing technologies have been successfully deployed in many industries: manufacturing (order shipment and customer support), retail (user profiling and inventory management), financial services (claims analysis, risk analysis, credit card analysis and fraud detection), transportation (fleet management), telecommunications (call analysis and fraud detection), utilities (power usage analysis) and healthcare (outcomes analysis). This paper presents a roadmap of data warehousing technologies, focusing on the special requirements that data warehouses place on database management systems (DBMSs).
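The time-variant and non-volatile properties above can be illustrated with a minimal sketch (the table names and data are hypothetical, not from the paper): an OLTP-style store overwrites a customer's address in place, while a warehouse-style store only ever appends, so every historical address remains queryable.

```python
from datetime import date

# OLTP-style store: one row per customer; an update overwrites in place.
oltp_customer = {}  # customer_id -> current address only

# Warehouse-style store: non-volatile and time-variant -> append-only history.
warehouse_history = []  # list of (customer_id, address, valid_from) rows

def update_address(customer_id, address, valid_from):
    oltp_customer[customer_id] = address  # the old value is lost
    warehouse_history.append((customer_id, address, valid_from))  # never altered

update_address(42, "12 Oak Street", date(2013, 1, 5))
update_address(42, "7 Elm Avenue", date(2015, 3, 20))

print(oltp_customer[42])  # only the most recent address survives
print([row[1] for row in warehouse_history if row[0] == 42])  # full history
```

The transaction system can answer only "where does customer 42 live now?", while the warehouse can also answer "where did customer 42 live in 2013?", which is exactly the time-variant distinction drawn above.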
International Journal of Scientific and Research Publications, Volume 5, Issue 5, May 2015 (www.ijsrp.org)
analysis and decision-making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of different users. These systems are called on-line analytical processing (OLAP) systems.

3.1 Major distinguishing features between OLTP and OLAP

i) Users and system orientation: OLTP is customer-oriented and is used for transaction and query processing by clerks, clients and information technology professionals. An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives and analysts.
ii) Data contents: An OLTP system manages current data in very detailed format, while an OLAP system manages large amounts of historical data and provides facilities for summarization and aggregation. Moreover, information is stored and managed at different levels of granularity, which makes the data easier to use in informed decision-making.
iii) Database design: An OLTP system generally adopts an entity-relationship data model and an application-oriented database design. An OLAP system adopts either a star or snowflake model and a subject-oriented database design.

IV. DATA MINING

Data mining is the extraction or "mining" of knowledge from a large amount of data or from a data warehouse. To do this extraction, data mining combines artificial intelligence, statistical analysis and database management systems to attempt to pull knowledge from stored data. Data mining is the process of applying intelligent methods to extract data patterns. This is done using front-end tools; the spreadsheet is still the most compelling front-end application for on-line analytical processing (OLAP). Data mining is:
– the automatic discovery of relationships in typically large databases and, in some instances, the use of the discovered results in predicting relationships;
– an essential process where intelligent methods are applied in order to extract data patterns.
Data mining lets you be proactive: prospective rather than retrospective.

4.1 Why mine data? Commercial viewpoint

Lots of data is being collected and warehoused, and computing has become affordable. Competitive pressure is strong:
– provide better, customized services for an edge;
– information is becoming a product in its own right.

4.2 Why mine data? Scientific viewpoint

Data is collected and stored at enormous speeds:
– remote sensors on satellites;
– telescopes scanning the skies;
– microarrays generating gene expression data;
– scientific simulations generating terabytes of data.
Traditional techniques are infeasible for such raw data. Data mining is used for data reduction:
– cataloging, classifying and segmenting data;
– helping scientists in hypothesis formation.

4.3 Major Data Mining Tasks

Classification: predictive, predicting an item's class.
Association rule discovery: descriptive.
Clustering: descriptive, finding groups of items.
Sequential pattern discovery: descriptive.
Deviation detection: predictive, finding changes.
Forecasting: predictive, predicting a parameter value.
Description: descriptive, describing a group.
Link analysis: finding relationships and associations.

4.3.1 Classification: Definition

Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, find a model for the class attribute as a function of the values of the other attributes. The goal is that previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model: usually, the given data set is divided into a training set, used to build the model, and a test set, used to validate it.

4.3.1.1 Classification: Application

Direct marketing.
– Goal: reduce the cost of mailing by targeting the set of customers likely to buy a new cell-phone product.
– Approach: use the data for a similar product introduced before. We know which customers decided to buy and which decided otherwise; this {buy, don't buy} decision forms the class attribute. Collect various demographic, lifestyle and company-interaction related information about all such customers (type of business, where they stay, how much they earn, etc.) and use this information as input attributes to learn a classifier model.
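The classification procedure described above (train/test split, build a model on the training set, validate on unseen records) can be sketched in a few lines. The sketch below uses made-up direct-marketing records and a hand-rolled 1-nearest-neighbour model; the attributes (income, age) and data values are illustrative assumptions, not from the paper.

```python
import math

# Made-up direct-marketing records: (income_in_thousands, age) -> class label.
data = [
    ((30.0, 25.0), "don't buy"),
    ((85.0, 40.0), "buy"),
    ((28.0, 22.0), "don't buy"),
    ((90.0, 45.0), "buy"),
    ((32.0, 30.0), "don't buy"),
    ((80.0, 38.0), "buy"),
]

# Divide the data set into a training set (to build the model)
# and a test set (to validate it), as described above.
train, test = data[:4], data[2:]
train, test = data[:4], data[4:]

def predict(x):
    """1-nearest-neighbour: the 'model' is the training set itself."""
    nearest = min(train, key=lambda rec: math.dist(rec[0], x))
    return nearest[1]

# Accuracy on previously unseen records from the test set.
correct = sum(predict(x) == label for x, label in test)
accuracy = correct / len(test)
print(accuracy)
```

With the {buy, don't buy} decision as the class attribute, the learned model can then score new prospects before any mail is sent, which is precisely the cost-reduction goal stated above.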
• Scenario 1: Given both the network structure and all variables observable: compute only the CPT (conditional probability table) entries.
• Scenario 2: Network structure known, some variables hidden: a gradient descent (greedy hill-climbing) method, i.e., search for a solution along the steepest descent of a criterion function. Weights are initialized to random probability values; at each iteration, the method moves towards what appears to be the best solution at the moment, without backtracking; weights are updated at each iteration and converge to a local optimum.
• Scenario 3: Network structure unknown, all variables observable: search through the model space to reconstruct the network topology.
• Scenario 4: Unknown structure, all variables hidden: no good algorithms are known for this purpose.
• See D. Heckerman, "A Tutorial on Learning with Bayesian Networks," in Learning in Graphical Models, M. Jordan, ed., MIT Press, 1999.

An n-dimensional input vector x is mapped into a variable y by means of a scalar product and a nonlinear function mapping:

y = sign( Σ_{i=0}^{n} w_i x_i − μ_k )

The inputs to a unit are the outputs from the previous layer. They are multiplied by their corresponding weights to form a weighted sum, which is added to the bias associated with the unit; then a nonlinear activation function is applied to the result.

1.5 Genetic Algorithms (GA)

• Genetic algorithms are based on an analogy to biological evolution.
• An initial population is created consisting of randomly generated rules, each rule represented by a string of bits. E.g., the rule "if A1 and ¬A2 then C2" can be encoded as 100; if an attribute has k > 2 values, k bits can be used.
• Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring.
• The fitness of a rule is represented by its classification accuracy on a set of training examples.
• Offspring are generated by crossover and mutation.
• The process continues until a population P evolves in which each rule in P satisfies a pre-specified fitness threshold.
• Genetic algorithms are slow but easily parallelizable.

5.4.1 Rough Set Approach

Rough sets are used to approximately, or "roughly," define equivalence classes. A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C). Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard, but a discernibility matrix (which stores the differences between attribute values for each pair of data tuples) can be used to reduce the computation intensity.

For example, see Fig. 3: Rough set approach.

VI. ACTIVE LEARNING

Class labels are expensive to obtain. An active learner queries a human (an oracle) for labels. The pool-based approach uses a pool of unlabeled data: L is a small labeled subset of D, and U is a pool of unlabeled data in D. A query function is used to carefully select one or more tuples from U and request their labels from an oracle (a human annotator). The newly labeled samples are added to L, and a model is learned from L. The goal is to achieve high accuracy using as little labeled data as possible. Active learning is evaluated using learning curves: accuracy as a function of the number of instances queried (the number of tuples to be queried should be small). A research issue is how to choose the data tuples to be queried.
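The pool-based active-learning loop can be sketched as follows. This is an illustrative toy, not from the paper: the data are one-dimensional points, the "model" is a single decision threshold, the query function picks the tuple from U closest to the current threshold (the least certain one), and the oracle is simulated by the true labeling rule.

```python
# Pool-based active learning sketch: L = labeled set, U = unlabeled pool.

def oracle(x):
    """Stands in for the human annotator; true boundary is x = 5.0."""
    return 1 if x >= 5.0 else 0

L = [(0.0, 0), (10.0, 1)]      # small initially labeled subset of D
U = [2.0, 4.0, 4.9, 6.0, 8.0]  # pool of unlabeled data in D

def fit_threshold(labeled):
    """Model: midpoint between the largest 0-example and smallest 1-example."""
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    return (max(zeros) + min(ones)) / 2.0

for _ in range(3):  # query only a small number of tuples
    boundary = fit_threshold(L)
    query = min(U, key=lambda x: abs(x - boundary))  # least-certain tuple
    U.remove(query)
    L.append((query, oracle(query)))  # newly labeled sample added to L

boundary = fit_threshold(L)
print(boundary)
```

After only three queries the learned threshold sits close to the true boundary of 5.0, illustrating the goal stated above: high accuracy from as few labeled tuples as possible.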