2017 Summer Model Answer Paper
MAHARASHTRA STATE BOARD OF TECHNICAL EDUCATION
(Autonomous)
(ISO/IEC 27001:2005 Certified)
MODEL ANSWER
SUMMER 2017 EXAMINATION
Subject Title: DATA WAREHOUSING AND DATA MINING    Subject Code: 17520
Important Instructions to examiners:
1) The answers should be examined by key words and not word-for-word as given in the model answer scheme.
2) The model answer and the answer written by the candidate may vary, but the examiner may try to assess the understanding level of the candidate.
3) Language errors such as grammatical or spelling errors should not be given more importance (not applicable for the subjects English and Communication Skills).
4) While assessing figures, the examiner may give credit for principal components indicated in the figure. The figures drawn by the candidate and the model answer may vary. The examiner may give credit for any equivalent figure drawn.
5) Credits may be given stepwise for numerical problems. In some cases, the assumed constant values may vary, and there may be some difference between the candidate's answers and the model answer.
6) In case of some questions, credit may be given by the examiner's judgement of the relevant answer, based on the candidate's understanding.
7) For programming language papers, credit may be given to any other program based on an equivalent concept.
Ans: (Any 4 needs: 1 mark each)
1) Advanced query processing: In most businesses, even the best database systems are bound to either a single server or a handful of servers in a cluster. A data warehouse is a purpose-built hardware solution far more advanced than standard database servers. What this means is that a data warehouse will process queries much faster and more effectively, leading to efficiency and increased productivity.
2) Better consistency of data: Developers work with data warehousing systems after data has been received, so that all the information contained in the data warehouse is standardized. Only uniform data can be used efficiently for successful comparisons. Other solutions simply cannot match a data warehouse's level of consistency.
3) Improved user access: A standard database can be read and manipulated by programs like SQL Query Studio or the Oracle client, but there is considerable ramp-up time for end users to effectively use these apps to get what they need. Business intelligence and data warehouse end-user access tools are built specifically for the purposes data warehouses are used for: analysis, benchmarking, prediction and more.
4) All-in-one: a data warehouse has the ability to receive data from many different
sources, meaning any system in a business can contribute its data. Let's face it: different
business segments use different applications. Only a proper data warehouse solution can
receive data from all of them and give a business the "big picture" view that is needed to
analyze the business, make plans, track competitors and more.
5) Future-proof: a data warehouse doesn't care where it gets its data from. It can work
with any raw information and developers can "massage" any data it may have trouble
with. Considering this, you can see that a data warehouse will outlast other changes in
the business' technology. For example, a business can overhaul its accounting system,
choose a whole new CRM solution or change the applications it uses to gather statistics
on the market and it won't matter at all to the data warehouse. Upgrading or overhauling
apps anywhere in the enterprise will not require subsequent expenditures to change the
data warehouse side.
6) Retention of data history: end-user applications typically don't have the ability, not
to mention the space, to maintain much transaction history and keep track of multiple
changes to data. Data warehousing solutions have the ability to track all alterations to
data, providing a reliable history of all changes, additions and deletions. With a data
warehouse, the integrity of data is ensured.
(b) What is data cleaning technique? Explain any one technique in detail. 4M
Ans: (Definition: 2 marks; Explanation of any one technique: 2 marks)
Data cleaning is performed as a data preprocessing step while preparing the data for a data warehouse. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. Noise is a random error or variance in a measured variable. Given a numerical attribute such as, say, price, we can "smooth" out the data to remove the noise by the following data smoothing techniques:
1. Binning: Binning methods smooth a sorted data value by consulting the values around it. The sorted values are distributed into a number of "bins." In smoothing by bin means, each value in a bin is replaced by the mean value of the bin; in smoothing by bin medians, each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is then replaced by the closest boundary value. In general, the larger the width, the greater the effect of the smoothing. Alternatively, bins may be equal-width, where the interval range of values in each bin is constant. (A small code sketch of binning follows this list of techniques.)
Figure: Binning
2. Regression: Data can be smoothed by fitting the data to a function, such as with
regression. Linear regression involves finding the “best” line to fit two attributes (or
variables), so that one attribute can be used to predict the other. Multiple linear
regression is an extension of linear regression, where more than two attributes are
involved and the data are fit to a multidimensional surface.
3. Clustering: Outliers may be detected by clustering, where similar values are
organized into groups, or “clusters.” Intuitively, values that fall outside of the set of
clusters may be considered outliers.
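A minimal Python sketch of the binning technique (item 1 above), assuming equal-depth bins and made-up price values; the function names and numbers are illustrative, not part of the model answer.

def equal_depth_bins(values, depth):
    # Split the sorted values into consecutive bins of `depth` elements each.
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

def smooth_by_means(values, depth):
    smoothed = []
    for bin_ in equal_depth_bins(values, depth):
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))          # every value becomes the bin mean
    return smoothed

def smooth_by_boundaries(values, depth):
    smoothed = []
    for bin_ in equal_depth_bins(values, depth):
        low, high = bin_[0], bin_[-1]                # bin boundaries
        smoothed.extend([low if v - low <= high - v else high for v in bin_])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]          # hypothetical sorted price data
print(smooth_by_means(prices, 3))       # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
print(smooth_by_boundaries(prices, 3))  # [4, 4, 15, 21, 21, 24, 25, 25, 34]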
c) Describe Multidimensional data model. 4M
Ans: (Diagram: 2 marks; Explanation: 2 marks)
In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies. This organization provides users with the flexibility to view data from different perspectives. A number of OLAP data cube operations exist to materialize these different views, allowing interactive querying and analysis of the data at hand. Hence, OLAP provides a user-friendly environment for interactive data analysis.
This cube is referred to as the central cube. The measure displayed is dollars sold (in
thousands). The data examined are for the cities Chicago, New York, Toronto, and
Vancouver.
d) What is concept description? 4M
Ans: Concept description is the most basic form of descriptive data mining; it generates descriptions for the characterization and comparison of classes of data.
1. Data Characterization: This refers to summarizing the data of the class under study. This class under study is called the Target Class.
Ans: (Each characteristic: 1 mark; any 6)
1) Subject Oriented: Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.
3) Nonvolatile: Nonvolatile means that, once entered into the warehouse, data
should not change. This is logical because the purpose of a warehouse is to
enable you to analyze what has occurred.
8) Consistency: The structure and contents of the data are very important and can only be guaranteed by the use of metadata; this is independent of the source and collection date of the data.
Ans: (Definition: 2 marks; Explanation of any 4 techniques: 1 mark each)
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results. Strategies for data reduction include the following:
1. Data cube aggregation: where aggregation operations are applied to the data in the construction of a data cube.
3. Dimensionality reduction: where encoding mechanisms are used to reduce the data set size (a small sketch follows this list).
5. Discretization and concept hierarchy generation: where raw data values for
attributes are replaced by ranges or higher conceptual levels. Data discretization is a
form of numerosity reduction that is very useful for the automatic generation of concept
hierarchies. Discretization and concept hierarchy generation are powerful tools for data
mining, in that they allow the mining of data at multiple levels of abstraction.
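Referring back to strategy 3 (dimensionality reduction), the following is a minimal sketch of principal components analysis via the singular value decomposition; the data matrix and the choice of two components are illustrative assumptions, not part of the model answer.

import numpy as np

# Hypothetical data set: 6 records described by 4 numeric attributes.
X = np.array([
    [2.5, 2.4, 0.5, 1.1],
    [0.5, 0.7, 1.9, 2.2],
    [2.2, 2.9, 0.4, 1.0],
    [1.9, 2.2, 0.6, 1.3],
    [3.1, 3.0, 0.3, 0.9],
    [2.3, 2.7, 0.5, 1.2],
])

Xc = X - X.mean(axis=0)                        # center each attribute
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2                                          # keep the 2 strongest components
X_reduced = Xc @ Vt[:k].T                      # 6 x 2 encoding of the 6 x 4 data
print(X_reduced.shape)                         # (6, 2)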
2. Answer any two of the following: (8×2=16 Marks)
(a) Describe discretization and concept hierarchy generation for numeric and categorical data. 8M
Ans: (Numeric data: 4 marks; Categorical data: 4 marks)
Concept hierarchies for numeric attributes can be constructed automatically based on data distribution analysis. Five methods for concept hierarchy generation are defined below: binning, histogram analysis, entropy-based discretization, cluster analysis, and data segmentation by natural partitioning.

Binning: Attribute values can be discretized by distributing the values into bins and replacing each bin by the bin mean or bin median value. These techniques can be applied recursively to the resulting partitions in order to generate concept hierarchies.

Histogram Analysis: Histograms can also be used for discretization. Partitioning rules can be applied to define ranges of values. The histogram analysis algorithm can be applied recursively to each partition in order to automatically generate a multilevel concept hierarchy, with the procedure terminating once a prespecified number of concept levels has been reached. A minimum interval size can be used per level to control the recursive procedure; this specifies the minimum width of a partition, or the minimum number of partitions at each level.

Cluster Analysis: A clustering algorithm can be applied to partition data into clusters or groups. Each cluster forms a node of a concept hierarchy, where all nodes are at the same conceptual level. Each cluster may be further decomposed into sub-clusters, forming a lower level in the hierarchy. Clusters may also be grouped together to form a higher level of the concept hierarchy.

Segmentation by natural partitioning: Breaking up annual salaries into ranges like ($50,000-$100,000) is often more desirable than ranges like ($51,263.89-$60,765.30) arrived at by cluster analysis. The 3-4-5 rule can be used to segment numeric data into relatively uniform "natural" intervals. In general, the rule partitions a given range of data into 3, 4, or 5 equal-width intervals, recursively, level by level, based on the value range at the most significant digit. The rule can be recursively applied to each interval, creating a concept hierarchy for the given numeric attribute.
Categorical data are discrete data. Categorical attributes have a finite number of distinct values, with no ordering among the values; examples include geographic location, item type and job category. There are several methods for generation of concept hierarchies for categorical data.

Specification of a partial ordering of attributes explicitly at the schema level by experts: Concept hierarchies for categorical attributes or dimensions typically involve a group of attributes. A user or an expert can easily define a concept hierarchy by specifying a partial or total ordering of the attributes at the schema level. A hierarchy can be defined at the schema level such as street < city < province or state < country.

Specification of a portion of a hierarchy by explicit data grouping: This is essentially a manual definition of a portion of a concept hierarchy. In a large database, it is unrealistic to define an entire concept hierarchy by explicit value enumeration. However, it is realistic to specify explicit groupings for a small portion of the intermediate-level data.

Specification of a set of attributes but not their partial ordering: A user may specify a set of attributes forming a concept hierarchy, but omit to specify their partial ordering. The system can then try to automatically generate the attribute ordering so as to construct a meaningful concept hierarchy.
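A minimal sketch of binning-based discretization of a numeric attribute into concept-hierarchy labels, using pandas; the age values, cut points and label names are illustrative assumptions, not part of the model answer.

import pandas as pd

ages = pd.Series([13, 19, 23, 31, 38, 45, 52, 61, 70])   # hypothetical raw values

# Replace raw values by higher-level concepts (one level of a concept hierarchy).
labels = pd.cut(ages, bins=[0, 20, 40, 60, 120],
                labels=["youth", "young adult", "middle-aged", "senior"])
print(labels.value_counts().sort_index())

# The same idea can be applied recursively, e.g. grouping these four labels
# into an even coarser level such as "young" versus "old".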
b) Describe the following schemes for multidimensional database. 8M
1)Star
2)Snowflake
Ans: (4 marks each)
Star Schema:
•In a star schema each dimension is represented with only one dimension table.
•This dimension table contains the set of attributes.
•In the following diagram we have shown the sales data of a company with respect to the
four dimensions namely, time, item, branch and location.
•There is a fact table at the center. This fact table contains the keys to each of four
dimensions.
•The fact table also contains the attributes dollars sold and units sold.
Snowflake Schema:
•In Snowflake schema some dimension tables are normalized.
•Dimensions with hierarchies can be decomposed into a snowflake structure when you
want to avoid joins to big dimension tables when you are using an aggregate of the fact
table.
•The normalization splits up the data into additional tables.
•Unlike the star schema, the dimension tables in a snowflake schema are normalized; for example, the item dimension table of the star schema is normalized and split into two dimension tables, namely an item table and a supplier table.
•Therefore now the item dimension table contains the attributes item key, item name,
type, brand, and supplier-key.
•The supplier key is linked to supplier dimension table. The supplier dimension table
contains the attributes supplier key, and supplier type.
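A minimal sketch of how a fact table resolves its foreign keys against dimension tables in a star schema, using pandas; the table contents and column names are illustrative assumptions, not the schema in the answer's diagram.

import pandas as pd

# Hypothetical dimension tables (one table per dimension, as in a star schema).
item = pd.DataFrame({"item_key": [1, 2],
                     "item_name": ["TV", "Laptop"],
                     "brand": ["A", "B"]})
location = pd.DataFrame({"location_key": [10, 20],
                         "city": ["Chicago", "Toronto"]})

# Fact table: foreign keys to each dimension plus the measures.
sales_fact = pd.DataFrame({
    "item_key": [1, 2, 1],
    "location_key": [10, 10, 20],
    "dollars_sold": [400.0, 900.0, 350.0],
    "units_sold": [2, 1, 1],
})

# Join fact to dimensions, then aggregate a measure by a dimension attribute.
wide = sales_fact.merge(item, on="item_key").merge(location, on="location_key")
print(wide.groupby("city")["dollars_sold"].sum())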
c) State the association rules in data mining. Write applications of each rule. 8M
Ans: (Statement of rule: 4 marks; Application: 4 marks)
Similar to the mining of association rules in transactional and relational databases, spatial association rules can be mined in spatial databases. A spatial association rule is of the form A ⇒ B [s%, c%], where A and B are sets of spatial or nonspatial predicates, s% is the support of the rule, and c% is the confidence of the rule. For example, the following is a spatial association rule:

is_a(X, "school") ∧ close_to(X, "sports center") ⇒ close_to(X, "park") [0.5%, 80%]

This rule states that 80% of schools that are close to sports centers are also close to parks, and 0.5% of the data belongs to such a case. Various kinds of spatial predicates can constitute a spatial association rule. Examples include distance information (such as close_to and far_away), topological relations (like intersect, overlap, and disjoint), and spatial orientations (like left_of and west_of).

Since spatial association mining needs to evaluate multiple spatial relationships among a large number of spatial objects, the process could be quite costly. An interesting mining optimization method called progressive refinement can be adopted in spatial association analysis. The method first mines large data sets roughly using a fast algorithm and then improves the quality of mining in a pruned data set using a more expensive algorithm. To ensure that the pruned data set covers the complete set of answers when applying the high-quality data
mining algorithms at a later stage, an important requirement for the rough mining algorithm applied in the early stage is the superset coverage property: that is, it preserves all of the potential answers. In other words, it should allow a false-positive test, which might include some data sets that do not belong to the answer sets, but it should not allow a false-negative test, which might exclude some potential answers.

For mining spatial associations related to the spatial predicate close_to, we can first collect the candidates that pass the minimum support threshold by applying certain rough spatial evaluation algorithms, for example, using an MBR structure (which registers only two spatial points rather than a set of complex polygons) and evaluating the relaxed spatial predicate g_close_to, which is a generalized close_to covering a broader context that includes close_to, touch, and intersect. If two spatial objects are closely located, their enclosing MBRs must be closely located, matching g_close_to. However, the reverse is not always true: if the enclosing MBRs are closely located, the two spatial objects may or may not be located so closely. Thus, MBR pruning is a false-positive testing tool for closeness: only those objects that pass the rough test need to be further examined using more expensive spatial computation algorithms. With this preprocessing, only the patterns that are frequent at the approximation level need to be examined by more detailed and finer, yet more expensive, spatial computation.

Besides mining spatial association rules, one may like to identify groups of particular features that appear frequently close to each other in a geospatial map. Such a problem is essentially the problem of mining spatial co-locations. Finding spatial co-locations can be considered a special case of mining spatial associations. However, based on the property of spatial autocorrelation, interesting features likely coexist in closely located regions, so spatial co-location may be just what one really wants to explore. Efficient methods can be developed for mining spatial co-locations by exploring methodologies like Apriori and progressive refinement, similar to what has been done for mining spatial association rules.

Mining Associations in Multimedia Data: Association rules involving multimedia objects can be mined in image and video databases. At least three categories can be observed:
•Associations between image content and non-image content features: A rule like "If at least 50% of the upper part of the picture is blue, then it is likely to represent sky" belongs to this category, since it links the image content to the keyword sky.
•Associations among image contents that are not related to spatial relationships: A rule like "If a picture contains two blue squares, then it is likely to contain one red circle as well" belongs to this category, since the associations all concern image contents.
•Associations among image contents related to spatial relationships: A rule like "If a red triangle is between two yellow squares, then it is likely a big oval-shaped object is underneath" belongs to this category, since it associates objects in the image with spatial relationships.
3. Answer any four of the following: (4×4=16 Marks)
Ans: (Definition: 1 mark; Any 3 types: 1 mark each)
Decision support systems are interactive software-based systems intended to help managers in decision making by accessing large volumes of information generated from various related information systems involved in organizational business processes, such as office automation systems, transaction processing systems, etc. A DSS uses summary information, exceptions, patterns and trends derived using analytical models. A decision support system helps in decision making but does not always give a decision itself. The decision makers compile useful information from raw data, documents, personal knowledge, and/or business models to identify and solve problems and make decisions.
Programmed and Non-programmed Decisions: There are two types of decisions, programmed and non-programmed. Programmed decisions are basically automated processes and general routine work. Non-programmed decisions occur in unusual situations, where:
•These decisions are based on the manager's discretion, instinct, perception and judgment.
Decision support systems generally involve non-programmed decisions. Therefore, there will be no exact report, content or format for these systems.
b) Explain benefits of data warehousing. 4M
Ans: (Any 4 benefits: 1 mark each)
OLAP (online analytical processing) is a function of business intelligence software that enables a user to easily and selectively extract and view data from different points of view. OLAP technology is a vast improvement over traditional relational database management systems (RDBMS). Relational databases, which have a two-dimensional structure, do not allow the multidimensional data views that OLAP provides. Traditionally used as an analytical tool for marketing and financial reporting, OLAP is now viewed as a valuable tool for any management system that needs to create a flexible decision support system. Today's work environment is characterized by flatter organizations that need to be able to adapt quickly to changing conditions. Managers need tools that allow them to make quick, intelligent decisions on the fly. Making the wrong decision, or taking too long to make it, can affect the competitive position of an organization. OLAP provides the multidimensional capabilities that most organizations need today. By using a multidimensional data store, also known in the industry as a hypercube, OLAP allows the end user to analyze data along the axes of their business. The two most common forms of analysis that most businesses use are called "slice and dice" and "drill down".
Frequency by itself doesn't tell the whole story. For instance, if I told you hot dogs and buns were purchased 820 times together, you wouldn't know whether that was relevant or not. Therefore we introduce two other measures, called support and confidence, to help with the analysis. If you divide the frequency by the total number of orders you get the percentage of orders containing the pair. This is called the support. Another way of thinking about support is as the probability of the pair being purchased. Now if 820 hot dog and bun pairs were purchased together and your store took 1000 orders, the support would be calculated as (820 / 1000) = 82.0%. We can extend this even further by
defining a calculation called confidence. Confidence compares the number of times the pair was purchased to the number of times one of the items in the pair was purchased. In probability terms this is referred to as the conditional probability of the pair. So, going back to our hot dogs example, if hot dogs were purchased 900 times and, out of those 900 purchases, 820 contained buns, we would have a confidence of (820 / 900) = 91.1%. Now that we've defined frequency, support and confidence, we can talk a little about what a market basket analysis report might look like. The report would have the user select the product they are interested in performing the analysis on (i.e. hot dogs). Then it would list all the products that were purchased together with the selected product, ranked by frequency. It might look something like the following.
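A minimal sketch of computing support and confidence for a pair of items over a list of orders; the order counts mirror the hypothetical hot-dogs-and-buns numbers used above (1000 orders, 900 with hot dogs, 820 with both).

# Each order is the set of items bought together.
orders = ([{"hot dogs", "buns"}] * 820 +
          [{"hot dogs"}] * 80 +
          [{"milk"}] * 100)

def support(orders, itemset):
    # Fraction of all orders containing every item in `itemset`.
    return sum(itemset <= order for order in orders) / len(orders)

def confidence(orders, antecedent, consequent):
    # Conditional probability of `consequent` given `antecedent`.
    return support(orders, antecedent | consequent) / support(orders, antecedent)

print(support(orders, {"hot dogs", "buns"}))       # 0.82
print(confidence(orders, {"hot dogs"}, {"buns"}))  # 0.911...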
Data cleaning: Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
Data integration: Integration of multiple databases, data cubes, or files.
Data transformation: Normalization and aggregation.
Data reduction: Obtains a reduced representation in volume but produces the same or similar analytical results.
Data discretization: Part of data reduction but with particular importance, especially for numerical data.
Ans: (Classes: 2 marks, any 2; Categories: 2 marks, any 2)
DSS have been classified in different ways as the concept matured with time. As and when the full potential and possibilities of the field emerged, different classification systems also emerged. Some of the well-known classification models are given below.
According to Donovan and Madnick (1977), DSS can be classified as:
1) Institutional: when the DSS supports ongoing and recurring decisions.
2) Ad hoc: when the DSS supports a one-off kind of decision. Hackathorn and Keen (1981) gave another classification.
Alter (1980) opined that decision support systems could be classified into seven types based on their generic nature of operations. He described the seven types as:
1). File drawer systems. This type of DSS primarily provides access to data stores/data
related items. Examples: ATM machine; use the balance to make transfer-of-funds decisions.
2) Data analysis systems. This type of DSS supports the manipulation of data through the use of specific or generic computerized settings or tools. Examples: airline reservation system; use the information to make flight plans.
3) Analysis information systems. This type of DSS provides access to sets of decision-oriented databases and simple small models.
4) Accounting and financial models. This type of DSS can perform 'what if' analysis and calculate the outcomes of different decision paths. Example: calculating production cost.
Ans: (Explanation: 4 marks)
A substantial portion of the available information is stored in text databases (or document databases), which consist of large collections of documents from various sources, such as news articles, research papers, books, digital libraries, e-mail messages, and Web pages. Text databases are rapidly growing due to the increasing amount of information available in electronic form, such as electronic publications, various kinds of electronic documents, e-mail, and the World Wide Web (which can also be viewed as a huge, interconnected, dynamic text database). Nowadays most of the information in government, industry, business, and other institutions is stored electronically, in the form of text databases.

Data stored in most text databases are semistructured, in that they are neither completely unstructured nor completely structured. For example, a document may contain structured fields, such as title, authors, publication date and category, but also contain some largely unstructured text components, such as the abstract and contents. There has been a great deal of study on the modelling and implementation of semistructured data in recent database research. Moreover, information retrieval techniques, such as text indexing methods, have been developed to handle unstructured documents.

Traditional information retrieval techniques become inadequate for the increasingly vast amounts of text data. Typically, only a small fraction of the many available documents will be relevant to a given individual user. Without knowing what could be in the documents, it is difficult to formulate effective queries for analyzing and extracting useful information from the data. Users need tools to compare different documents, rank the importance and relevance of the documents, or find patterns and trends across multiple documents. Thus, text mining has become an increasingly popular and essential theme in data mining.

Text Mining Approaches: There are many approaches to text mining, which can be classified from different perspectives, based on the inputs taken by the text mining system and the data mining tasks to be performed. In general, the major approaches, based on the kinds of data they take as input, are:
1. The keyword-based approach, where the input is a set of keywords or terms in the
documents,
2. The tagging approach, where the input is a set of tags, and
3. The information-extraction approach, which inputs semantic information, such as
events, facts, or entities uncovered by information extraction.
iii) Draw block diagram of data warehouse architecture and list its components. 4M
Ans: (Block diagram: 2 marks; List: 2 marks)
The three tiers of the data warehouse architecture are the warehouse database server (bottom tier), the OLAP server (middle tier), and the front-end client tools (top tier).
1) The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (such as customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse. The data are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server. Examples of gateways include ODBC
(Open Database Connectivity) and OLE DB (Object Linking and Embedding Database) by Microsoft, and JDBC (Java Database Connectivity). This tier also contains a metadata repository, which stores information about the data warehouse and its contents.
2) The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps operations on multidimensional data to standard relational operations; or (2) a multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly implements multidimensional data and operations.
3) The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
Ans: (Explanation: 2 marks; Example: 2 marks)
A data analysis task will often involve data integration, which combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files. There are a number of issues to consider during data integration. Schema integration and object matching can be tricky. How can equivalent real-world entities from multiple data sources be matched up? This is referred to as the entity identification problem.
Example: For example, how can the data analyst or the computer be sure that customer_id in one database and cust_number in another refer to the same attribute? Examples of metadata for each attribute include the name, meaning, data type, and range of values permitted for the attribute, and null rules for handling blank, zero, or null values. Such metadata can be used to help avoid errors in schema integration. The metadata may also be used to help transform the data (e.g., where data codes for pay type in one database may be "H" and "S", and 1 and 2 in another). Hence, this step also relates to data cleaning. Redundancy is another important issue. An attribute (such as annual revenue, for instance) may be redundant if it can be "derived" from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
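A minimal sketch of one common way to flag a possibly redundant numeric attribute: checking its correlation with another attribute. The data and the 0.9 threshold are illustrative assumptions, not part of the model answer.

import numpy as np

# Hypothetical attributes: annual_revenue is (almost) derivable from monthly_revenue.
monthly_revenue = np.array([10.0, 12.0, 9.5, 15.0, 11.0])
annual_revenue = monthly_revenue * 12 + np.array([0.1, -0.2, 0.0, 0.3, -0.1])

r = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]
if abs(r) > 0.9:                                   # threshold chosen for illustration
    print(f"correlation {r:.3f}: annual_revenue looks redundant")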
b) Attempt any one of the following: (6×1=6 Marks)
Ans: (Description: 3 marks; Algorithm: 3 marks)
Apriori is a seminal algorithm for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties. Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k + 1)-itemsets. First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and collecting those items that satisfy minimum support. The resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database. Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them (where strong association rules satisfy both minimum support and minimum confidence). This can be done using the equation for confidence:

confidence(A ⇒ B) = P(B | A) = support_count(A ∪ B) / support_count(A)
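A minimal sketch of the level-wise Apriori search described above, assuming a small in-memory list of transactions and an absolute minimum-support count; the data and helper names are illustrative, not a reference implementation.

def apriori(transactions, min_support):
    # Return all frequent itemsets (as frozensets) with their support counts.
    transactions = [set(t) for t in transactions]

    def count(candidates):
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = count(items)                    # L1: frequent 1-itemsets
    result, k = dict(frequent), 2
    while frequent:
        # Candidate k-itemsets built from unions of frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        frequent = count(candidates)           # Lk: one pass over the database
        result.update(frequent)
        k += 1
    return result

orders = [["hot dogs", "buns"], ["hot dogs", "buns", "mustard"], ["hot dogs"], ["buns"]]
for itemset, n in apriori(orders, min_support=2).items():
    print(set(itemset), n)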
Informational Data:
•Focuses on providing answers to problems posed by decision makers
•Summarized
•Non-updateable
a) Describe four different OLAP operations in the multidimensional model with neat diagram. 8M
Ans: (Diagram: 2 marks; Points: 2 marks; Explanation: 4 marks)
In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies. This organization provides users with the flexibility to view data from different perspectives. A number of OLAP data cube operations exist to materialize these different views, allowing interactive querying and analysis of the data at hand. Hence, OLAP provides a user-friendly environment for interactive data analysis.
In the example data cube, location is aggregated with respect to city values, time is aggregated with respect to quarters, and item is aggregated with respect to item types. This cube is referred to as the central cube. The measure displayed is dollars sold (in thousands). The data examined are for the cities Chicago, New York, Toronto, and Vancouver.

Roll-up: The roll-up operation (also called the drill-up operation by some vendors) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. For example, a roll-up operation can be performed on the central cube by climbing up the concept hierarchy for location, defined as the total order "street < city < province or state < country." Such a roll-up aggregates the data by ascending the location hierarchy from the level of city to the level of country; in other words, rather than grouping the data by city, the resulting cube groups the data by country. When roll-up is performed by dimension reduction, one or more dimensions are removed from the given cube. For example, consider a sales data cube containing only the two dimensions location and time. Roll-up may be performed by removing, say, the time dimension,
resulting in an aggregation of the total sales by location, rather than by location and by time.

Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions. For example, a drill-down operation can be performed on the central cube by stepping down a concept hierarchy for time defined as "day < month < quarter < year." Drill-down then occurs by descending the time hierarchy from the level of quarter to the more detailed level of month, and the resulting data cube details the total sales per month rather than summarizing them by quarter. Because a drill-down adds more detail to the given data, it can also be performed by adding new dimensions to a cube, for example by introducing an additional dimension such as customer group.

Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting in a subcube; for example, the sales data may be selected from the central cube for the dimension time using the criterion time = "Q1". The dice operation defines a subcube by performing a selection on two or more dimensions, for example using the selection criteria (location = "Toronto" or "Vancouver") and (time = "Q1" or "Q2") and (item = "home entertainment" or "computer").

Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data, for example rotating the item and location axes in a 2-D slice. Other examples include rotating the axes in a 3-D cube, or transforming a 3-D cube into a series of 2-D planes.
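A minimal sketch of roll-up, slice and dice on a flat sales table using pandas; the data, column names and city-to-country mapping are illustrative assumptions, not the cube from the answer's figure.

import pandas as pd

sales = pd.DataFrame({
    "city":    ["Chicago", "Toronto", "Vancouver", "Chicago", "Toronto"],
    "country": ["USA", "Canada", "Canada", "USA", "Canada"],
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q2"],
    "item":    ["computer", "phone", "computer", "phone", "computer"],
    "dollars_sold": [400, 300, 250, 500, 150],
})

# Roll-up: climb the location hierarchy from city to country.
print(sales.groupby(["country", "quarter"])["dollars_sold"].sum())

# Slice: select a single value on one dimension (time = "Q1").
print(sales[sales["quarter"] == "Q1"])

# Dice: select on two or more dimensions at once.
print(sales[sales["quarter"].isin(["Q1", "Q2"]) & sales["item"].eq("computer")])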
“Given a set of sequences, where each sequence consists of a list of events (or elements)
and each event consists of a set of items, and given a user-specified minimum support
threshold of min sup, sequential pattern mining finds all frequent subsequences, that is,
the subsequences whose occurrence frequency in the set of sequences is no less than min
sup.”
Let I = {I1, I2, ..., Ip} be the set of all items. An itemset is a nonempty set of items. A sequence is an ordered list of events. A sequence s is denoted ⟨e1 e2 e3 ... el⟩, where event e1 occurs before e2, which occurs before e3, and so on. Event ej is also called an element of s. In the case of customer purchase data, an event refers to a shopping trip in which a customer bought items at a certain store. The event is thus an itemset, that is, an unordered list of items that the customer purchased during the trip. The itemset (or event) is denoted (x1 x2 ... xq), where xk is an item. For brevity, the brackets are omitted if an element has only one item; that is, element (x) is written as x.

Suppose that a customer made several shopping trips to the store. These ordered events form a sequence for the customer: the customer first bought the items in e1, then later bought the items in e2, and so on. An item can occur at most once in an event of a sequence, but can occur multiple times in different events of a sequence. The number of instances of items in a sequence is called the length of the sequence. A sequence with length l is called an l-sequence.

A sequence α = ⟨a1 a2 ... an⟩ is called a subsequence of another sequence β = ⟨b1 b2 ... bm⟩, and β is a supersequence of α, denoted α ⊑ β, if there exist integers 1 ≤ j1 < j2 < ... < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, ..., an ⊆ bjn. For example, if α = ⟨(ab), d⟩ and β = ⟨(abc), (de)⟩, where a, b, c, d, and e are items, then α is a subsequence of β and β is a supersequence of α.

A sequence database S is a set of tuples ⟨SID, s⟩, where SID is a sequence ID and s is a sequence. For our example, S contains sequences for all customers of the store. A tuple ⟨SID, s⟩ is said to contain a sequence α if α is a subsequence of s. The support of a sequence α in a sequence database S is the number of tuples in the database containing α, that is, supportS(α) = |{⟨SID, s⟩ | (⟨SID, s⟩ ∈ S) ∧ (α ⊑ s)}|. It can be denoted as support(α) if the sequence database is clear from the context. Given a positive integer min_sup as the minimum support threshold, a sequence α is frequent in sequence database S if supportS(α) ≥ min_sup. That is, for sequence α to be frequent, it must occur at least min_sup times in S. A frequent sequence is called a sequential pattern. A sequential pattern with length l is called an l-pattern.
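A minimal sketch of the subsequence test and sequence support defined above, representing each sequence as a list of item sets; the example sequence database is made up for illustration.

def is_subsequence(alpha, beta):
    # True if sequence alpha (a list of item sets) is a subsequence of beta.
    j = 0
    for element in alpha:
        while j < len(beta) and not set(element) <= set(beta[j]):
            j += 1                       # find an event of beta containing this element
        if j == len(beta):
            return False
        j += 1                           # events must keep their relative order
    return True

def support(database, alpha):
    # Number of sequences in the database that contain alpha.
    return sum(is_subsequence(alpha, s) for s in database)

# Hypothetical sequence database: one entry per customer, events in time order.
S = [
    [{"a", "b"}, {"d"}],
    [{"a", "b", "c"}, {"d", "e"}],
    [{"a"}, {"c"}],
]
alpha = [{"a", "b"}, {"d"}]
print(is_subsequence(alpha, S[1]))       # True: (ab) is in (abc) and (d) is in (de)
print(support(S, alpha))                 # 2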
c) Define knowledge discovery and describe any six innovative technique for 8M
knowledge discovery.
Ans: (Explanation: 8 marks)
What is Knowledge Discovery? Knowledge discovery in databases (KDD) is the non-trivial process of identifying valid, potentially useful and ultimately understandable patterns in data; it is the process of extracting previously unknown, valid, and actionable (understandable) information from large databases. Data mining is a step in the KDD process that applies data analysis and discovery algorithms drawn from machine learning, pattern recognition, statistics, databases and data visualization. Knowledge Discovery and Data Mining (KDD) is an interdisciplinary area focusing upon methodologies for extracting useful knowledge from data. The ongoing rapid growth of online data due to the Internet and the widespread use of databases have created an immense need for KDD methodologies. The challenge of extracting knowledge from data draws upon research in statistics, databases, pattern recognition, machine learning, data visualization, optimization, and high-performance computing, to deliver advanced business intelligence and web discovery solutions. Some people treat data mining the same as
knowledge discovery, while some people view data mining as an essential step in the process of knowledge discovery. Here is the list of steps involved in the knowledge discovery process:
•Data Cleaning: In this step, noise and inconsistent data are removed.
•Data Integration: In this step, multiple data sources are combined.
•Data Selection: In this step, data relevant to the analysis task are retrieved from the database.
•Data Transformation: In this step, data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
•Data Mining: In this step, intelligent methods are applied in order to extract data patterns.
•Pattern Evaluation: In this step, data patterns are evaluated.
•Knowledge Presentation: In this step, knowledge is represented.
Ans: (Description: 4 marks)
Data mining refers to extracting or "mining" knowledge from large amounts of data. Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web. The World Wide Web contains huge amounts of information, such as hyperlink information, web page access information, educational resources, etc., that provide a rich source for data mining. The basic structure of a web page is based on the Document Object Model (DOM). The DOM structure refers to a tree-like structure in which each HTML tag in the page corresponds to a node in the DOM tree. We can segment the web page by
using predefined tags in HTML. The HTML syntax is flexible; therefore, many web pages do not follow the W3C specifications, and not following those specifications may cause errors in the DOM tree structure. The DOM structure was initially introduced for presentation in the browser, not for describing the semantic structure of the web page, so the DOM structure cannot correctly identify the semantic relationship between different parts of a web page.
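A minimal sketch of treating HTML tags as nodes of a DOM-like tree, using Python's standard html.parser module; the sample page and class name are illustrative assumptions.

from html.parser import HTMLParser

class DomSketch(HTMLParser):
    # Print each start tag indented by its depth, i.e. an outline of the DOM tree.
    def __init__(self):
        super().__init__()
        self.depth = 0
    def handle_starttag(self, tag, attrs):
        print("  " * self.depth + tag)
        self.depth += 1
    def handle_endtag(self, tag):
        self.depth -= 1

page = "<html><body><div><h1>Title</h1><p>Some text</p></div></body></html>"
DomSketch().feed(page)
# html
#   body
#     div
#       h1
#       p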
Ans: (Description: 4 marks)
Metadata are data about data. When used in a data warehouse, metadata are the data that define warehouse objects. A metadata repository sits within the bottom tier of the data warehousing architecture. Metadata are created for the data names and definitions of the given warehouse. Additional metadata are created and captured for time-stamping any extracted data, the source of the extracted data, and missing fields that have been added by data cleaning or integration processes. A metadata repository should contain the following:
•A description of the structure of the data warehouse, which includes the warehouse
schema, view, dimensions, hierarchies, and derived data definitions, as well as data mart
locations and contents
•Operational metadata, which include data lineage (history of migrated data and the
sequence of transformations applied to it), currency of data (active, archived, or purged),
and monitoring information (warehouse usage statistics, error reports, and audit trails)
•The algorithms used for summarization, which include measure and dimension definition algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and predefined queries and reports
•The mapping from the operational environment to the data warehouse, which includes
source databases and their contents, gateway descriptions, data partitions, data
extraction, cleaning, transformation rules and defaults, data refresh and purging rules,
and security (user authorization and access control)
•Data related to system performance, which include indices and profiles that improve
data access and retrieval performance, in addition to rules for the timing and scheduling
of refresh, update, and replication cycles
•Business metadata, which include business terms and definitions, data ownership
information, and charging policies
c) Describe concept of hierarchy with an example. 4M
Ans: (Description: 4 marks)
A concept hierarchy for a given numerical attribute defines a discretization of the attribute. Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numerical values for the attribute age) with higher-level concepts (such as youth, middle-aged, or senior). Although detail is lost by such data generalization, the generalized data may be more meaningful and easier to interpret. This contributes to a consistent representation of data mining results among multiple mining tasks, which is a common requirement. In addition, mining on a reduced data set
requires fewer input/output operations and is more efficient than mining on a larger, ungeneralized data set. Because of these benefits, discretization techniques and concept hierarchies are typically applied before data mining as a preprocessing step, rather than during mining. Concept hierarchies for numeric attributes can be constructed automatically based on data distribution analysis, for example by binning, histogram analysis, entropy-based discretization and data segmentation by natural partitioning.

Example (binning): Attribute values can be discretized by distributing the values into bins and replacing each bin by the bin mean or bin median value; this technique can be applied recursively to the resulting partitions in order to generate concept hierarchies. With segmentation by natural partitioning, a given range of data is partitioned into 3, 4, or 5 equal-width "natural" intervals, recursively, level by level, based on the value range at the most significant digit; the rule can be recursively applied to each interval, creating a concept hierarchy for the given numeric attribute.
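A minimal sketch of a two-level concept hierarchy for the attribute age, as in the example above; the cut points and label names are illustrative assumptions.

def level1(age):
    # Leaf level of the hierarchy: numeric value to a category.
    if age < 20:
        return "youth"
    if age < 40:
        return "young adult"
    if age < 60:
        return "middle-aged"
    return "senior"

def level2(label):
    # Coarser level: group the four categories into two concepts.
    return "young" if label in ("youth", "young adult") else "old"

for a in [13, 25, 47, 68]:                       # hypothetical ages
    print(a, "->", level1(a), "->", level2(level1(a)))
# 13 -> youth -> young
# 25 -> young adult -> young
# 47 -> middle-aged -> old
# 68 -> senior -> old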
d) Describe mining descriptive statistical measure in large databases. 4M
Ans: (Description: 4 marks)
Class description can be explained in terms of popular measures, such as count, sum, and average. Relational database systems provide five built-in aggregate functions: count(), sum(), avg(), max(), and min(). These functions can also be computed efficiently (in incremental and distributed manners) in data cubes. Thus, there is no problem in including these aggregate functions as basic measures in the descriptive mining of multidimensional data. For many data mining tasks, however, users would like to learn more data characteristics regarding both central tendency and data dispersion. Measures of central tendency include mean, median, mode, and midrange, while measures of data dispersion include quartiles, outliers, and variance. These descriptive statistics are of great help in understanding the distribution of the data. Such
measures have been studied extensively in the statistical literature. From the data mining point of view, we need to examine how they can be computed efficiently in large multidimensional databases.
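A minimal sketch of the central-tendency and dispersion measures named above, computed with Python's standard statistics module over a made-up sample.

import statistics as st

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 110]   # hypothetical attribute values

mean = st.mean(values)
median = st.median(values)
mode = st.mode(values)                         # most frequent value
midrange = (min(values) + max(values)) / 2
q1, q2, q3 = st.quantiles(values, n=4)         # quartiles
variance = st.variance(values)                 # sample variance

print(mean, median, mode, midrange)
print("quartiles:", q1, q2, q3, "IQR:", q3 - q1)
print("variance:", variance)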
e) Describe issues regarding classification and prediction. 4M
Ans:
Relevance Analysis:
Many of the attributes in the data may be irrelevant to the classification or prediction
task. For example, data recording the day of the week on which a bank loan application
was filed is unlikely to be relevant to the success of the application. Furthermore, other
attributes may be redundant. Hence, relevance analysis may be performed on the data
with the aim of removing any irrelevant or redundant attributes from the learning
process. In machine learning, this step is known as feature selection. Including such
attributes may otherwise slow down, and possibly mislead, the learning step. Ideally, the
time spent on relevance analysis, when added to the time spent on learning from the
resulting “reduced” feature subset should be less than the time that would have been
spent on learning from the original set of features. Hence, such analysis can help
improve classification efficiency and scalability.
Data Transformation:
The data can be generalized to higher – level concepts. Concept hierarchies may be used
for this purpose. This is particularly useful for continuous – valued attributes. For
example, numeric values for the attribute income may be generalized to discrete ranges
such as low, medium, and high. Similarly, nominal – valued attributes like street, can be
generalized to higher – level concepts, like city. Since generalization compresses the
original training data, fewer input / output operations may be involved during learning.
The data may also be normalized, particularly when neural networks or methods involving distance measurements are used in the learning step. Normalization involves scaling all values for a given attribute so that they fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0. In methods that use distance measurements, for example, this would prevent attributes with initially large ranges (like, say, income) from outweighing attributes with initially smaller ranges (such as binary attributes).
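A minimal sketch of min-max normalization to the range 0.0 to 1.0, as described above; the income values are an illustrative assumption.

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Scale values linearly so they fall within [new_min, new_max].
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

incomes = [12000, 35000, 47000, 98000]         # hypothetical attribute values
print(min_max_normalize(incomes))              # [0.0, 0.267..., 0.407..., 1.0]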
Predictive Accuracy: This refers to the ability of the model to correctly predict the
class label of new or previously unseen data.
Speed: This refers to the computation costs involved in generating and using the model.
Robustness: This is the ability of the model to make correct predictions given noisy data
or data with missing values.
Scalability: This refers to the ability to construct the model efficiently given large amounts of data.
Interpretability: This refers to the level of understanding and insight that is provided by
the model.