
Summer - 2022
Gujarat Technological University
Data Warehousing and Data Mining
Semester - VI (IT) | Professional Elective - III (3161610) | Solved Paper

Time : 2½ Hours]                                              [Total Marks : 70

Instructions : 1) Attempt all questions.
               2) Make suitable assumptions wherever necessary.
               3) Figures to the right indicate full marks.
               4) Simple and non-programmable scientific calculators are allowed.
Q.1 a) What is data warehousing ? Explain its features. (Refer section 1.2) [3]

b) With the help of a neat diagram explain the 3-tier architecture of data warehouse.
   (Refer section 1.7) [4]

c) What is cube ? Discuss various OLAP operations on data cube. (Refer section 1.9) [7]

Q.2 a) Describe major issues in data mining. (Refer section 2.1.1) [3]

b) Do feature-wise comparison between OLAP and OLTP. (Refer section 1.1.1) [4]

c) Define the term "data mining". Why is it called data mining rather than knowledge
   mining ? Explain the process of knowledge discovery from databases with the help of
   a suitable diagram. (Refer sections 2.1 and 2.8) [7]
OR

c) Explain star, snowflake and fact constellation schema for multidimensional database
   with diagram. (Refer section 1.8) [7]

Q.3 a) List different methods for data discretization and explain any one in detail.
   (Refer section 3.7) [3]

b) What is feature selection and explain methods for feature selection.
   (Refer section 3.10) [4]

c) Explain the pre-processing required to handle missing data and noisy data during the
   process of data mining. (Refer sections 3.3.2 and 3.4.1) [7]
OR
Q.3 a) Explain the following as attribute selection measure : (i) Information gain
   (ii) Gain ratio. (Refer sections 5.5.1.1 and 5.5.1.2) [3]

b) Discuss about dimensionality reduction in brief. (Refer section 3.6.3) [4]

c) What is noise ? Why is data smoothing required ? Perform smoothing by bin means,
   by bin medians and by bin boundaries on the following data.
   Consider the data for price (in dollars) : 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 [7]

Ans. : Refer sections 3.4 and 3.4.1.

Sorted data for price (in dollars) : 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-width) bins :
  - Bin 1 (4 - 14)  : 4, 8, 9
  - Bin 2 (15 - 24) : 15, 21, 21, 24
  - Bin 3 (25 - 34) : 25, 26, 28, 29, 34
* Smoothing by bin means :
  - Bin 1 : 7, 7, 7
  - Bin 2 : 20, 20, 20, 20
  - Bin 3 : 28, 28, 28, 28, 28
* Smoothing by bin boundaries :
  - Bin 1 : 4, 4, 4
  - Bin 2 : 15, 24, 24, 24
  - Bin 3 : 25, 25, 25, 25, 34
Sorted data for price (in dollars) : 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins :
  - Bin 1 : 4, 8, 9, 15
  - Bin 2 : 21, 21, 24, 25
  - Bin 3 : 26, 28, 29, 34
* Smoothing by bin means :
  - Bin 1 : 9, 9, 9, 9
  - Bin 2 : 23, 23, 23, 23
  - Bin 3 : 29, 29, 29, 29


* Smoothing by bin boundaries :
  - Bin 1 : 4, 4, 4, 15
  - Bin 2 : 21, 21, 25, 25
  - Bin 3 : 26, 26, 26, 34
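
The same partitioning and smoothing steps can be reproduced with a short Python sketch (this sketch is not part of the textbook solution; the helper names, the use of equi-depth bins and the rounding of bin means are illustrative assumptions) :

# Illustrative sketch of equi-depth binning with smoothing (assumed helper names).

def equi_depth_bins(values, n_bins):
    """Split the sorted values into bins of equal size (equi-depth partitioning)."""
    data = sorted(values)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth(bins, method):
    """Replace every value in a bin by the bin mean, bin median or closest bin boundary."""
    out = []
    for b in bins:
        if method == "means":
            rep = round(sum(b) / len(b))                 # assumed rounding to nearest integer
            out.append([rep] * len(b))
        elif method == "medians":
            mid = len(b) // 2
            rep = b[mid] if len(b) % 2 else (b[mid - 1] + b[mid]) / 2
            out.append([rep] * len(b))
        else:  # "boundaries": snap each value to the nearer of the two bin boundaries
            lo, hi = b[0], b[-1]
            out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth(bins, "means"))            # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth(bins, "boundaries"))       # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
print(smooth(bins, "medians"))          # bin medians : 8.5, 22.5, 28.5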

Q.4 a) Why is tree pruning useful in decision tree induction ? [3]

Ans. :
* When decision trees are built, many of the branches may reflect noise or outliers in
  the training data.
* Tree pruning methods address this problem of overfitting the data.
* Tree pruning attempts to identify and remove such branches, with the goal of improving
  classification accuracy on unseen data.
* Decision trees can suffer from repetition and replication, making them overwhelming
  to interpret.
* Repetition occurs when an attribute is repeatedly tested along a given branch of the
  tree. In replication, duplicate sub-trees exist within the tree.
* These situations can impede the accuracy and comprehensibility of a decision tree.
* Pruned trees tend to be smaller and less complex and thus easier to comprehend.
* They are usually faster and better at correctly classifying independent test data
  than un-pruned trees.
* Pruned trees tend to be more compact than their un-pruned counterparts.
* There are two common approaches to tree pruning :
Pre-pruning
* In the pre-pruning approach, a tree is "pruned" by halting its construction early
  (e.g. by deciding not to further split or partition the subset of training tuples at
  a given node).
* When constructing a tree, measures such as statistical significance, information gain,
  Gini index and so on can be used to assess the goodness of a split.
* If partitioning the tuples at a node would result in a split that falls below a
  pre-specified threshold, then further partitioning of the given subset is halted.
* There are difficulties, however, in choosing an appropriate threshold.
* High thresholds could result in oversimplified trees, whereas low thresholds could
  result in very little simplification.


Post-pruning
* The second and more common approach is post-pruning, which removes subtrees from a
  "fully grown" tree.
* A subtree at a given node is pruned by removing its branches and replacing it with
  a leaf.
* The leaf is labeled with the most frequent class among the subtree being replaced.
* The cost-complexity pruning algorithm used in CART is an example of the post-pruning
  approach.
* The basic idea is that the simplest solution is preferred.
* Unlike cost-complexity pruning, it does not require an independent set of pruning
  tuples.
* Post-pruning leads to a more reliable tree.
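
As an illustration of post-pruning in practice, the following sketch uses scikit-learn's CART-style decision tree, which exposes cost-complexity pruning through its ccp_alpha parameter (the dataset, the train/test split and the particular alpha chosen here are assumptions made only for demonstration) :

# Sketch of cost-complexity (post-) pruning with scikit-learn's CART implementation.
# Dataset, split and choice of alpha are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown (un-pruned) tree.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Effective alphas along the cost-complexity pruning path of the full tree.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Refit with a non-zero alpha : a larger alpha means more aggressive pruning and a smaller tree.
pruned_tree = DecisionTreeClassifier(random_state=0,
                                     ccp_alpha=path.ccp_alphas[-2]).fit(X_train, y_train)

print("Leaves (full / pruned) :", full_tree.get_n_leaves(), pruned_tree.get_n_leaves())
print("Test accuracy (full / pruned) :",
      full_tree.score(X_test, y_test), pruned_tree.score(X_test, y_test))

Pre-pruning, by contrast, corresponds to growth-limiting constructor arguments such as max_depth or min_samples_split, which halt tree construction early instead of cutting back a fully grown tree.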
b) Explain association rules with confidence and support giving an example.
   (Refer sections 4.7 and 4.7.3) [4]

c) Explain how the accuracy of a classifier / predictor can be measured.
   (Refer sections 5.1, 5.1.1 and 5.1.2) [7]
OR

Q.4 a) Explain mining multiple-level association rules using an example. [3]

Ans. : Multilevel mining of association rules :
* Items often form a hierarchy.
* Items at the lower level are expected to have lower support.
Fig. 1 : Concept hierarchy for items - Department : Foodstuff; Sector : Frozen,
Refrigerated, Fresh, Bakery, etc.; Family : Vegetable, Fruit, Dairy, etc.;
Product : Bananas, Apple, Orange, etc.
* A common form of background knowledge is that an attribute may be generalized or
  specialized according to a hierarchy of concepts.


* Rules which contain associations with a hierarchy of concepts are called multilevel
  association rules.
Support and confidence of multilevel association rules :
* Generalizing / specializing values of attributes affects support and confidence.
* Support of rules increases from specialized to general.
* Support of rules decreases from general to specialized.
* Confidence is not affected for general to specialized.
Two approaches of multilevel association rules :
Using uniform support level for all levels :
* The same minimum support is used for all levels.
* There is only one minimum support threshold, so there is no need to examine itemsets
  containing any item whose ancestors do not have minimum support.
* If the support threshold is too high → miss low level associations.
* If the support threshold is too low → generate too many high level associations.

Example of uniform minimum support for all levels :
  Level 1 (min support = 5 %) : Milk (support = 10 %)
  Level 2 (min support = 5 %) : 2 % Milk (support = 6 %), Skim milk (support = 4 %)
Using reduced minimum support at lower levels :
* Every level of abstraction has its own minimum support threshold.
* The minimum support threshold is reduced at lower levels.
Example of reduced minimum support for lower levels :
  Level 1 (min support = 5 %) : Milk (support = 10 %)
  Level 2 (reduced min support) : 2 % Milk (support = 6 %), Skim milk (support = 4 %)
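
The effect of generalization on support can be seen in a tiny Python sketch (the transactions and the item-to-family mapping below are hypothetical, chosen only to mirror the milk example above) :

# Sketch : support of specialized items versus their generalized ancestor.
# Transactions and the item -> family mapping are hypothetical.
transactions = [
    {"2% Milk", "Bread"},
    {"Skim Milk"},
    {"2% Milk", "Apple"},
    {"Bread"},
]
hierarchy = {"2% Milk": "Milk", "Skim Milk": "Milk", "Apple": "Fruit", "Bread": "Bakery"}

def support(item):
    """Fraction of transactions containing the given low-level item."""
    return sum(item in t for t in transactions) / len(transactions)

def generalized_support(family):
    """Fraction of transactions containing any item that rolls up to the given family."""
    return sum(any(hierarchy.get(i) == family for i in t) for t in transactions) / len(transactions)

print(support("2% Milk"))            # 0.5  - specialized item, lower support
print(support("Skim Milk"))          # 0.25
print(generalized_support("Milk"))   # 0.75 - generalized ancestor, support only increases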



b) How does the k-means clustering method differ from the k-medoids clustering method ?
   (Refer sections 6.5.1 and 6.6) [4]

c) Find the frequent itemsets and generate association rules on the below given data
   using the Apriori algorithm. Assume that minimum support threshold (s = 33.33 %)
   and minimum confidence threshold (c = 60 %). [7]

Transaction ID    Items
T1                Hot Dogs, Buns, Ketchup
T2                Hot Dogs, Buns
T3                Hot Dogs, Coke, Chips
T4                Chips, Coke
T5                Chips, Ketchup
T6                Hot Dogs, Coke, Chips

Ans. :
Minimum support count = (33.33 / 100) × 6 = 2

Refer Fig. 2 below.

There is only one 3-itemset with minimum support count 2, so only one 3-itemset is frequent.
Frequent 3-itemset : {Hot Dogs, Coke, Chips}
Association rules :
{Hot Dogs ∧ Coke} => {Chips}   // confidence = sup(Hot Dogs ∧ Coke ∧ Chips) / sup(Hot Dogs ∧ Coke) = 2/2 × 100 = 100 %    // Selected
{Hot Dogs ∧ Chips} => {Coke}   // confidence = sup(Hot Dogs ∧ Coke ∧ Chips) / sup(Hot Dogs ∧ Chips) = 2/2 × 100 = 100 %   // Selected
{Coke ∧ Chips} => {Hot Dogs}   // confidence = sup(Hot Dogs ∧ Coke ∧ Chips) / sup(Coke ∧ Chips) = 2/3 × 100 = 66.67 %     // Selected
{Hot Dogs} => {Coke ∧ Chips}   // confidence = sup(Hot Dogs ∧ Coke ∧ Chips) / sup(Hot Dogs) = 2/4 × 100 = 50 %            // Rejected
{Coke} => {Hot Dogs ∧ Chips}   // confidence = sup(Hot Dogs ∧ Coke ∧ Chips) / sup(Coke) = 2/3 × 100 = 66.67 %             // Selected
{Chips} => {Hot Dogs ∧ Coke}   // confidence = sup(Hot Dogs ∧ Coke ∧ Chips) / sup(Chips) = 2/4 × 100 = 50 %               // Rejected

There are four strong rules (minimum confidence greater than 60 %).

Candidate 1-itemsets :
  Item set    Sup-count
  Hot Dogs    4
  Buns        2
  Ketchup     2
  Coke        3
  Chips       4

Frequent 1-itemsets :
  Item set    Sup-count
  Hot Dogs    4
  Buns        2
  Ketchup     2
  Coke        3
  Chips       4

Candidate 2-itemsets :
  Item set            Sup-count
  Hot Dogs, Buns      2
  Hot Dogs, Ketchup   1
  Hot Dogs, Coke      2
  Hot Dogs, Chips     2
  Buns, Ketchup       1
  Buns, Coke          0
  Buns, Chips         0
  Ketchup, Coke       0
  Ketchup, Chips      1
  Coke, Chips         3

Frequent 2-itemsets :
  Item set            Sup-count
  Hot Dogs, Buns      2
  Hot Dogs, Coke      2
  Hot Dogs, Chips     2
  Coke, Chips         3

Candidate 3-itemsets :
  Item set                 Sup-count
  Hot Dogs, Buns, Coke     0
  Hot Dogs, Buns, Chips    0
  Hot Dogs, Coke, Chips    2

Frequent 3-itemsets :
  Item set                 Sup-count
  Hot Dogs, Coke, Chips    2

Fig. 2 : Apriori candidate and frequent itemset generation
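
The support counts in Fig. 2 and the rule confidences above can be cross-checked with a brute-force Python sketch (it enumerates every itemset rather than using Apriori's level-wise candidate generation and pruning; the variable names are assumptions) :

# Brute-force check of the frequent itemsets and rule confidences computed above.
from itertools import combinations

transactions = [
    {"Hot Dogs", "Buns", "Ketchup"},   # T1
    {"Hot Dogs", "Buns"},              # T2
    {"Hot Dogs", "Coke", "Chips"},     # T3
    {"Chips", "Coke"},                 # T4
    {"Chips", "Ketchup"},              # T5
    {"Hot Dogs", "Coke", "Chips"},     # T6
]
min_sup_count, min_conf = 2, 0.60
items = sorted(set().union(*transactions))

def sup(itemset):
    """Number of transactions containing all items of the itemset."""
    return sum(itemset <= t for t in transactions)

# Frequent itemsets of every size (support count >= 2).
frequent = [set(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if sup(frozenset(c)) >= min_sup_count]
print("Frequent itemsets :", frequent)

# Rules generated from the frequent 3-itemset {Hot Dogs, Coke, Chips}.
full = frozenset({"Hot Dogs", "Coke", "Chips"})
for k in (2, 1):
    for lhs in combinations(sorted(full), k):
        lhs = frozenset(lhs)
        conf = sup(full) / sup(lhs)
        verdict = "Selected" if conf >= min_conf else "Rejected"
        print(set(lhs), "=>", set(full - lhs), "confidence = %.2f %%" % (conf * 100), verdict)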
Q.5 a) Difference between linear and logistic regression. (Refer sections 5.10.1 and 5.10.3) [3]

b) What is an outlier ? Discuss different methods for outlier detection.
   (Refer section 6.10) [4]

c) What is web mining ? Explain types of web mining. (Refer section 7.10) [7]
OR

Q.5 a) Difference between spatial mining and temporal mining. (Refer section 7.2.1) [3]

b) Discuss the k-NN classification algorithm. (Refer section 5.6) [4]

c) Explain linear regression using a suitable example along with its pros and cons.
   (Refer section 5.10.1) [7]
