CST466 Data Mining Syllabus
CST466    DATA MINING
CATEGORY: PEC    L: 2    T: 1    P: 0    CREDIT: 3    YEAR OF INTRODUCTION: 2019
Preamble: This course helps the learner to understand the concepts of data mining and data
warehousing. It covers the key processes of data mining, data preprocessing techniques,
fundamentals and advanced concepts of classification, clustering, association rule mining, web
mining and text mining. It enables the learners to develop new data mining algorithms and apply
the existing algorithms in real-world scenarios.
Prerequisite: NIL
Course Outcomes: After the completion of the course the student will be able to
CO# CO
CO1 Employ the key process of data mining and data warehousing concepts in application
domains. (Cognitive Knowledge Level: Understand)
CO2 Make use of appropriate preprocessing techniques to convert raw data into suitable
format for practical data mining tasks (Cognitive Knowledge Level: Apply)
CO3 Illustrate the use of classification and clustering algorithms in various application
domains (Cognitive Knowledge Level: Apply)
CO4 Comprehend the use of association rule mining techniques. (Cognitive Knowledge
Level: Apply)
CO5 Explain advanced data mining concepts and their applications in emerging domains
(Cognitive Knowledge Level: Understand)
Mapping of Course Outcomes with Program Outcomes (PO1 to PO12)
CO1
CO2
CO3
CO4
CO5
Assessment Pattern

Bloom's Category    Continuous Assessment Tests      End Semester Examination
                    Test 1 (%)      Test 2 (%)       Marks (%)
Remember            20              20               20
Understand          30              30               30
Apply               50              50               50
Analyze
Evaluate
Create
Mark Distribution

Total Marks    CIE Marks    ESE Marks    ESE Duration
150            50           100          3 hours
Syllabus
Module – 1 (Introduction to Data Mining and Data Warehousing)
Data warehouse - Differences between Operational Database Systems and Data Warehouses,
Multidimensional data model - Warehouse schema, OLAP Operations, Data Warehouse
Architecture, Data Warehousing to Data Mining, Data Mining Concepts and Applications,
Knowledge Discovery in Databases (KDD) vs. Data Mining, Architecture of a typical data
mining system, Data Mining Functionalities, Data Mining Issues.
Text Books
1. Dunham M H, “Data Mining: Introductory and Advanced Topics”, Pearson Education, New
Delhi, 2003.
2. Arun K Pujari, “Data Mining Techniques”, Universities Press Private Limited, 2008.
3. Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, Elsevier,
2006.
Reference Books
1. M Sudeep Elayidom, “Data Mining and Warehousing”, 1st Edition, 2015, Cengage
Learning India Pvt. Ltd.
2. Mehmed Kantardzic, “Data Mining Concepts, Methods and Algorithms”, John Wiley and
Sons, USA, 2003.
3. Pang-Ning Tan, Michael Steinbach and Vipin Kumar, “Introduction to Data Mining”,
Addison Wesley, 2006.
1. Use the methods below to normalize the following group of data: 100, 200, 300, 400,
550, 600, 680, 850, 1000
(a) min-max normalization by setting min = 0 and max = 1
(b) z-score normalization
(c) Normalization by decimal scaling
Comment on which method you would prefer to use for the given data, giving reasons as
to why.
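A minimal Python sketch of the three normalization methods applied to the values listed above; the use of NumPy and the loop used to pick the decimal-scaling factor are illustrative choices, not prescribed by the question.

import numpy as np

# The data values listed in the question.
values = np.array([100, 200, 300, 400, 550, 600, 680, 850, 1000], dtype=float)

# (a) Min-max normalization onto [0, 1]: v' = (v - min) / (max - min)
min_max = (values - values.min()) / (values.max() - values.min())

# (b) Z-score normalization: v' = (v - mean) / std (population standard deviation)
z_score = (values - values.mean()) / values.std()

# (c) Decimal scaling: divide by 10^j, where j is the smallest integer
#     such that max(|v'|) < 1 (here j = 4, so 1000 becomes 0.1).
j = 0
while (np.abs(values) / 10 ** j).max() >= 1:
    j += 1
decimal_scaled = values / 10 ** j

print(min_max)
print(z_score)
print(decimal_scaled)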
2. Identify a suitable dataset from any available resources and apply different preprocessing
steps that you have learned. Observe and analyze the output obtained. (Assignment)
2. Illustrate the working of the k-medoids algorithm for the given dataset: A1=(3,9), A2=(2,5),
A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9).
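A short Python sketch of a PAM-style k-medoids pass over the eight points; k = 2, the Manhattan distance, and the starting medoids are assumptions made for illustration, since the question does not fix them.

# PAM-style k-medoids sketch; k, the distance measure and the starting
# medoids below are illustrative assumptions.
points = [(3, 9), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]  # A1..A8

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(medoids):
    # Sum of each point's distance to its nearest medoid.
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

medoids = [points[0], points[1]]        # arbitrary initial medoids (A1, A2)
improved = True
while improved:
    improved = False
    # Try every medoid / non-medoid swap; keep a swap if it lowers the total cost.
    for m in list(medoids):
        for cand in points:
            if cand in medoids:
                continue
            trial = [cand if x == m else x for x in medoids]
            if total_cost(trial) < total_cost(medoids):
                medoids, improved = trial, True

clusters = {m: [p for p in points
                if min(medoids, key=lambda c: manhattan(p, c)) == m]
            for m in medoids}
print(medoids)
print(clusters)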
3. Take a suitable dataset from available resources and apply all the classification and clustering
algorithms that you have studied on original and preprocessed datasets. Analyze the
performance variation in terms of different quality metrics. Give a detailed report based on
the analysis. (Assignment)
1. A database has five transactions. Let min_sup = 60% and min_conf = 80%.
a) Find all frequent item sets using Apriori and FP-growth, respectively. Compare the
efficiency of the two mining processes.
b) List all of the strong association rules (with support s and confidence c) matching the
following metarule, where X is a variable representing customers, and item_i denotes
variables representing items (e.g., “A”, “B”, etc.):
∀X ∈ transaction, buys(X, item1) ∧ buys(X, item2) ⇒ buys(X, item3) [s, c]
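Since the five-transaction table is not reproduced above, the sketch below runs the level-wise Apriori procedure on hypothetical placeholder transactions purely to illustrate the join, prune, and support-counting steps; only min_sup = 60% is taken from the question.

from itertools import combinations

# Hypothetical placeholder transactions (the table from the question is not shown here).
transactions = [{'A', 'B', 'C'}, {'A', 'B'}, {'A', 'C'}, {'B', 'C'}, {'A', 'B', 'C'}]
min_sup = 0.6   # 60%, as stated in the question

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# L1: frequent 1-itemsets
items = sorted({i for t in transactions for i in t})
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
all_frequent, k = list(frequent), 2

# Repeat the join and prune steps until no further frequent itemsets are found.
while frequent:
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # Prune candidates that contain an infrequent (k-1)-subset, then count support.
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
    frequent = [c for c in candidates if support(c) >= min_sup]
    all_frequent += frequent
    k += 1

for itemset in all_frequent:
    print(sorted(itemset), support(itemset))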
2. Identify and list some scenarios in which association rule mining can be used, and then use at
least two appropriate association rule mining techniques in one of the two scenarios.
(Assignment)
1. Consider an e-mail database that stores a large number of electronic mail (e-mail)
messages. It can be viewed as a semi structured database consisting mainly of text data.
Discuss the following.
a. How can such an e-mail database be structured so as to facilitate multidimensional
search, such as by sender, by receiver, by subject, and by time?
b. What can be mined from such an e-mail database?
c. Suppose you have roughly classified a set of your previous e-mail messages as junk,
unimportant, normal, or important. Describe how a data mining system may take this
as the training set to automatically classify new e-mail messages or unclassified ones.
2. Precision and recall are two essential quality measures of an information retrieval system.
(a) Explain why it is the usual practice to trade one measure for the other.
(b) Explain why the F-score is a good measure for this purpose.
(c) Illustrate the methods that may effectively improve the F-score in an information
retrieval system.
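A tiny worked example with hypothetical retrieval counts, showing why precision and recall pull in opposite directions and why the F-score, as their harmonic mean, is only high when both are.

# Hypothetical retrieval outcome used only to illustrate the three measures.
relevant_retrieved = 40    # relevant documents that were actually retrieved
retrieved = 100            # total documents retrieved by the system
relevant = 50              # total relevant documents in the collection

precision = relevant_retrieved / retrieved      # 0.40 (falls as more is retrieved)
recall = relevant_retrieved / relevant          # 0.80 (rises as more is retrieved)

# F-score (F1) is the harmonic mean of precision and recall; it penalizes
# systems that maximize one measure by sacrificing the other.
f_score = 2 * precision * recall / (precision + recall)   # ~0.533
print(precision, recall, f_score)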
3. Explain HITS algorithm with an example.
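A compact power-iteration sketch of HITS on a small hypothetical link graph; the four pages and their links are invented for illustration.

import numpy as np

pages = ['P1', 'P2', 'P3', 'P4']
# A[i][j] = 1 if page i links to page j (hypothetical link structure).
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

hubs = np.ones(len(pages))
auths = np.ones(len(pages))
for _ in range(50):
    auths = A.T @ hubs                  # authority score: sum of hub scores linking in
    hubs = A @ auths                    # hub score: sum of authority scores linked to
    auths /= np.linalg.norm(auths)      # normalize on each iteration
    hubs /= np.linalg.norm(hubs)

print(dict(zip(pages, auths.round(3))))
print(dict(zip(pages, hubs.round(3))))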
PART A
6. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8),
compute the Euclidean and Manhattan distance between the two objects.
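A direct computation of the two distances for the given 4-dimensional objects, as a quick check.

x = (22, 1, 42, 10)
y = (20, 0, 36, 8)

# Euclidean: sqrt((22-20)^2 + (1-0)^2 + (42-36)^2 + (10-8)^2) = sqrt(45) ~ 6.71
euclidean = sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

# Manhattan: |22-20| + |1-0| + |42-36| + |10-8| = 11
manhattan = sum(abs(a - b) for a, b in zip(x, y))

print(euclidean, manhattan)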
7. The pincer search algorithm is a bi-directional search, whereas the level-wise
algorithm is a unidirectional search. Express your opinion about the statement.
8. Define support, confidence and frequent set in association data mining context.
Part B
(Answer any one question from each module. Each question carries 14 Marks)
11. (a) Suppose a data warehouse consists of three dimensions: customer, account (7)
and branch, and two measures: count (number of customers in the branch)
and balance. Draw the schema diagram using a snowflake schema and a star
schema.
(b) Explain three- tier data warehouse architecture with a neat diagram. (7)
OR
13 (a) Suppose that the data for analysis includes the attribute age. The age values (8)
for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22,
22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Use min-max normalization to transform the value 35 for age onto the
range [0, 1].
(b) Use z-score normalization to transform the value 35 for age, where
the standard deviation of age is 12.94 years.
(c) Use normalization by decimal scaling to transform the value 35 for
age.
(d) Use smoothing by bin means to smooth the above data, using a bin
depth of 3. Illustrate your steps. Comment on the effect of this
technique for the given data.
(b) With proper illustration, explain how PCA can be used for dimensionality (6)
reduction.
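A short Python sketch working through the sub-parts of 13(a); the mean is computed from the listed values, the standard deviation 12.94 is taken from the question, and the values shown in the comments are approximate.

# Sketch for question 13(a) on the listed age values.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
v = 35

# (a) min-max normalization onto [0, 1]: (35 - 13) / (70 - 13) ~ 0.386
min_max = (v - min(ages)) / (max(ages) - min(ages))

# (b) z-score normalization, using the standard deviation given in the question
mean = sum(ages) / len(ages)          # ~29.96
z = (v - mean) / 12.94                # ~0.39

# (c) decimal scaling: the smallest j with max(|v'|) < 1 is j = 2, so 35 -> 0.35
dec = v / 100

# (d) smoothing by bin means with bin depth 3 (27 values -> 9 bins of 3)
bins = [ages[i:i + 3] for i in range(0, len(ages), 3)]
smoothed = [round(sum(b) / len(b), 2) for b in bins for _ in b]

print(min_max, z, dec)
print(smoothed)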
OR
14 (a) Suppose a group of 12 sales price records has been sorted as follows: 5, 10, (8)
11, 13, 15, 35, 50, 55, 72, 92, 204, 215. Sketch examples of each of the
following sampling techniques: SRSWOR, SRSWR, cluster sampling,
stratified sampling. Use samples of size 5 and the strata “youth,” “middle-
aged,” and “senior.”
(b) Partition the above data into three bins by each of the following methods: (6)
(i) equal-frequency (equi-depth) partitioning
(ii) equal-width partitioning
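A small Python sketch for 14(b), splitting the 12 sorted prices into three equal-frequency bins and three equal-width bins.

prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
k = 3

# (i) Equal-frequency (equi-depth): each bin gets len(prices) / k = 4 values.
depth = len(prices) // k
equi_depth = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# (ii) Equal-width: each bin spans (max - min) / k = 70 units of price.
lo, width = min(prices), (max(prices) - min(prices)) / k
equal_width = [[p for p in prices
                if lo + i * width <= p < lo + (i + 1) * width
                or (i == k - 1 and p == max(prices))]
               for i in range(k)]

print(equi_depth)    # [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]
print(equal_width)   # [[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]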
15 (a) Explain the concept of a cluster as used in ROCK. Illustrate with examples (9)
(b) Consider the following dataset for a binary classification problem. (5)
A    B    Class Label
T    F    +
T    T    +
T    T    +
T    F    -
T    T    +
F    F    -
F    F    -
F    F    -
T    T    -
T    F    -
Calculate the gain in Gini index when splitting on A and B respectively.
Which attribute would the decision tree induction algorithm choose?
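A brief Python sketch computing the Gini gain for the two candidate splits; running it reproduces the weighted Gini calculation, and the attribute with the larger gain is the one the induction algorithm would pick.

# Gini gain for splitting on A (index 0) and B (index 1).
records = [('T', 'F', '+'), ('T', 'T', '+'), ('T', 'T', '+'), ('T', 'F', '-'),
           ('T', 'T', '+'), ('F', 'F', '-'), ('F', 'F', '-'), ('F', 'F', '-'),
           ('T', 'T', '-'), ('T', 'F', '-')]

def gini(rows):
    if not rows:
        return 0.0
    pos = sum(r[2] == '+' for r in rows) / len(rows)
    return 1 - pos ** 2 - (1 - pos) ** 2

def gini_gain(attr):
    parent = gini(records)                       # 1 - 0.4^2 - 0.6^2 = 0.48
    weighted = 0.0
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        weighted += len(subset) / len(records) * gini(subset)
    return parent - weighted

print('gain on A:', round(gini_gain(0), 4))      # ~0.1371
print('gain on B:', round(gini_gain(1), 4))      # ~0.1633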
OR
16 (a) For a sunburn dataset given below, find the first splitting attribute for the (10)
decision tree by using the ID3 algorithm.
17 (a) Illustrate the working of Pincer Search Algorithm with an example. (7)
(b) Describe the working of the dynamic itemset counting technique. Specify when (7)
to move an itemset from the dashed structures to the solid structures.
OR
18 (a) A database has six transactions. Let min_sup be 60% and min_conf be (9)
80%.
TID items_bought
T1 I1, I2, I3
T2 I2, I3, I4
T3 I4, I5
T4 I1, I2, I4
(b) Write the partitioning algorithm for finding large itemsets and compare its (5)
efficiency with the Apriori algorithm.
(b) Write an algorithm to find maximal frequent forward sequences to mine log (7)
traversal patterns. Illustrate the working of this algorithm.
OR
20 (a) Explain how web structure mining differs from web usage mining and (7)
web content mining. Write the CLEVER algorithm for web structure mining.
(b) Describe the different text retrieval methods. Explain the relationship between (7)
text mining, information retrieval, and information extraction.
Teaching Plan
No    Contents    No. of Lecture Hours (36 Hrs)