Assignment 2 Slot8 TTS3208 Summer

This document contains: 1. An assignment for a data warehousing and data mining course with 17 tasks assigned to students. 2. The tasks involve applying various data mining techniques such as normalization, histogram generation, correlation analysis, association rule mining to sample datasets. 3. Case studies on impacts of data mining in different domains like retail, weather forecasting and human resources are also included as tasks.

Uploaded by

Joker Joker

School of Computing

Department of CSE
Academic Year: 2021-2022

Course: 1151CS114 - Data warehousing and Data mining    Sem: Summer

Name: Dr. A. KAVITHA    Slot: S8

Assignment – II

CO Nos.   Course Outcomes                                               Level of learning domain
                                                                        (Based on revised Bloom's taxonomy)
CO3       Explain the concept of Data mining system and apply the       K2
          various preprocessing techniques on large dataset.
CO4       Apply Association rule mining, classification and             K3
          clustering techniques to discover various mining patterns.
CO5       Apply clustering techniques in various data mining            K3
          applications.

Task Allotment

S. No.  Vtu No.  Name of the Student  Task Allotment
1. vtu11804 SHAVA VENKATA ASHOK Task 1
2. vtu11827 SHEENURI RAVI
3. vtu11857 ABHAY KUMAR
4. vtu11892 YASHVARDHAN SINGH CHOUHAN Task 2
5. vtu12593 R MANJUNATHA REDDY
6. vtu12775 TULA BALAJI RAIMAJI
7. vtu12818 KAMMARA HARINATH ACHARI Task 3
8. vtu13019 CH BRAHMACHARY
9. vtu13105 P.SRI LAKSHMI KAVYA
10. vtu13293 YEDUNUTHALA ROHITH REDDY Task 4
11. vtu13710 M.VIMAL RAJ
12. vtu13744 ULLI VAMSI
13. vtu13754 BOBBALA MADHAVA REDDY Task 5
14. vtu13862 NARASIMHA NAIDU
15. vtu13863 GANDHAM KOUSHIK REDDY
16. vtu14484 KARLAPUDI PRAVEEN Task 6
17. vtu14485 V VENKATA RAHUL
18. vtu14852 PULI REMANTH SAI
19. vtu14967 VUSU PAVAN KUMAR Task 7
20. vtu15049 G.SAI RISHI
21. vtu15051 ANAM BHARATH REDDY
22. vtu15155 AYUSH KUMAR SINHA Task 8
23. vtu15209 BANDARI RAMA KRISHNA
24. vtu15223 JAKKOJU JASHWANTH
25. vtu15235 SAURABH KUMAR Task 9
26. vtu15238 RISHAB SHRESHTA
27. vtu15240 T SAI BHARGAV
28. vtu15249 TALAMPALLY PAVAN KUMAR Task 10
29. vtu15251 AKKUTHOTA RAGHAV REDDY
30. vtu15461 GADDI SAI NITISH
31. vtu15473 MEHUL PATEL Task 11
32. vtu15474 MANISH KUMAR YADAV
33. vtu15656 P SHIVARAMA SAI CHARAN
34. vtu15670 P.HARSHITH Task 12
35. vtu15673 V.BARATH KUMAR
36. vtu15715 DADDI MANEESH REDDY
37. vtu15819 ASULA RAKESH Task 13
38. vtu15840 RAJULA ABHISHEK REDDY
39. vtu15859 EPPALAPALLI YADAGIRI NARASIMHA
40. vtu15868 SANEPALLI VENKATA KUMAR REDDY Task 14
41. vtu15869 SANEPALLI VENKATA SUBBA REDDY
42. vtu15898 AKSHINTHALA VENKATA BHARATH
43. vtu15912 P GNANA SAI Task 15
44. vtu15915 M CHANDU KUMAR
45. vtu15935 MUKKOTI JANANI
46. vtu15947 RASINTI REKHA Task 16
47. vtu16144 ABHISHEK KUMAR
48. vtu16146 ANIKET RAJ
49. vtu16156 MUKESH KUMAR
50. vtu16201 RAJAT KUMAR BHAGAT Task 17
51. vtu16247 ROSHAN BANIYA
52. vtu16249 FAIZUL AZIZ

List of Tasks

Task No.   Question   Course Outcome   Knowledge Level
1. A. Suppose that the values for a given set of data are grouped into intervals.
The intervals and corresponding frequencies are as follows.
age frequency
1–5 200
5–15 450
15–20 300
20–50 1500
50–80 700
80–110 44
Using the data for age given, plot an equal-width histogram of width 10.
Course Outcome / Knowledge Level: CO3 K2; CO4 K3; CO5 K2
B. The following table shows the midterm and final exam grades obtained for
students in a database course.
X 2 6 7 10 9 10 13 12 34 55 90 20
Y 25 65 75 59 35 25 55 18 19 20 29 10

(a) Plot the data. Do x and y seem to have a linear relationship?


(b) Use the method of least squares to find an equation for the prediction of a
student’s final exam grade based on the student’s midterm grade in the course.

C. Discuss a case study on impacts of data mining in retail industry.


2. A. Use the two methods below to normalize the following group of data:
200, 300, 400, 600, 1000
(a) min-max normalization by setting min = 0 and max = 1
(b) z-score normalization
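
As an illustration of the two normalization methods named above, here is a minimal Python sketch. It assumes the population standard deviation (dividing by n) for the z-score; if the course uses the sample standard deviation (dividing by n - 1), the z-scores will differ. The helper names `min_max` and `z_score` are hypothetical.

```python
from math import sqrt

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Standardize values using the mean and the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: population standard deviation (divide by n, not n - 1).
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

data = [200, 300, 400, 600, 1000]
print(min_max(data))
print([round(z, 3) for z in z_score(data)])
```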

B. A database has five transactions. Let min sup = 50% and min conf = 60%.
TID items bought
T100 {F, I, S, H, E, R}
T200 {G, O, A, T, S}
T300 {D, O, V, E, S}
T400 {L, I, O, N, E, S, S}
T500 {S, U, G, A, R, C, A, N, E}
Explore all frequent itemsets using FP-growth and find the strong association
rules.
C. Discuss a case study on impacts of data mining for weather forecasting.
Course Outcome / Knowledge Level: CO3 K2; CO4 K3; CO5 K2
3. A. Suppose a hospital tested the age and body fat data for randomly selected
adults with the following result:
age 23 23 27 27 39 41 47 49 50
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2

Calculate the correlation coefficient (Pearson's product moment coefficient).
Are these two variables positively or negatively correlated?
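
The correlation coefficient asked for here can be checked with a short Python sketch. It uses the standard definition r = Sxy / sqrt(Sxx * Syy) on the age and %fat values listed above; the function name `pearson_r` is hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

age = [23, 23, 27, 27, 39, 41, 47, 49, 50]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2]

r = pearson_r(age, fat)
print(round(r, 3))
print("positively correlated" if r > 0 else "negatively correlated or uncorrelated")
```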

B. A database has five transactions. Let min sup = 60% and min conf = 80%.
TID items bought
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y}
T300 {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, O, K, I, E}
Explore all frequent itemsets using FP-growth and find the strong association
rules.
C. Discuss a case study on impacts of data mining in HR.
Course Outcome / Knowledge Level: CO3 K2; CO4 K3; CO5 K2

4. A. Suppose that the data for analysis includes the attribute age. The age values
for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22,
25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Compute the mean and median of the data.
(b) Compute the mode of the data.
(c) Compute the midrange of the data.
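
The four summary statistics in parts (a) to (c) can be checked with Python's standard `statistics` module. This is only a sketch; it takes the midrange to be the average of the smallest and largest values, and it uses `multimode` because the data may have more than one mode.

```python
import statistics

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22,
        25, 25, 25, 25, 30, 33, 33, 35, 35, 35,
        35, 36, 40, 45, 46, 52, 70]

mean = statistics.mean(ages)
median = statistics.median(ages)
modes = statistics.multimode(ages)      # every value with the highest frequency
midrange = (min(ages) + max(ages)) / 2  # average of the two extremes

print(round(mean, 2), median, modes, midrange)
```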

B. A database has five transactions. Let min sup = 60% and min conf = 80%.
TID items bought
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y}
T300 {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, O, K, I, E}
(a) Find all frequent itemsets using Apriori.
(b) List all of the strong association rules (with support s and confidence c)
matching the following metarule, where X is a variable representing customers,
and item_i denotes variables representing items (e.g., “A”, “B”, etc.):
Course Outcome / Knowledge Level: CO3 K2; CO4 K3; CO5 K2

C. Discuss a case study on spatial data mining.


5. A. Suppose that the values for a given set of data are grouped into
intervals. The intervals and corresponding frequencies are as follows.
age frequency
1–5 200
5–15 450
15–20 300
20–50 1500
50–80 700
80–110 44
Using the data for age given, answer the following:
(a) Compute an approximate median value for the data.
(b) Use min-max normalization to transform the value 35 for age onto the range
[0.0, 1.0].
(c) Use z-score normalization to transform the value 35 for age, where the
standard deviation of age is 12.94 years.
(d) Use normalization by decimal scaling to transform the value 35 for age.
Course Outcome / Knowledge Level: CO3 K2; CO4 K3; CO5 K2
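
For part (a), a common way to approximate the median of grouped data is by interpolation inside the median interval: median = L1 + ((N/2 - cumulative frequency below L1) / frequency of the median interval) * width. The Python sketch below applies that formula to the age table above; it assumes values are spread uniformly within each interval, and the function name `grouped_median` is hypothetical.

```python
def grouped_median(intervals):
    """Approximate the median of grouped data by linear interpolation.

    intervals: list of (lower, upper, frequency) tuples in increasing order.
    """
    n = sum(freq for _, _, freq in intervals)
    half = n / 2
    cumulative = 0
    for lower, upper, freq in intervals:
        if cumulative + freq >= half:
            # Assumption: values are uniformly spread inside the interval.
            return lower + (half - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("no intervals given")

age_groups = [(1, 5, 200), (5, 15, 450), (15, 20, 300),
              (20, 50, 1500), (50, 80, 700), (80, 110, 44)]

approx_median = grouped_median(age_groups)
print(round(approx_median, 2))
```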
B. A database has five transactions. Let min sup = 50% and min conf = 60%.
TID items bought
T100 {F, I, S, H, E, R}
T200 {G, O, A, T, S}
T300 {D, O, V, E, S}
T400 {L, I, O, N, S}
T500 {S, U, G, A, R}
(a) Find all frequent itemsets using Apriori.
(b) List all of the strong association rules (with support s and confidence c)
matching the following metarule, where X is a variable representing customers,
and item_i denotes variables representing items (e.g., “A”, “B”, etc.):

C. Discuss a case study on impacts of data mining in smart home.


6. A.
Student ID 1 2 3 4 5 6 7 8 9 10
Marks 8 10 15 20 25 30 40 34 78 85 90
Using the data for marks given, answer the following:
(a) Compute an approximate median value for the data.
(b) Use min-max normalization to transform the marks of student ID 6 onto the
range [0.0, 1.0].
(c) Use z-score normalization to transform the marks of student ID 6.
(d) Use normalization by decimal scaling to transform the marks of student ID 6.
Course Outcome / Knowledge Level: CO3 K2; CO4 K3; CO5 K2

B. The following contingency table summarizes supermarket transaction data,
where hot dogs refers to the transactions containing hot dogs, ¬hot dogs refers to
the transactions that do not contain hot dogs, hamburgers refers to the
transactions containing hamburgers, and ¬hamburgers refers to the transactions
that do not contain hamburgers.
              hot dogs   ¬hot dogs   Σrow
hamburgers    2,000      500         2,500
¬hamburgers   1,000      1,500       2,500
Σcol          3,000      2,000       5,000

(a) Suppose that the association rule “hot dogs=> hamburgers” is mined. Given
a minimum support threshold of 25% and a minimum confidence threshold of
50%, is this association rule strong?
(b) Based on the given data, is the purchase of hot dogs independent of the
purchase of hamburgers? If not, what kind of correlation relationship exists
between the two?
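
Parts (a) and (b) come down to three ratios: support, confidence, and lift. The Python sketch below computes them under one assumed reading of the table: 2,000 transactions contain both items, 3,000 contain hot dogs, 2,500 contain hamburgers, out of 5,000 in total. Using lift as the correlation measure is also an assumption; a chi-square test is another common choice.

```python
# Counts read from the contingency table (assumed reading, see the note above).
n_total = 5000
n_hot_dogs = 3000
n_hamburgers = 2500
n_both = 2000

min_support, min_confidence = 0.25, 0.50

support = n_both / n_total        # P(hot dogs and hamburgers)
confidence = n_both / n_hot_dogs  # P(hamburgers | hot dogs)
# lift > 1: positive correlation, lift < 1: negative, lift = 1: independence
lift = support / ((n_hot_dogs / n_total) * (n_hamburgers / n_total))

is_strong = support >= min_support and confidence >= min_confidence
print(f"support={support:.2f} confidence={confidence:.3f} lift={lift:.3f}")
print("strong rule" if is_strong else "not a strong rule")
```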

C. Discuss a case study on impacts of data mining in Medical field.


7. A. Suppose that the data for analysis includes the attribute age. The age values
for the data tuples are (in increasing order) 2, 4, 6, 8, 9 , 17, 23, 34, 45, 56, 67,
78, 89, 91, 95, 96, 99,100,102, 120
(a) Compute the mean and median of the data.
(b) Compute the mode of the data.
(c) Compute the midrange of the data.

B. The following contingency table summarizes supermarket transaction data,
where hot dogs refers to the transactions containing hot dogs, ¬hot dogs refers to
the transactions that do not contain hot dogs, hamburgers refers to the
transactions containing hamburgers, and ¬hamburgers refers to the transactions
that do not contain hamburgers.
              hot dogs   ¬hot dogs   Σrow
hamburgers    4,000      1,500       5,500
¬hamburgers   6,000      4,500       10,500
Σcol          10,000     6,000       16,000
Course Outcome / Knowledge Level: CO3 K2; CO4 K3; CO5 K2

(a) Suppose that the association rule “hot dogs=> hamburgers” is mined. Given
a minimum support threshold of 25% and a minimum confidence threshold of
50%, is this association rule strong?
(b) Based on the given data, is the purchase of hot dogs independent of the
purchase of hamburgers? If not, what kind of correlation relationship exists
between the two?

C. Discuss a case study on data mining for drought monitoring.

8. A. Suppose a hospital tested the age and body fat data for 9 randomly
selected adults with the following result:
age 52 54 54 56 57 58 58 60 61
%fat 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7
Course Outcome / Knowledge Level: CO3 K2; CO4 K3; CO5 K2

Normalize the two variables based on z-score normalization.

B. The following contingency table summarizes supermarket transaction
data, where Icecream refers to the transactions containing Icecream,
¬Icecream refers to the transactions that do not contain Icecream,
Chocolate refers to the transactions containing Chocolate, and
¬Chocolate refers to the transactions that do not contain Chocolate.

             Icecream   ¬Icecream   Σrow
Chocolate    3,500      2,300       5,800
¬Chocolate   2,500      700         3,200
Σcol         6,000      3,000       9,000

(a) Suppose that the association rule “Icecream=> Chocolate” is mined. Given
a minimum support threshold of 25% and a minimum confidence threshold of
50%, is this association rule strong?
(b) Based on the given data, is the purchase of Icecream independent of the
purchase of Chocolate? If not, what kind of correlation relationship exists
between the two?

C. Discuss a case study on social impacts of data mining.


9. A. Suppose a group of 10 sales price records has been sorted as follows:
67, 78, 89, 91, 95, 96, 99,100,102, 120.

Partition them into three bins by using equal-frequency (equidepth)
partitioning.
B. A database has five transactions. Let min sup = 60% and min conf = 80%.
TID items bought
T100 {A, P, L, E}
T200 {O, R, A, N, G, E}
T300 {G, R, A, E}
T400 {J, A, C, K}
T500 {M, E, L, O, N}
Course Outcome / Knowledge Level: CO3 K2; CO4 K3; CO5 K2
(a) Find all frequent itemsets using Apriori and FP-growth, respectively.
Compare the efficiency of the two mining processes.

C. Discuss a case study on social impacts of data mining.


10. A.
Student ID 1 2 3 4 5 6 7 8 9 10
Marks 25 19 40 55 80 90 46 91 94 85 90
Using the data for marks given, answer the following:
(a) Compute an approximate median value for the data.
(b) Use min-max normalization to transform the marks of student ID 10
onto the range [0.0, 1.0].
(c) Use z-score normalization to transform the marks of student ID 10.
(d) Use normalization by decimal scaling to transform the marks of student ID 10.
Course Outcome / Knowledge Level: CO3 K2; CO4 K3; CO5 K2

B. The following table shows the midterm and final exam grades obtained for
students in a database course.
X 1 4 5 7 9 10 14 18 23 11 98 55
Y 20 60 70 56 30 20 50 13 16 15 19 20

(a) Plot the data. Do x and y seem to have a linear relationship?


(b) Use the method of least squares to find an equation for the prediction of a
student’s final exam grade based on the student’s midterm grade in the course.

C. Discuss a case study on Clustering for earth-quake study.


11. A. Suppose a group of 12 sales price records has been sorted as follows:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215
Partition them into three bins by equal-width partitioning method.
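
Equal-width partitioning splits the range [min, max] into k intervals of identical width (max - min) / k. The Python sketch below is one way to do it; it assumes each bin is closed on the left and open on the right, except the last bin, which also includes the maximum. The function name `equal_width_bins` is hypothetical.

```python
def equal_width_bins(values, k):
    """Partition values into k bins of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # Assumption: left-closed bins; the maximum falls into the last bin.
        index = min(int((v - lo) / width), k - 1)
        bins[index].append(v)
    return width, bins

prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
width, bins = equal_width_bins(prices, 3)
print(width)
for i, b in enumerate(bins, start=1):
    print(f"Bin {i}: {b}")
```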

B. The following table shows the midterm and final exam grades obtained for
students in a database course.
(a) Plot the data. Do x and y seem to have a linear relationship?
(b) Use the method of least squares to find an equation for the prediction of a
student’s final exam grade based on the student’s midterm grade in the course.
Course Outcome / Knowledge Level: CO3 K2; CO4 K3; CO5 K2
X Mid 72 50 81 74 94 86 59 83 65 33 88 81
Y Final 84 63 77 78 90 75 49 79 77 52 74 90
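
For part (b), the least-squares line y = w0 + w1 * x has slope w1 = Sxy / Sxx and intercept w0 = mean(y) - w1 * mean(x). The following Python sketch applies these formulas to the midterm and final grades above; the function name `least_squares` is hypothetical, and the plot in part (a) is left out.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line y = w0 + w1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

midterm = [72, 50, 81, 74, 94, 86, 59, 83, 65, 33, 88, 81]
final = [84, 63, 77, 78, 90, 75, 49, 79, 77, 52, 74, 90]

w0, w1 = least_squares(midterm, final)
print(f"final = {w0:.3f} + {w1:.4f} * midterm")
```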

C. Discuss a case study on Text mining.


12. A. Use the two methods below to normalize the following group of data:
200, 300, 400, 600, 800, 1000
(a) min-max normalization by setting min = 0 and max = 1
(b) z-score normalization

B. A database has five transactions. Let min sup = 50% and min conf = 60%.
TID items bought
T100 {F,I,S,H,E,R}
T200 { G,O,A,T,S}
T300 {D, O, V, E,S}
T400 {L, I,O, N, E,S,S }
T500 {S, U,G,A,R,C,A,N,E}
(a) Find all frequent itemsets using Apriori and find the strong
association rules

C. Discuss a case study on impacts of data mining for weather forecasting


13. A. Suppose a hospital tested the age and body fat data for randomly selected
adults with the following result:
age 23 23 27 27 39 41 47 49 50
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2

Normalize the two variables based on min-max normalization.

B. A database has five transactions. Let min sup = 60% and min conf = 80%.
TID items bought
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y }
T300 {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, O, K, I ,E}
Explore all frequent itemsets using Apriori and find the strong association rules
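
The level-wise idea behind Apriori (count candidates, keep the frequent ones, join them into larger candidates, prune any candidate with an infrequent subset) can be sketched in Python as follows. This is a compact illustration rather than an optimized implementation; the function names are hypothetical, and the minimum support is passed as a count (3 of 5 transactions, i.e. 60%).

```python
from itertools import combinations

# Each transaction is a set of single-letter items; duplicates collapse.
transactions = [set("MONKEY"), set("DONKEY"), set("MAKE"),
                set("MUCKY"), set("COOKIE")]

def apriori(transactions, min_count):
    """Return {frozenset: support count} for all frequent itemsets."""
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent, k = {}, 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        joined = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {c for c in joined
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
    return frequent

def strong_rules(frequent, min_conf):
    """Return (antecedent, consequent, confidence) for rules meeting min_conf."""
    rules = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                confidence = count / frequent[antecedent]
                if confidence >= min_conf:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules

frequent = apriori(transactions, min_count=3)
rules = strong_rules(frequent, min_conf=0.8)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
for a, c, conf in rules:
    print(sorted(a), "=>", sorted(c), f"conf={conf:.2f}")
```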

C. Discuss a case study on social impacts of data mining.


14. A. Suppose a group of 10 sales price records has been sorted as follows: 67,
78, 89, 91, 95, 96, 99,100,102, 120.
Partition them into three bins using equal-frequency (equidepth) partitioning
method.

B. A random sample of 435 people were surveyed and each person was asked
to report the highest education level they obtained. Is gender independent
of education level? The data that resulted from the survey is summarized in
the following table:

         High School   Bachelors   Masters   Ph.D.   Total
Female   55            65          60        46      226
Male     45            44          60        60      209
Total    100           109         120       106     435

Are gender and education level dependent at 1% level of significance? In other
words, given the data collected above, is there a relationship between the
gender of an individual and the level of education that they have obtained?

C. Explore a case study on data mining for healthcare.


15. A. Suppose a hospital tested the age and body fat data for randomly selected
adults with the following result:
age 23 23 27 27 39 41 47 49 50
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2

Normalize the two variables based on z-score normalization.

B. Using the data set for predicting borrowers who will default on loan
payments,

Tid  Home owner  Marital status  Annual income  Defaulted Borrower
1 yes Single 125k No
2 no Married 110k No
3 no Single 80k No
4 yes Married 130k No
5 no Divorced 100k Yes
6 no Married 65k No
7 yes Divorced 225k No
8 no Single 90k Yes
9 no Married 80k No
10 no Single 95k Yes

Classify the borrowers using Bayes’ rule based on the following borrowers.
1. { Homeowner=yes, marital status=single, Annual Income=300k}
2. { Homeowner=yes, marital status=married, Annual Income=50k}

C. Cluster the following ten points (with (x, y) representing locations) into three
clusters: A1(10, 9), A2(4, 9), A3(6, 6), A4(9, 7), A5(6, 5), A6(9, 4), A7(1, 9), A8(5, 9),
A9(9, 2), A10(10, 2), using the K-means algorithm.

16. A. Suppose a hospital tested the age and body fat data for randomly selected
adults with the following result:
age 23 23 27 27 39 41 47 49 50
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2

Calculate the correlation coefficient (Pearson’s product moment coefficient).
Are these two variables positively or negatively correlated?

B. Consider a car theft dataset with the attributes Color, Type, and Origin, and
the class label Stolen, which can be either yes or no.
Example Color Type Origin Stolen?
1 Red Sports Domestic Yes
2 Red Sports Domestic No
3 Red Sports Domestic Yes
4 Yellow Sports Domestic No
5 Yellow Sports Imported Yes
6 Yellow SUV Imported No
7 Yellow SUV Imported Yes
8 Yellow SUV Domestic No
9 Red SUV Imported No
10 Red Sports Imported Yes
11 Blue Sports Imported Yes
12 Green SUV Domestic No

Classify the cars using Bayes’ rule based on the following features
1. {Blue Domestic Sports}
2. {Red Imported SUV}
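
The task says only "Bayes' rule"; the Python sketch below assumes the naive Bayes classifier, which treats the attributes as conditionally independent given the class and scores each class as P(class) * product of P(attribute value | class). Probabilities are raw frequencies with no smoothing, so a value never seen with a class drives that class's score to zero. The function name `naive_bayes_scores` is hypothetical.

```python
from collections import Counter

# (Color, Type, Origin, Stolen?) rows copied from the table above.
rows = [
    ("Red", "Sports", "Domestic", "Yes"), ("Red", "Sports", "Domestic", "No"),
    ("Red", "Sports", "Domestic", "Yes"), ("Yellow", "Sports", "Domestic", "No"),
    ("Yellow", "Sports", "Imported", "Yes"), ("Yellow", "SUV", "Imported", "No"),
    ("Yellow", "SUV", "Imported", "Yes"), ("Yellow", "SUV", "Domestic", "No"),
    ("Red", "SUV", "Imported", "No"), ("Red", "Sports", "Imported", "Yes"),
    ("Blue", "Sports", "Imported", "Yes"), ("Green", "SUV", "Domestic", "No"),
]

def naive_bayes_scores(rows, query):
    """Return {class: P(class) * prod P(value | class)} for a (Color, Type, Origin) query."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(class)
        for index, value in enumerate(query):
            matches = sum(1 for row in rows if row[-1] == label and row[index] == value)
            score *= matches / n_label  # likelihood P(value | class), unsmoothed
        scores[label] = score
    return scores

for query in [("Blue", "Sports", "Domestic"), ("Red", "SUV", "Imported")]:
    scores = naive_bayes_scores(rows, query)
    print(query, scores, "->", max(scores, key=scores.get))
```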

C. Cluster the following ten points (with (x, y) representing locations)
into three clusters:
A1(3, 10), A2(4, 5), A3(7, 6), A4(5, 7), A5(6, 5), A6(6, 6), A7(1, 2),
A8(5, 9), A9(8, 2), A10(10, 2), using the K-means algorithm.
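
K-means alternates between assigning each point to its nearest centre and moving each centre to the mean of its assigned points, until the assignment stops changing. The task does not give the initial centres, so the Python sketch below assumes A1, A4, and A7 as starting centres and Euclidean distance; a different start can lead to different clusters. The function name `kmeans` is hypothetical.

```python
def kmeans(points, initial_centers, max_iter=100):
    """Plain K-means on 2-D points; returns (labels, centers)."""
    centers = list(initial_centers)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centre (squared Euclidean distance).
        new_labels = [
            min(range(len(centers)),
                key=lambda j: (p[0] - centers[j][0]) ** 2 + (p[1] - centers[j][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centre if a cluster becomes empty
                centers[j] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return labels, centers

points = [(3, 10), (4, 5), (7, 6), (5, 7), (6, 5),
          (6, 6), (1, 2), (5, 9), (8, 2), (10, 2)]
# Assumption: A1, A4 and A7 are taken as the initial centres.
labels, centers = kmeans(points, [points[0], points[3], points[6]])
for j, center in enumerate(centers):
    members = [f"A{i + 1}" for i, label in enumerate(labels) if label == j]
    print(f"Cluster {j + 1}: centre={center} members={members}")
```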

17. A. Use the two methods below to normalize the following group of data:
150, 350, 400, 650, 850, 900, 1000
(a) min-max normalization by setting min = 0 and max = 1
(b) z-score normalization

B. A random sample of 435 people were surveyed and each person was asked
to report the highest education level they obtained. Is gender independent of
education level? The data that resulted from the survey is summarized in the
following table:

         High School   Bachelors   Masters   Ph.D.   Total
Female   55            65          60        46      226
Male     45            44          60        60      209
Total    100           109         120       106     435

Are gender and education level dependent at 1% level of significance? In other
words, given the data collected above, is there a relationship between the
gender of an individual and the level of education that they have obtained?
Note: Use chi-square test
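
The chi-square test of independence compares each observed count with the count expected under independence, (row total * column total) / grand total, and sums (observed - expected)^2 / expected over all cells. The Python sketch below computes the statistic for the survey table. The critical value 11.345 for 3 degrees of freedom at the 1% level is taken from a standard chi-square table; the function name `chi_square` is hypothetical.

```python
def chi_square(observed):
    """Return (statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof

# Rows: Female, Male. Columns: High School, Bachelors, Masters, Ph.D.
survey = [[55, 65, 60, 46],
          [45, 44, 60, 60]]

CRITICAL_1PCT_DF3 = 11.345  # chi-square table value for alpha = 0.01, df = 3

statistic, dof = chi_square(survey)
print(f"chi-square = {statistic:.3f}, df = {dof}")
if statistic > CRITICAL_1PCT_DF3:
    print("Reject independence at the 1% level")
else:
    print("Cannot reject independence at the 1% level")
```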

C. Explore a case study on data mining for healthcare.

Faculty    Course Coordinator    HoD, CSE
