(COMP1942) (2022) (S) Midterm Thliai 91588
Instructions:
(1) Guideline
(a) Please follow all exam guidelines (e.g., capturing your face on video) stated on
the Canvas website.
(b) For the sake of space, we do not write them again.
(2) Question
(a) There are 2 parts in this exam, Part A (Short/Long Question) and Part B (Multiple-Choice
Question).
(b) Please answer all questions in Part A and Part B. The total score for this exam is 100.
(3) Answer Sheet
(a) Please submit your answers in PDF to the Canvas website.
(b) Please use the cover page stated in the Canvas website as the first page of your PDF file. This
cover page includes your information and an agreement with your signature.
(c) Please write your answers starting on the second page of your PDF file.
(d) The PDF file should “clearly” show your answers without any blurred images. No marks will be
given to any “blurred” parts in the PDF file. Please make sure that the PDF file shows your
answers clearly.
(4) Online Exam
(a) This is an online exam in which you may access all online materials.
However, you are not allowed to communicate with other people (except the instructor and the tutors
in this course) in any form (including but not limited to orally, electronically and in writing)
during the entire exam period, including the 15-minute preparation time before the exam and the
15-minute buffer time after it.
(5) File Submission
(a) We allow a 15-minute buffer for your PDF file upload. Remember to start uploading your file at around
11:40am. The upload may take at most 15 minutes: Canvas will terminate any file upload that is
still in progress at 11:55am.
(6) Zero-Score Regulation
(a) If your face is not shown in your video for at least 10 seconds during the exam period (including
the 15-minute preparation time before the exam and the 15-minute buffer time after it), your exam score will
be set to 0 (even if you submit your PDF file in Canvas).
(b) If you do not submit the first cover page which is filled and signed completely, your exam score
will be set to 0.
(c) We only mark the latest PDF file you uploaded by 11:55am. Your exam score will be set to 0 if we
cannot see any PDF file uploaded by 11:55am (even if you worked on the question paper or you
“could” upload your PDF file after 11:55am).
COMP1942 Question Paper
We are given the following table containing 20 transactions and 16 items, namely a, b, … p, represented in a
binary matrix format. Please do the following parts.
TID a b c d e f g h i j k l m n o p
1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0
2 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0
3 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0
4 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
5 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0
6 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
7 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0
8 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
11 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0
12 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0
13 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
14 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
15 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
16 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0
17 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0
18 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
20 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1
(a) For each of the following answers, please show steps and show the answer rounded up to 2 decimal places.
(i) What is the confidence of rule “{a, c} → b”?
(ii) What is the lift ratio of rule “{a, c} → b”?
(iii) What is the support of rule “{a, c} → b”?
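The three measures in part (a) can be checked mechanically. The following is a minimal Python sketch, not a model answer. It assumes the usual definitions (confidence = support count of LHS ∪ RHS divided by support count of LHS, lift = confidence divided by the relative support of RHS) and reports the rule's support both as a count and as a fraction, because the paper does not state which convention is expected. The names TRANSACTIONS, support_count and rule_metrics are illustrative only.

```python
from fractions import Fraction

# Transactions transcribed from the binary matrix above (TID -> items present).
TRANSACTIONS = {
    1: "acfh", 2: "abcl", 3: "fj", 4: "bd", 5: "acf",
    6: "abc", 7: "bgj", 8: "b", 9: "ef", 10: "ef",
    11: "abcm", 12: "fk", 13: "fm", 14: "bg", 15: "cn",
    16: "acfi", 17: "co", 18: "abc", 19: "f", 20: "flp",
}


def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for items in TRANSACTIONS.values() if wanted <= set(items))


def rule_metrics(lhs, rhs):
    """Support, confidence and lift of the rule lhs -> rhs (assumed definitions)."""
    n = len(TRANSACTIONS)
    both = support_count(set(lhs) | set(rhs))
    confidence = Fraction(both, support_count(lhs))
    lift = confidence / Fraction(support_count(rhs), n)
    return {
        "support_count": both,
        "support": Fraction(both, n),
        "confidence": confidence,
        "lift": lift,
    }


metrics = rule_metrics("ac", "b")
for name, value in metrics.items():
    print(f"{name}: {float(value):.2f}")
```

Exact fractions are used so that rounding happens only once, when the values are printed.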
(b) Suppose that the support threshold is set to 3.
Apply the algorithm of FP-growth and generate all the conditional FP-trees.
You are required to draw the original FP-tree and all conditional FP-trees.
What are the frequent itemsets generated?
You do not need to give the frequency of each frequent itemset.
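The first pass of FP-growth for part (b) can be sketched as follows. This is only an illustration of the preprocessing step under common textbook assumptions: items below the support threshold are dropped, and the remaining items of each transaction are ordered by descending frequency before being inserted into the FP-tree. Tree construction and the conditional FP-trees are not shown. The names TRANSACTIONS, MIN_SUPPORT, order and ordered are illustrative only.

```python
from collections import Counter

# Transactions transcribed from the binary matrix above, in TID order 1..20.
TRANSACTIONS = [
    "acfh", "abcl", "fj", "bd", "acf", "abc", "bgj", "b", "ef", "ef",
    "abcm", "fk", "fm", "bg", "cn", "acfi", "co", "abc", "f", "flp",
]
MIN_SUPPORT = 3  # support threshold given in part (b)

# Pass 1: count every item and keep only the frequent ones.
counts = Counter(item for t in TRANSACTIONS for item in t)
frequent = {item: c for item, c in counts.items() if c >= MIN_SUPPORT}

# Order frequent items by descending count (ties broken alphabetically).
order = sorted(frequent, key=lambda item: (-frequent[item], item))

# Pass 2: rewrite each transaction using only frequent items, in that order.
ordered = [[item for item in order if item in t] for t in TRANSACTIONS]
ordered = [t for t in ordered if t]  # drop transactions left empty

# Identical ordered transactions share one path in the FP-tree.
path_counts = Counter(tuple(t) for t in ordered)

print("item order:", order)
for path, c in sorted(path_counts.items()):
    print(path, c)
```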
Q1 (20 Marks) (Version B)
We are given the following table containing 20 transactions and 16 items, namely g, h, … v, represented in a
binary matrix format. Please do the following parts.
TID g h i j k l m n o p q r s t u v
1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0
2 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0
3 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0
4 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
5 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0
6 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
7 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0
8 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
11 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0
12 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0
13 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
14 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
15 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
16 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0
17 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0
18 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
19 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
20 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1
(a) For each of the following answers, please show steps and show the answer rounded up to 2 decimal places.
(i) What is the confidence of rule “{g, i} → h”?
(ii) What is the lift ratio of rule “{g, i} → h”?
(iii) What is the support of rule “{g, i} → h”?
(b) Suppose that the support threshold is set to 3.
Apply the algorithm of FP-growth and generate all the conditional FP-trees.
You are required to draw the original FP-tree and all conditional FP-trees.
What are the frequent itemsets generated?
You do not need to give the frequency of each frequent itemset.
(a) The following shows the error report for a decision tree found from a given training dataset.
Class # Cases # Errors % Error
Yes 5 2 40%
No 8 3 37.5%
Overall 13 5 38.46%
(i) Is it possible to know the confusion matrix for this decision tree according to this error report? If yes,
please give the confusion matrix. If no, please explain it and give the minimum set of additional
information so that we could give the confusion matrix.
(ii) Is it possible to know the decile-wise lift chart for this decision tree according to this error report? If
yes, please give the decile-wise lift chart and write down the height of each bar at the top of the bar in
the chart clearly (rounded up to 2 decimal places). If no, please explain it and give the minimum set
of additional information so that we could give the decile-wise lift chart.
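For the error report in part (a), the relationship between per-class error counts and confusion-matrix cells can be sketched in Python. This is a hedged illustration, not a model answer: it assumes a two-class problem in which every misclassified record of one class is predicted as the other class. The function name confusion_from_error_report and the dictionary layout are illustrative only.

```python
def confusion_from_error_report(report):
    """Build {actual: {predicted: count}} from {class: (cases, errors)}.

    Assumes exactly two classes, so each error of one class must have
    been predicted as the other class.
    """
    if len(report) != 2:
        raise ValueError("this sketch only handles two classes")
    (c1, (n1, e1)), (c2, (n2, e2)) = report.items()
    return {
        c1: {c1: n1 - e1, c2: e1},
        c2: {c1: e2, c2: n2 - e2},
    }


# Figures copied from the error report above: class -> (# cases, # errors).
REPORT = {"Yes": (5, 2), "No": (8, 3)}

matrix = confusion_from_error_report(REPORT)
total_cases = sum(cases for cases, _ in REPORT.values())
total_errors = sum(errors for _, errors in REPORT.values())
overall_error = total_errors / total_cases

for actual, row in matrix.items():
    print(actual, row)
print(f"overall error: {overall_error:.2%}")
```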
(b) Consider the following table where the first three columns correspond to the input attributes and the fourth
column corresponds to the target attribute.
Age Education Married Insurance
young high no yes
old high yes yes
old low yes yes
old low yes yes
young low no no
young low no no
young low no no
old low no no
We want to train a CART decision tree classifier to predict whether a new customer will buy an insurance
policy or not. We define the value of attribute Insurance as the label of a record.
(i) Please find a CART decision tree according to the above example. In the decision tree, whenever we
process (1) a node in which at least 80% of the records have the same label or (2) a node containing at most
2 records, we stop splitting this node.
Please show all of your steps and express the numbers rounded up to 4 decimal places.
(ii) Consider an old unmarried customer with high education. Please predict whether it is likely that this
customer will buy an insurance policy or not.
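The root-level split evaluation in part (b)(i) can be sketched in Python. This is a minimal sketch under assumptions, not a model answer: it assumes that CART scores a split by the Gini impurity of the parent minus the size-weighted Gini impurity of the children, and it evaluates only the root node (the stopping rules and recursion are not implemented). The names RECORDS, gini and gini_gain are illustrative only.

```python
from collections import Counter
from fractions import Fraction

ATTRIBUTES = ("Age", "Education", "Married")

# Rows copied from the table above; the last column is the label (Insurance).
RECORDS = [
    ("young", "high", "no", "yes"),
    ("old", "high", "yes", "yes"),
    ("old", "low", "yes", "yes"),
    ("old", "low", "yes", "yes"),
    ("young", "low", "no", "no"),
    ("young", "low", "no", "no"),
    ("young", "low", "no", "no"),
    ("old", "low", "no", "no"),
]


def gini(records):
    """Gini impurity 1 - sum(p_label^2) of the labels in `records`."""
    n = len(records)
    counts = Counter(r[-1] for r in records)
    return 1 - sum(Fraction(c, n) ** 2 for c in counts.values())


def gini_gain(records, column):
    """Parent impurity minus size-weighted impurity after splitting on `column`."""
    groups = {}
    for r in records:
        groups.setdefault(r[column], []).append(r)
    weighted = sum(
        Fraction(len(g), len(records)) * gini(g) for g in groups.values()
    )
    return gini(records) - weighted


gains = {name: gini_gain(RECORDS, i) for i, name in enumerate(ATTRIBUTES)}
for name, gain in gains.items():
    print(f"{name}: {float(gain):.4f}")
```

Every attribute in the table is binary, so splitting on an attribute's values is already a binary split as CART requires.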
Q2 (20 Marks) (Version B)
(a) The following shows the error report for a decision tree found from a given training dataset.
Class # Cases # Errors % Error
Yes 5 2 40%
No 8 3 37.5%
Overall 13 5 38.46%
(i) Is it possible to know the confusion matrix for this decision tree according to this error report? If yes,
please give the confusion matrix. If no, please explain it and give the minimum set of additional
information so that we could give the confusion matrix.
(ii) Is it possible to know the decile-wise lift chart for this decision tree according to this error report? If
yes, please give the decile-wise lift chart and write down the height of each bar at the top of the bar in
the chart clearly (rounded up to 2 decimal places). If no, please explain it and give the minimum set
of additional information so that we could give the decile-wise lift chart.
(b) Consider the following table where the first three columns correspond to the input attributes and the fourth
column corresponds to the target attribute.
Age Education Gender Insurance
young high male yes
old high female yes
old low female yes
old low female yes
young low male no
young low male no
young low male no
old low male no
We want to train a CART decision tree classifier to predict whether a new customer will buy an insurance
policy or not. We define the value of attribute Insurance as the label of a record.
(i) Please find a CART decision tree according to the above example. In the decision tree, whenever we
process (1) a node in which at least 80% of the records have the same label or (2) a node containing at most
2 records, we stop splitting this node.
Please show all of your steps and express the numbers rounded up to 4 decimal places.
(ii) Consider an old male customer with high education. Please predict whether it is likely that this
customer will buy an insurance policy or not.
Note: Please write the letter for each answer (i.e., A, B, C, D or E) clearly so that it can be distinguished
from other letters easily. In the past, some students wrote letters so unclearly that they could be read as two
different letters. For example, the hand-written letter “B” (from some students) looks similar to the hand-written
letter “E”. There are more examples which are not included here. In any case, if we judge your letter
to be unclear, 0 marks will be given to you for that question, even if you “thought” that your answer was
correct.
Part B
Q3. Which of the following statement(s) is/are true? Consider the Apriori approach.
(1) It is always true that the number of itemsets in L3 just after the counting step is smaller than the
number of itemsets in C3 just after the prune step.
(2) It is always true that the number of itemsets in C3 just after the prune step is smaller than the number
of itemsets in C3 just after the join step.
(3) It is always true that the number of itemsets in C3 just after the join step is smaller than the number
of itemsets in C2 just after the join step.
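The join and prune steps that Q3 refers to can be sketched in Python. This is a hedged illustration of the standard Apriori candidate generation, assuming itemsets are kept as sorted tuples. The input L2 below is a made-up example, not data from this paper, and the function names join_step and prune_step are illustrative only.

```python
from itertools import combinations


def join_step(prev_frequent):
    """Join pairs of (k-1)-itemsets that share their first k-2 items."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] == q[:-1]:
                candidates.append(p + (q[-1],))
    return candidates


def prune_step(candidates, prev_frequent):
    """Drop candidates that have an infrequent (k-1)-subset."""
    prev = {tuple(sorted(s)) for s in prev_frequent}
    return [
        c for c in candidates
        if all(sub in prev for sub in combinations(c, len(c) - 1))
    ]


# Hypothetical L2, chosen only to show one candidate surviving and one pruned.
L2 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]

c3_after_join = join_step(L2)
c3_after_prune = prune_step(c3_after_join, L2)

print("C3 after join: ", c3_after_join)
print("C3 after prune:", c3_after_prune)
```

In this example the candidate (b, c, d) is removed by the prune step because its subset (c, d) is not in L2.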
Q6. In XLMiner, given a table T, we set some parameters and generated the following output in association
rule mining.
Q7. Which of the following statement(s) is/are true?
(1) Consider the original k-means method. The mean of a cluster is equal to the sum of all data points
in this cluster divided by the total number of all data points in this cluster.
(2) Compared with the original k-means method, the advantage of sequential k-means method is that
we could obtain the clustering results whenever there is a new point.
(3) Consider the forgetful sequential k-means method where parameter a is set to a real number greater
than 0 and smaller than 1. It is always true that the weight of an old data point is smaller than the
weight of a new data point.
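The three mean computations mentioned in Q7 can be compared with a short Python sketch. It assumes the common textbook update rules: the batch mean is the sum of the points divided by their number, sequential k-means moves the mean by (x - m)/n when the n-th point arrives, and forgetful sequential k-means moves it by a(x - m) with the parameter a from statement (3). The course's exact formulation may differ, and the function names and sample points are illustrative only.

```python
def batch_mean(points):
    """Mean of a cluster: component-wise sum divided by the number of points."""
    n = len(points)
    return tuple(sum(component) / n for component in zip(*points))


def sequential_update(mean, count, x):
    """Sequential k-means: fold one new point into a running mean."""
    count += 1
    mean = tuple(m + (xi - m) / count for m, xi in zip(mean, x))
    return mean, count


def forgetful_update(mean, x, a):
    """Forgetful sequential k-means: move the mean a fraction `a` toward x.

    After n updates, the point used in update i carries weight
    a * (1 - a) ** (n - i) in the resulting mean.
    """
    return tuple(m + a * (xi - m) for m, xi in zip(mean, x))


# Hypothetical 2-D points, not taken from the paper.
points = [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)]

mean, count = points[0], 1
for x in points[1:]:
    mean, count = sequential_update(mean, count, x)

forgetful = points[0]
for x in points[1:]:
    forgetful = forgetful_update(forgetful, x, a=0.5)

print("batch mean:     ", batch_mean(points))
print("sequential mean:", mean)
print("forgetful mean: ", forgetful)
```

With these points, the sequential mean agrees with the batch mean, while the forgetful mean is pulled toward the most recent point.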
Consider the agglomerative approach to group these points with distance group average linkage.
Q9. In XLMiner, given a table T, in Raymond’s PC, we set some parameters and generated the following
output in k-means clustering.
Which of the following statement(s) is/are true?
(1) Consider the two clusters in the final output. Before we perform the k-means clustering, the initial
mean of one cluster is (57.8333333, 73.5) and the initial mean of another cluster is (10.25, 11).
(2) In XLMiner’s input dialog box, we chose “Fixed Start” under category “Options”.
(3) Suppose that student “Peter” set the same parameters as shown above in his PC and generated the
output in k-means clustering. Due to the randomness of k-means clustering, it is possible that the
clustering result in this output obtained from his PC is different from the clustering result in the
output obtained from Raymond’s PC.
Q10. Consider the following table T with 4 records and 5 attributes, namely X1, X2, …, X5.
Record No. X1 X2 X3 X4 X5
1 1 1 0 0 1
2 1 0 0 1 1
3 0 1 1 1 0
4 0 1 1 0 1
Q11. Consider the following 2 matrices, namely A and B.
A = [ 10  20 ]
    [ 30  40 ]

B = [ 7 ]
    [ 8 ]
Q13. Which of the following statement(s) is/are true? Consider two clusters, namely A and B.
(1) It is possible that the distance between Cluster A and Cluster B under the single linkage is equal to
the distance between Cluster A and Cluster B under the complete linkage.
(2) It is possible that the distance between Cluster A and Cluster B under the median linkage is equal
to the distance between Cluster A and Cluster B under the centroid linkage.
(3) The agglomerative approach is a process of splitting the large cluster into two clusters iteratively.
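The inter-cluster distances named in Q13 can be compared with a small Python sketch. It assumes Euclidean distance and the usual definitions of single, complete, group average and centroid linkage. Median linkage is omitted because it depends on the merge history rather than only on the two point sets. The function names and the sample clusters are illustrative only, not data from this paper.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage(a, b):
    """Smallest distance between a point of a and a point of b."""
    return min(dist(p, q) for p, q in product(a, b))


def complete_linkage(a, b):
    """Largest distance between a point of a and a point of b."""
    return max(dist(p, q) for p, q in product(a, b))


def group_average_linkage(a, b):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def centroid_linkage(a, b):
    """Distance between the two cluster means."""
    def centroid(points):
        return tuple(sum(c) / len(points) for c in zip(*points))
    return dist(centroid(a), centroid(b))


# Hypothetical clusters: two points each, then one point each.
A, B = [(0, 0), (0, 2)], [(3, 0), (3, 2)]
A1, B1 = [(0, 0)], [(3, 4)]

for name, fn in [
    ("single", single_linkage),
    ("complete", complete_linkage),
    ("group average", group_average_linkage),
    ("centroid", centroid_linkage),
]:
    print(f"{name:14s} {fn(A, B):.4f}   singletons: {fn(A1, B1):.4f}")
```

When each cluster contains a single point there is only one cross-cluster pair, so all four measures coincide.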
Q14. Given a table T containing 3 input attributes (i.e., “No. of Phones”, “Age”, “Weight”) and 1 target
attribute “Insurance”, we want to predict whether a customer will buy an insurance policy. In XLMiner,
given this table T, we set some parameters and generated the following output in the classification tree.
End of Paper