
ITDSIU21030 -lab6

Nguyen Duy Phuc

Introduction to Data Mining


Lab 6: Putting it all together

5.1. The data mining process

In the fifth class we look at some more global issues around the data mining process (see the class 5 lecture by Ian H. Witten [1], http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/). We go through the data mining process, pitfalls and pratfalls, data mining and ethics, and association-rule learners.

According to [1], the data mining process consists of the following steps: ask a question, gather the data, clean the data, define new features, and deploy the result. Write down a brief description of each step (a small Weka workflow sketch follows the list):

- Ask a question

o Purpose: Clearly define the objective or problem you want to solve.

o Example: What factors contribute most to customer churn?

o This step ensures the data mining process is guided by a clear goal.

- Gather Data

o Purpose: Collect relevant data from various sources that can help address the question.

o Example: Data from CRM systems, transaction logs, or external databases.

o Ensure you have sufficient and relevant data for analysis.

- Clean the Data

o Purpose: Handle missing values, remove outliers, and correct inconsistencies.

o Example: Replacing null values, standardizing formats, or filtering noise.

o Cleaning improves the reliability and accuracy of the data.

- Define New Features

o Purpose: Create new variables or transformations that better represent the problem.

o Example: Converting dates to "days since last purchase" or normalizing numerical values.

o Helps extract useful patterns and enhances model performance.


- Deploy the Result

o Purpose: Apply the insights gained to solve the problem or inform decisions.

o Example: Using the model to predict customer churn or recommend products.

o Finalizes the process by providing actionable outcomes.
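The workflow above can be sketched with the Weka Java API roughly as follows. This is a minimal illustration, not the procedure from the lecture: the file name and the choice of J48 as the classifier are assumptions.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.ReplaceMissingValues;

    public class MiningWorkflow {
        public static void main(String[] args) throws Exception {
            // Ask a question / gather data: predict "play" from the weather data (file name assumed)
            Instances data = DataSource.read("weather-nominal.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Clean the data: fill in missing values with attribute modes/means
            ReplaceMissingValues clean = new ReplaceMissingValues();
            clean.setInputFormat(data);
            Instances cleaned = Filter.useFilter(data, clean);

            // (Defining new features would be another filter step; see the KDD sketch below.)

            // Build and evaluate a model with 10-fold cross-validation
            Evaluation eval = new Evaluation(cleaned);
            eval.crossValidateModel(new J48(), cleaned, 10, new Random(1));
            System.out.println(eval.toSummaryString());

            // Deploy the result: train on all the data and classify a new (here: the first) instance
            J48 model = new J48();
            model.buildClassifier(cleaned);
            double label = model.classifyInstance(cleaned.instance(0));
            System.out.println("Predicted class: " + cleaned.classAttribute().value((int) label));
        }
    }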

Alternatively, according to Han and Kamber (2011), the data mining process is treated as a knowledge discovery (KDD) process consisting of an iterative sequence of seven steps. Please list them all below (a small data-preparation sketch follows the list):

- Data Cleaning

o Removing noise, handling missing values, and resolving inconsistencies in the data.

o Objective: Ensure the quality and reliability of the data.

- Data Integration

o Combining data from multiple sources into a coherent data set.

o Objective: Unify data to create a consistent dataset for analysis.

- Data Selection

o Selecting relevant data for the analysis task.

o Objective: Focus on data that is pertinent to the problem or question.

- Data Transformation

o Converting data into suitable formats or generating new features (e.g., normalization,
aggregation).

o Objective: Prepare the data for efficient and accurate mining.

- Data Mining

o Applying algorithms to identify patterns or extract knowledge.

o Objective: Discover hidden insights or useful patterns in the data.

- Pattern Evaluation

o Identifying the most interesting patterns that represent knowledge.

o Objective: Focus on patterns that are meaningful, valid, and actionable.

- Knowledge Presentation


o Visualizing and presenting the mined knowledge using reports, charts, or other tools.

o Objective: Make the results comprehensible and usable for decision-making.
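As an illustration of the selection and transformation steps, here is a minimal sketch using Weka's unsupervised filters; the input file name and the removed attribute index are assumptions for illustration, not part of the lab.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Normalize;
    import weka.filters.unsupervised.attribute.Remove;

    public class KddPreparation {
        public static void main(String[] args) throws Exception {
            // Assumed input file with an ID column in position 1
            Instances raw = DataSource.read("customers.arff");

            // Data selection: drop an attribute that is irrelevant to the mining task
            Remove select = new Remove();
            select.setAttributeIndices("1");
            select.setInputFormat(raw);
            Instances selected = Filter.useFilter(raw, select);

            // Data transformation: scale all numeric attributes to the range [0, 1]
            Normalize transform = new Normalize();
            transform.setInputFormat(selected);
            Instances prepared = Filter.useFilter(selected, transform);

            System.out.println(prepared.numInstances() + " instances prepared for the mining step");
        }
    }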

5.2. Pitfalls and pratfalls


Follow the lecture in [1] to learn what pitfalls and pratfalls are in data mining.

Do experiments to investigate how OneR and J48 deal with missing values.

Write down the results in the following table:

Dataset: weather-nominal.arff (original)

OneR’s classifier model and performance:

=== Classifier model (full training set) ===

outlook:
	sunny -> no
	overcast -> yes
	rainy -> yes
(10/14 instances correct)

Time taken to build model: 0 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances          6               42.8571 %
Incorrectly Classified Instances        8               57.1429 %
Kappa statistic                        -0.1429
Mean absolute error                     0.5714
Root mean squared error                 0.7559
Relative absolute error               120      %
Root relative squared error           153.2194 %
Total Number of Instances              14

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC     ROC Area  PRC Area  Class
               0.444    0.600    0.571      0.444   0.500      -0.149  0.422     0.611     yes
               0.400    0.556    0.286      0.400   0.333      -0.149  0.422     0.329     no
Weighted Avg.  0.429    0.584    0.469      0.429   0.440      -0.149  0.422     0.510

=== Confusion Matrix ===

 a b   <-- classified as
 4 5 | a = yes
 3 2 | b = no

J48’s classifier model and performance:

J48 pruned tree
------------------

outlook = sunny
|   humidity = high: no (3.0)
|   humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves : 5

Size of the tree : 8

Time taken to build model: 0 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances          7               50      %
Incorrectly Classified Instances        7               50      %
Kappa statistic                        -0.0426
Mean absolute error                     0.4167
Root mean squared error                 0.5984
Relative absolute error                87.5    %
Root relative squared error           121.2987 %
Total Number of Instances              14

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC     ROC Area  PRC Area  Class
               0.556    0.600    0.625      0.556   0.588      -0.043  0.633     0.758     yes
               0.400    0.444    0.333      0.400   0.364      -0.043  0.633     0.457     no
Weighted Avg.  0.500    0.544    0.521      0.500   0.508      -0.043  0.633     0.650

=== Confusion Matrix ===

 a b   <-- classified as
 5 4 | a = yes
 3 2 | b = no

Dataset: weather-nominal.arff (with missing values)

OneR’s classifier model and performance:

=== Classifier model (full training set) ===

outlook:
	sunny -> yes
	overcast -> yes
	rainy -> yes
	? -> no
(13/14 instances correct)

Time taken to build model: 0 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         13               92.8571 %
Incorrectly Classified Instances        1                7.1429 %
Kappa statistic                         0.8372
Mean absolute error                     0.0714
Root mean squared error                 0.2673
Relative absolute error                15      %
Root relative squared error            54.1712 %
Total Number of Instances              14

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC     ROC Area  PRC Area  Class
               1.000    0.200    0.900      1.000   0.947      0.849   0.900     0.900     yes
               0.800    0.000    1.000      0.800   0.889      0.849   0.900     0.871     no
Weighted Avg.  0.929    0.129    0.936      0.929   0.926      0.849   0.900     0.890

=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 1 4 | b = no

J48’s classifier model and performance:

J48 pruned tree
------------------
: yes (14.0/5.0)

Number of Leaves : 1

Size of the tree : 1

Time taken to build model: 0 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances          7               50      %
Incorrectly Classified Instances        7               50      %
Kappa statistic                        -0.1395
Mean absolute error                     0.5403
Root mean squared error                 0.5727
Relative absolute error               113.4615 %
Root relative squared error           116.0707 %
Total Number of Instances              14

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC     ROC Area  PRC Area  Class
               0.667    0.800    0.600      0.667   0.632      -0.141  0.211     0.545     yes
               0.200    0.333    0.250      0.200   0.222      -0.141  0.211     0.306     no
Weighted Avg.  0.500    0.633    0.475      0.500   0.485      -0.141  0.211     0.460

=== Confusion Matrix ===

 a b   <-- classified as
 6 3 | a = yes
 4 1 | b = no

Remark: how do OneR and J48 deal with missing values? (A sketch of how this experiment can be run from the Weka Java API follows these remarks.)

- OneR treats "missing" as just another attribute value: in the modified dataset its rule gains an extra branch "? -> no", and because the missing values happen to correlate with the class, its training accuracy jumps to 13/14. This is exactly the pitfall the lecture warns about: the rule exploits the pattern of missingness rather than real information.

- J48 does not treat missing as a separate value; instances whose split attribute is missing are sent down the branches as fractional (weighted) instances. With so many missing outlook values the attribute loses its discriminating power, the pruned tree collapses to a single leaf (": yes (14.0/5.0)"), and cross-validated accuracy drops to 50 %.
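The experiment can be reproduced with the Weka Java API along the following lines. The way missing values are introduced here (blanking outlook on the "no" days) is only an assumption for illustration; the modified file used above may have been prepared differently.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.OneR;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class MissingValueExperiment {
        static void evaluate(String name, Classifier classifier, Instances data) throws Exception {
            // 10-fold cross-validation, as in the Explorer runs above
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(classifier, data, 10, new Random(1));
            System.out.println(name + ": " + eval.pctCorrect() + " % correct");
        }

        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather-nominal.arff");
            data.setClassIndex(data.numAttributes() - 1);

            evaluate("OneR (original)", new OneR(), data);
            evaluate("J48  (original)", new J48(), data);

            // Introduce missing values that correlate with the class:
            // blank out outlook (attribute 0) whenever play = no (assumed modification).
            Instances modified = new Instances(data);
            for (int i = 0; i < modified.numInstances(); i++) {
                String label = modified.classAttribute().value((int) modified.instance(i).classValue());
                if (label.equals("no")) {
                    modified.instance(i).setMissing(0);
                }
            }

            evaluate("OneR (missing values)", new OneR(), modified);
            evaluate("J48  (missing values)", new J48(), modified);
        }
    }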

5.3. Data mining and ethics


Reading

5.4. Association-rule learners

Do experiments to investigate how Apriori and FP-Growth generate association rules for the dataset vote.arff. (A sketch of how both associators can be run from the Weka Java API follows the results table.)

Dataset: vote.arff

Apriori based association rules:

Apriori
=======

Minimum support: 0.45 (196 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 11

Generated sets of large itemsets:

Size of set of large itemsets L(1): 20

Size of set of large itemsets L(2): 17

Size of set of large itemsets L(3): 6

Size of set of large itemsets L(4): 1

Best rules found:

 1. adoption-of-the-budget-resolution=y physician-fee-freeze=n 219 ==> Class=democrat 219   <conf:(1)> lift:(1.63) lev:(0.19) [84] conv:(84.58)
 2. adoption-of-the-budget-resolution=y physician-fee-freeze=n aid-to-nicaraguan-contras=y 198 ==> Class=democrat 198   <conf:(1)> lift:(1.63) lev:(0.18) [76] conv:(76.47)
 3. physician-fee-freeze=n aid-to-nicaraguan-contras=y 211 ==> Class=democrat 210   <conf:(1)> lift:(1.62) lev:(0.19) [80] conv:(40.74)
 4. physician-fee-freeze=n education-spending=n 202 ==> Class=democrat 201   <conf:(1)> lift:(1.62) lev:(0.18) [77] conv:(39.01)
 5. physician-fee-freeze=n 247 ==> Class=democrat 245   <conf:(0.99)> lift:(1.62) lev:(0.21) [93] conv:(31.8)
 6. el-salvador-aid=n Class=democrat 200 ==> aid-to-nicaraguan-contras=y 197   <conf:(0.98)> lift:(1.77) lev:(0.2) [85] conv:(22.18)
 7. el-salvador-aid=n 208 ==> aid-to-nicaraguan-contras=y 204   <conf:(0.98)> lift:(1.76) lev:(0.2) [88] conv:(18.46)
 8. adoption-of-the-budget-resolution=y aid-to-nicaraguan-contras=y Class=democrat 203 ==> physician-fee-freeze=n 198   <conf:(0.98)> lift:(1.72) lev:(0.19) [82] conv:(14.62)
 9. el-salvador-aid=n aid-to-nicaraguan-contras=y 204 ==> Class=democrat 197   <conf:(0.97)> lift:(1.57) lev:(0.17) [71] conv:(9.85)
10. aid-to-nicaraguan-contras=y Class=democrat 218 ==> physician-fee-freeze=n 210   <conf:(0.96)> lift:(1.7) lev:(0.2) [86] conv:(10.47)

FP-Growth based association rules:

=== Associator model (full training set) ===

FPGrowth found 41 rules (displaying top 10)

 1. [el-salvador-aid=y, Class=republican]: 157 ==> [physician-fee-freeze=y]: 156   <conf:(0.99)> lift:(2.44) lev:(0.21) conv:(46.56)
 2. [crime=y, Class=republican]: 158 ==> [physician-fee-freeze=y]: 155   <conf:(0.98)> lift:(2.41) lev:(0.21) conv:(23.43)
 3. [religious-groups-in-schools=y, physician-fee-freeze=y]: 160 ==> [el-salvador-aid=y]: 156   <conf:(0.97)> lift:(2) lev:(0.18) conv:(16.4)
 4. [Class=republican]: 168 ==> [physician-fee-freeze=y]: 163   <conf:(0.97)> lift:(2.38) lev:(0.22) conv:(16.61)
 5. [adoption-of-the-budget-resolution=y, anti-satellite-test-ban=y, mx-missile=y]: 161 ==> [aid-to-nicaraguan-contras=y]: 155   <conf:(0.96)> lift:(1.73) lev:(0.15) conv:(10.2)
 6. [physician-fee-freeze=y, Class=republican]: 163 ==> [el-salvador-aid=y]: 156   <conf:(0.96)> lift:(1.96) lev:(0.18) conv:(10.45)
 7. [religious-groups-in-schools=y, el-salvador-aid=y, superfund-right-to-sue=y]: 160 ==> [crime=y]: 153   <conf:(0.96)> lift:(1.68) lev:(0.14) conv:(8.6)
 8. [el-salvador-aid=y, superfund-right-to-sue=y]: 170 ==> [crime=y]: 162   <conf:(0.95)> lift:(1.67) lev:(0.15) conv:(8.12)
 9. [crime=y, physician-fee-freeze=y]: 168 ==> [el-salvador-aid=y]: 160   <conf:(0.95)> lift:(1.95) lev:(0.18) conv:(9.57)
10. [el-salvador-aid=y, physician-fee-freeze=y]: 168 ==> [crime=y]: 160   <conf:(0.95)> lift:(1.67) lev:(0.15) conv:(8.02)
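For reference, both associators can be run from the Weka Java API roughly as follows; this is a minimal sketch with default parameters, which may differ slightly from the Explorer settings used for the runs above.

    import weka.associations.Apriori;
    import weka.associations.FPGrowth;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AssociationRuleExperiment {
        public static void main(String[] args) throws Exception {
            // vote.arff: every attribute is binary (y/n), so both associators apply
            Instances data = DataSource.read("vote.arff");

            // Apriori: generates candidate itemsets level by level and keeps
            // lowering the minimum support until enough confident rules are found
            Apriori apriori = new Apriori();
            apriori.buildAssociations(data);
            System.out.println(apriori);

            // FP-Growth: compresses the data into an FP-tree and mines frequent
            // itemsets from it without explicit candidate generation
            FPGrowth fpGrowth = new FPGrowth();
            fpGrowth.buildAssociations(data);
            System.out.println(fpGrowth);
        }
    }

The "Number of cycles performed: 11" line in the Apriori output reflects this support-lowering behaviour: by default the support bound starts at 1.0 and is reduced by 0.05 per cycle until the requested number of rules is found, which is how the final minimum support of 0.45 arises. FP-Growth instead reports all 41 rules that meet its thresholds and displays the top 10 by confidence.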
