Lab6-Data Mining

lab data mining course HCMIU

Uploaded by

Nguyễn Duy Phúc

ITDSIU21030 -lab6

Nguyen Duy Phuc

Introduction to Data Mining

Lab 6: Putting it all together

5.1. The data mining process

In the fifth class, we look at some more global issues about the data mining process (see the
lecture of class 5 by Ian H. Witten [1]). We go through four lessons: the data mining process,
pitfalls and pratfalls, data mining and ethics, and association-rule learners.

According to [1], the data mining process consists of five steps: ask a question, gather data, clean the data,
define new features, and deploy the result. Write a brief description of each step:

- Ask a question

o Purpose: Clearly define the objective or problem you want to solve.

o Example: What factors contribute most to customer churn?

o This step ensures the data mining process is guided by a clear goal.

- Gather Data

o Purpose: Collect relevant data from various sources that can help address the question.

o Example: Data from CRM systems, transaction logs, or external databases.

o Ensure you have sufficient and relevant data for analysis.

- Clean the Data

o Purpose: Handle missing values, remove outliers, and correct inconsistencies.

o Example: Replacing null values, standardizing formats, or filtering noise.

o Cleaning improves the reliability and accuracy of the data.

- Define New Features

o Purpose: Create new variables or transformations that better represent the problem.

o Example: Converting dates to "days since last purchase" or normalizing numerical values.

o Helps extract useful patterns and enhances model performance

[1] http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/

- Deploy the Result

o Purpose: Apply the insights gained to solve the problem or inform decisions.

o Example: Using the model to predict customer churn or recommend products.

o Finalizes the process by providing actionable outcomes.
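The "clean the data" and "define new features" steps above can be illustrated with a minimal sketch. The records, field names, and reference date below are hypothetical and only serve to show mean imputation, the "days since last purchase" feature, and min-max normalization; they are not taken from the lecture.

```python
from datetime import date

# Hypothetical purchase records; None marks a missing amount.
purchases = [
    {"customer": "a", "last_purchase": date(2024, 1, 10), "amount": 120.0},
    {"customer": "b", "last_purchase": date(2024, 3, 1), "amount": None},
    {"customer": "c", "last_purchase": date(2024, 2, 15), "amount": 80.0},
]
today = date(2024, 3, 31)  # fixed reference date so the example is reproducible

# Clean the data: replace a missing amount with the mean of the known amounts.
known = [p["amount"] for p in purchases if p["amount"] is not None]
mean_amount = sum(known) / len(known)
for p in purchases:
    if p["amount"] is None:
        p["amount"] = mean_amount

# Define new features: days since last purchase and a min-max normalized amount.
lo = min(p["amount"] for p in purchases)
hi = max(p["amount"] for p in purchases)
for p in purchases:
    p["days_since_last_purchase"] = (today - p["last_purchase"]).days
    p["amount_norm"] = (p["amount"] - lo) / (hi - lo)

for p in purchases:
    print(p["customer"], p["days_since_last_purchase"], p["amount_norm"])
```

Mean imputation is only one possible choice here; whether it is appropriate depends on why the values are missing.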

Alternatively, according to (Han and Kamber, 2011), the data mining process is treated as a knowledge
discovery (KDD) process consisting of an iterative sequence of seven steps. List them all below:

- Data Cleaning

o Removing noise, handling missing values, and resolving inconsistencies in the data.

o Objective: Ensure the quality and reliability of the data.

- Data Integration

o Combining data from multiple sources into a coherent data set.

o Objective: Unify data to create a consistent dataset for analysis.

- Data Selection

o Selecting relevant data for the analysis task.

o Objective: Focus on data that is pertinent to the problem or question.

- Data Transformation

o Converting data into suitable formats or generating new features (e.g., normalization,
aggregation).

o Objective: Prepare the data for efficient and accurate mining.

- Data Mining

o Applying algorithms to identify patterns or extract knowledge.

o Objective: Discover hidden insights or useful patterns in the data.

- Pattern Evaluation

o Identifying the most interesting patterns that represent knowledge.

o Objective: Focus on patterns that are meaningful, valid, and actionable.

- Knowledge Presentation


o Visualizing and presenting the mined knowledge using reports, charts, or other tools.

o Objective: Make the results comprehensible and usable for decision-making.
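As a rough illustration of how the seven KDD steps chain together, the sketch below runs them on two tiny in-memory sources. All data and names (`customers`, `transactions`, `region_of`) are hypothetical, and the "mining" step is a deliberately trivial stand-in (picking the region with the highest total), not a real mining algorithm.

```python
from collections import defaultdict

# Two hypothetical data sources.
customers = [
    {"id": 1, "region": "north"},
    {"id": 2, "region": "south"},
    {"id": 3, "region": "north"},
]
transactions = [
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 1, "amount": None},  # noisy record: missing amount
    {"customer_id": 2, "amount": 20.0},
    {"customer_id": 3, "amount": 50.0},
]

# 1. Data cleaning: drop records with a missing amount.
clean = [t for t in transactions if t["amount"] is not None]

# 2. Data integration: attach each customer's region to the transaction.
region_of = {c["id"]: c["region"] for c in customers}
integrated = [{**t, "region": region_of[t["customer_id"]]} for t in clean]

# 3. Data selection: keep only the fields relevant to the question.
selected = [(r["region"], r["amount"]) for r in integrated]

# 4. Data transformation: aggregate amounts per region.
totals = defaultdict(float)
for region, amount in selected:
    totals[region] += amount

# 5. Data mining (stand-in): find the region with the highest total.
best_region = max(totals, key=totals.get)

# 6. Pattern evaluation: keep the pattern only if it is not a tie.
is_interesting = sorted(totals.values())[-1] > sorted(totals.values())[-2]

# 7. Knowledge presentation: report the result.
if is_interesting:
    print(f"Highest total spend: {best_region} ({totals[best_region]:.1f})")
```

In practice the sequence is iterative: a weak result at step 6 usually sends the analyst back to selection or transformation.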

5.2. Pitfalls and pratfalls

Follow the lecture in [1] to learn about pitfalls and pratfalls in data mining.

Do experiments to investigate how OneR and J48 deal with missing values.

Write down the results in the following table:

Dataset: weather-nominal.arff (original)

OneR's classifier model and performance:

=== Classifier model (full training set) ===

outlook:
    sunny    -> no
    overcast -> yes
    rainy    -> yes
(10/14 instances correct)

Time taken to build model: 0 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances           6               42.8571 %
Incorrectly Classified Instances         8               57.1429 %
Kappa statistic                         -0.1429
Mean absolute error                      0.5714
Root mean squared error                  0.7559
Relative absolute error                120      %
Root relative squared error            153.2194 %
Total Number of Instances               14

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.444    0.600    0.571      0.444    0.500      -0.149   0.422     0.611     yes
                 0.400    0.556    0.286      0.400    0.333      -0.149   0.422     0.329     no
Weighted Avg.    0.429    0.584    0.469      0.429    0.440      -0.149   0.422     0.510

=== Confusion Matrix ===

 a b   <-- classified as
 4 5 | a = yes
 3 2 | b = no

J48's classifier model and performance:

J48 pruned tree
------------------

outlook = sunny
|   humidity = high: no (3.0)
|   humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves  : 5

Size of the tree : 8

Time taken to build model: 0 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances           7               50      %
Incorrectly Classified Instances         7               50      %
Kappa statistic                         -0.0426
Mean absolute error                      0.4167
Root mean squared error                  0.5984
Relative absolute error                 87.5    %
Root relative squared error            121.2987 %
Total Number of Instances               14

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.556    0.600    0.625      0.556    0.588      -0.043   0.633     0.758     yes
                 0.400    0.444    0.333      0.400    0.364      -0.043   0.633     0.457     no
Weighted Avg.    0.500    0.544    0.521      0.500    0.508      -0.043   0.633     0.650

=== Confusion Matrix ===

 a b   <-- classified as
 5 4 | a = yes
 3 2 | b = no

Dataset: weather-nominal.arff (with missing values)

OneR's classifier model and performance:

=== Classifier model (full training set) ===

outlook:
    sunny    -> yes
    overcast -> yes
    rainy    -> yes
    ?        -> no
(13/14 instances correct)

Time taken to build model: 0 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances          13               92.8571 %
Incorrectly Classified Instances         1                7.1429 %
Kappa statistic                          0.8372
Mean absolute error                      0.0714
Root mean squared error                  0.2673
Relative absolute error                 15      %
Root relative squared error             54.1712 %
Total Number of Instances               14

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 1.000    0.200    0.900      1.000    0.947      0.849    0.900     0.900     yes
                 0.800    0.000    1.000      0.800    0.889      0.849    0.900     0.871     no
Weighted Avg.    0.929    0.129    0.936      0.929    0.926      0.849    0.900     0.890

=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 1 4 | b = no

J48's classifier model and performance:

J48 pruned tree
------------------
: yes (14.0/5.0)

Number of Leaves  : 1

Size of the tree : 1

Time taken to build model: 0 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances           7               50      %
Incorrectly Classified Instances         7               50      %
Kappa statistic                         -0.1395
Mean absolute error                      0.5403
Root mean squared error                  0.5727
Relative absolute error                113.4615 %
Root relative squared error            116.0707 %
Total Number of Instances               14

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.667    0.800    0.600      0.667    0.632      -0.141   0.211     0.545     yes
                 0.200    0.333    0.250      0.200    0.222      -0.141   0.211     0.306     no
Weighted Avg.    0.500    0.633    0.475      0.500    0.485      -0.141   0.211     0.460

=== Confusion Matrix ===

 a b   <-- classified as
 6 3 | a = yes
 4 1 | b = no
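The kappa statistics reported above can be checked by hand from the four confusion matrices. The sketch below uses the standard Cohen's kappa formula (observed agreement versus agreement expected by chance); the function name `kappa` and the variable names are only illustrative.

```python
def kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows = actual, columns = predicted)."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    # Agreement expected if predictions were independent of the true class.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    return (observed - expected) / (1 - expected)

# Confusion matrices copied from the results above.
oner_original = kappa([[4, 5], [3, 2]])
j48_original = kappa([[5, 4], [3, 2]])
oner_missing = kappa([[9, 0], [1, 4]])
j48_missing = kappa([[6, 3], [4, 1]])

for name, value in [
    ("OneR original", oner_original),
    ("J48 original", j48_original),
    ("OneR missing", oner_missing),
    ("J48 missing", j48_missing),
]:
    print(f"{name}: {value:.4f}")
```

A negative kappa, as in three of the four runs, means the classifier agrees with the true labels less often than chance agreement would predict.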

Remark: how do OneR and J48 deal with missing values?

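- OneR treats a missing value as a separate attribute value. In the model built on the file with missing
values, the outlook rule gains its own "?" branch (? -> no), and cross-validated accuracy rises from
42.86% to 92.86%. This suggests that, in this modified file, whether outlook is missing happens to line up
with the class; it is not a general advantage of OneR.

- J48 handles missing values with fractional instances: an instance whose tested attribute is missing is
split into weighted parts that are sent down every branch. In this experiment the pruned tree collapses
to a single leaf (yes, 14.0/5.0) and cross-validated accuracy stays at 50%, so J48 did not cope better
than OneR with this particular file.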
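To make the "missing as a separate value" behaviour concrete, here is a minimal OneR-style sketch for nominal attributes. It is not Weka's implementation: `one_r` and `MISSING` are hypothetical names, the data is assumed to be the standard 14-instance nominal weather data, and the two blanked outlook values are an invented modification, not the file used in this lab.

```python
from collections import Counter, defaultdict

MISSING = "?"  # a missing value is treated as one more category

COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")
WEATHER = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "TRUE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def one_r(rows, attributes, target):
    """Return (attribute, rule, correct) for the best single-attribute rule."""
    best = None
    for attr in attributes:
        counts = defaultdict(Counter)
        for row in rows:
            value = row[attr] if row[attr] is not None else MISSING
            counts[value][row[target]] += 1
        # Predict the majority class for each attribute value.
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        correct = sum(c[rule[v]] for v, c in counts.items())
        if best is None or correct > best[2]:  # the first attribute wins ties
            best = (attr, rule, correct)
    return best


rows = [dict(zip(COLUMNS, r)) for r in WEATHER]
attr, rule, correct = one_r(rows, COLUMNS[:-1], "play")
print(attr, rule, correct)

# Hypothetical modification: blank out outlook in the first two instances.
rows_missing = [dict(r) for r in rows]
rows_missing[0]["outlook"] = None
rows_missing[1]["outlook"] = None
attr_m, rule_m, correct_m = one_r(rows_missing, COLUMNS[:-1], "play")
print(attr_m, rule_m, correct_m)
```

On the unmodified data the sketch picks the same outlook rule as the OneR model above, with 10 of 14 training instances correct. Once some outlook values are blanked, the rule gains a "?" entry, which is how missingness itself can become predictive.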

5.3. Data mining and ethics

Reading

5.4. Association-rule learners

Do experiments to investigate how Apriori and FP-Growth generate association rules for the dataset
vote.arff.

Dataset: vote.arff

Apriori based association rules:

Apriori
=======

Minimum support: 0.45 (196 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 11

Generated sets of large itemsets:

Size of set of large itemsets L(1): 20

Size of set of large itemsets L(2): 17

Size of set of large itemsets L(3): 6

Size of set of large itemsets L(4): 1

Best rules found:

 1. adoption-of-the-budget-resolution=y physician-fee-freeze=n 219 ==> Class=democrat 219    <conf:(1)> lift:(1.63) lev:(0.19) [84] conv:(84.58)
 2. adoption-of-the-budget-resolution=y physician-fee-freeze=n aid-to-nicaraguan-contras=y 198 ==> Class=democrat 198    <conf:(1)> lift:(1.63) lev:(0.18) [76] conv:(76.47)
 3. physician-fee-freeze=n aid-to-nicaraguan-contras=y 211 ==> Class=democrat 210    <conf:(1)> lift:(1.62) lev:(0.19) [80] conv:(40.74)
 4. physician-fee-freeze=n education-spending=n 202 ==> Class=democrat 201    <conf:(1)> lift:(1.62) lev:(0.18) [77] conv:(39.01)
 5. physician-fee-freeze=n 247 ==> Class=democrat 245    <conf:(0.99)> lift:(1.62) lev:(0.21) [93] conv:(31.8)
 6. el-salvador-aid=n Class=democrat 200 ==> aid-to-nicaraguan-contras=y 197    <conf:(0.98)> lift:(1.77) lev:(0.2) [85] conv:(22.18)
 7. el-salvador-aid=n 208 ==> aid-to-nicaraguan-contras=y 204    <conf:(0.98)> lift:(1.76) lev:(0.2) [88] conv:(18.46)
 8. adoption-of-the-budget-resolution=y aid-to-nicaraguan-contras=y Class=democrat 203 ==> physician-fee-freeze=n 198    <conf:(0.98)> lift:(1.72) lev:(0.19) [82] conv:(14.62)
 9. el-salvador-aid=n aid-to-nicaraguan-contras=y 204 ==> Class=democrat 197    <conf:(0.97)> lift:(1.57) lev:(0.17) [71] conv:(9.85)
10. aid-to-nicaraguan-contras=y Class=democrat 218 ==> physician-fee-freeze=n 210    <conf:(0.96)> lift:(1.7) lev:(0.2) [86] conv:(10.47)

FP-Growth based association rules:

=== Associator model (full training set) ===

FPGrowth found 41 rules (displaying top 10)

 1. [el-salvador-aid=y, Class=republican]: 157 ==> [physician-fee-freeze=y]: 156   <conf:(0.99)> lift:(2.44) lev:(0.21) conv:(46.56)
 2. [crime=y, Class=republican]: 158 ==> [physician-fee-freeze=y]: 155   <conf:(0.98)> lift:(2.41) lev:(0.21) conv:(23.43)
 3. [religious-groups-in-schools=y, physician-fee-freeze=y]: 160 ==> [el-salvador-aid=y]: 156   <conf:(0.97)> lift:(2) lev:(0.18) conv:(16.4)
 4. [Class=republican]: 168 ==> [physician-fee-freeze=y]: 163   <conf:(0.97)> lift:(2.38) lev:(0.22) conv:(16.61)
 5. [adoption-of-the-budget-resolution=y, anti-satellite-test-ban=y, mx-missile=y]: 161 ==> [aid-to-nicaraguan-contras=y]: 155   <conf:(0.96)> lift:(1.73) lev:(0.15) conv:(10.2)
 6. [physician-fee-freeze=y, Class=republican]: 163 ==> [el-salvador-aid=y]: 156   <conf:(0.96)> lift:(1.96) lev:(0.18) conv:(10.45)
 7. [religious-groups-in-schools=y, el-salvador-aid=y, superfund-right-to-sue=y]: 160 ==> [crime=y]: 153   <conf:(0.96)> lift:(1.68) lev:(0.14) conv:(8.6)
 8. [el-salvador-aid=y, superfund-right-to-sue=y]: 170 ==> [crime=y]: 162   <conf:(0.95)> lift:(1.67) lev:(0.15) conv:(8.12)
 9. [crime=y, physician-fee-freeze=y]: 168 ==> [el-salvador-aid=y]: 160   <conf:(0.95)> lift:(1.95) lev:(0.18) conv:(9.57)
10. [el-salvador-aid=y, physician-fee-freeze=y]: 168 ==> [crime=y]: 160   <conf:(0.95)> lift:(1.67) lev:(0.15) conv:(8.02)
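The conf, lift, lev, and conv figures attached to each rule can be recomputed from instance counts. The sketch below does this for Apriori rule 5 (physician-fee-freeze=n 247 ==> Class=democrat 245). It assumes the standard vote.arff with 435 instances, of which 267 are democrats; those two counts do not appear in the output above. The "+1" in the conviction denominator is also an assumption, chosen because it reproduces the reported values; the textbook definition of conviction has no such term.

```python
def rule_metrics(n_total, n_premise, n_consequent, n_both):
    """Metrics for a rule premise ==> consequent, computed from instance counts."""
    confidence = n_both / n_premise
    # Lift: confidence relative to the consequent's baseline frequency.
    lift = confidence / (n_consequent / n_total)
    # Leverage: observed joint frequency minus the frequency expected under independence.
    leverage = n_both / n_total - (n_premise / n_total) * (n_consequent / n_total)
    # Conviction with an assumed +1 smoothing term in the denominator.
    conviction = (n_premise * (n_total - n_consequent)) / (
        n_total * (n_premise - n_both + 1)
    )
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Rule 5: physician-fee-freeze=n (247) ==> Class=democrat (245 of the 247).
metrics = rule_metrics(n_total=435, n_premise=247, n_consequent=267, n_both=245)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the rounded values agree with the reported conf:(0.99), lift:(1.62), lev:(0.21), and conv:(31.8), and the leverage multiplied by 435 gives the bracketed count of about 93 extra instances.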
