Lab6-Data Mining
In the fifth class, we look at some more global issues about the data mining process (see
the Class 5 lectures by Ian H. Witten [1]). The class has four lessons, including the data mining process,
pitfalls and pratfalls, and data mining and ethics.
According to [1], the data mining process includes the following steps: ask a question, gather data,
clean the data, define new features, and deploy the result. Write a brief description of each step:
- Ask a question
o This step ensures the data mining process is guided by a clear goal.
- Gather Data
o Purpose: Collect relevant data from various sources that can help address the question.
- Define new features
o Purpose: Create new variables or transformations that better represent the problem.
o Example: Converting dates to "days since last purchase" or normalizing numerical values.
[1] http://www.cs.waikato.ac.nz/ml/weka/mooc/dataminingwithweka/
ITDSIU21030 -lab6
Nguyen Duy Phuc
- Deploy the result
o Purpose: Apply the insights gained to solve the problem or inform decisions.
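The two feature-definition examples mentioned above ("days since last purchase" and normalizing numerical values) can be sketched in a few lines of Python. This is only an illustration: the purchase records, the reference date, and the helper names `days_since` and `min_max_normalize` are all made up for the example and do not come from the lecture.

```python
from datetime import date

def days_since(last_purchase: date, today: date) -> int:
    """Number of whole days between the last purchase and the reference date."""
    return (today - last_purchase).days

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical records: (customer id, last purchase date, amount spent)
purchases = [
    ("c1", date(2024, 1, 10), 20.0),
    ("c2", date(2024, 2, 1), 60.0),
    ("c3", date(2024, 2, 20), 100.0),
]
reference_day = date(2024, 3, 1)  # assumed "today" for the example

recency = [days_since(d, reference_day) for _, d, _ in purchases]
spend_scaled = min_max_normalize([amount for _, _, amount in purchases])

print(recency)       # [51, 29, 10]
print(spend_scaled)  # [0.0, 0.5, 1.0]
```

The raw date is hard for most learners to use directly, whereas the derived recency value is an ordinary number that a classifier can split on.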
Alternatively, according to Han and Kamber (2011), the data mining process is treated as a knowledge
discovery (KDD) process consisting of an iterative sequence of seven steps. List them all below:
- Data Cleaning
o Removing noise, handling missing values, and resolving inconsistencies in the data.
- Data Integration
- Data Selection
- Data Transformation
o Converting data into suitable formats or generating new features (e.g., normalization,
aggregation).
- Data Mining
- Pattern Evaluation
- Knowledge Presentation
o Visualizing and presenting the mined knowledge using reports, charts, or other tools.
Do experiments to investigate how OneR and J48 deal with missing values.
a b <-- classified as
5 4 | a = yes
3 2 | b = no
a b <-- classified as
9 0 | a = yes
1 4 | b = no
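To compare the two confusion matrices above, it helps to reduce each one to an accuracy figure (correct predictions on the diagonal divided by all instances). The sketch below does only that arithmetic; the function name `accuracy` and the variable names are chosen for this example, and the report does not state which classifier produced which matrix.

```python
def accuracy(matrix):
    """Fraction of instances on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Rows are actual classes (yes, no); columns are predicted classes (yes, no).
first_matrix = [[5, 4], [3, 2]]
second_matrix = [[9, 0], [1, 4]]

print(f"{accuracy(first_matrix):.3f}")   # 0.500  (7 of 14 correct)
print(f"{accuracy(second_matrix):.3f}")  # 0.929  (13 of 14 correct)
```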
- OneR is simpler: it treats a missing value as just another attribute value ("missing"), which can
reduce accuracy when missing data is common.
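- J48 is more sophisticated: when the attribute tested at a node is missing, it splits the instance into
fractional pieces sent down every branch, weighted by the proportion of training instances in each
branch, which makes it more robust on incomplete datasets.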
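The sketch below contrasts two common strategies for missing values on a tiny invented dataset. It assumes that a OneR-style learner treats "?" as a separate attribute value, and that a C4.5-style learner splits an instance with a missing value across branches in proportion to the known values. It is a simplified illustration, not Weka's actual implementation, and the function names `one_r_rule` and `branch_weights` are hypothetical.

```python
from collections import Counter, defaultdict

MISSING = "?"  # ARFF marker for a missing value

def one_r_rule(values, labels):
    """Build a one-attribute rule, treating MISSING as an ordinary value."""
    counts = defaultdict(Counter)
    for v, y in zip(values, labels):
        counts[v][y] += 1
    # Each attribute value predicts its majority class.
    return {v: c.most_common(1)[0][0] for v, c in counts.items()}

def branch_weights(values):
    """Fractions used to split an instance with a missing value across branches."""
    known = Counter(v for v in values if v != MISSING)
    total = sum(known.values())
    return {v: n / total for v, n in known.items()}

# Invented toy data: one attribute and the class label.
outlook = ["sunny", "sunny", "sunny", MISSING, "rainy", "rainy", MISSING, "overcast"]
play    = ["no",    "no",    "yes",   "yes",   "yes",   "yes",   "yes",   "yes"]

rule = one_r_rule(outlook, play)
weights = branch_weights(outlook)

print(rule[MISSING])     # yes  (the "?" bucket gets its own prediction)
print(weights["sunny"])  # 0.5  (3 of the 6 known values are sunny)
```

In the first strategy the missing marker simply becomes one more branch of the rule; in the second, an instance with a missing value contributes a weighted share to every branch instead of being forced into a single one.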
Do experiments to investigate how Apriori and FP-Growth generate association rules for the dataset
vote.arff.
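Both Apriori and FP-Growth report rules in terms of support and confidence, so a small brute-force sketch of those two measures may help when reading their output. The code below is neither Apriori nor FP-Growth (it simply enumerates item pairs), and the four transactions are invented records loosely shaped like attribute=value items; they are not taken from vote.arff.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def rules(transactions, min_support, min_confidence):
    """Enumerate single-antecedent rules A => B that meet both thresholds."""
    items = sorted(set().union(*transactions))
    found = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, transactions)
        if pair_support < min_support:
            continue  # infrequent pair: no rule can be built from it
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs}, transactions)
            if confidence >= min_confidence:
                found.append((lhs, rhs, pair_support, confidence))
    return found

# Invented toy records, one set of attribute=value items per voter.
transactions = [
    {"budget=y", "fee-freeze=n", "class=democrat"},
    {"budget=y", "fee-freeze=n", "class=democrat"},
    {"budget=n", "fee-freeze=y", "class=republican"},
    {"budget=y", "fee-freeze=y", "class=republican"},
]

for lhs, rhs, sup, conf in rules(transactions, min_support=0.5, min_confidence=1.0):
    print(f"{lhs} => {rhs}  support={sup:.2f} confidence={conf:.2f}")
```

Apriori and FP-Growth differ in how they find the frequent itemsets efficiently (candidate generation versus a compact prefix tree), but the rules they emit are scored with the same two measures shown here.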