Data Mining Lab Manual
S.No NAME OF THE EXPERIMENT
1. Study of Creation of a Data Warehouse.
2. Creation of DataSet
3. Apriori Algorithm.
4. FP-Growth Algorithm.
5. K-means Clustering.
6. Hierarchical Clustering.
7. Bayesian Classification using WEKA.
8. Decision Tree.
9. Support Vector Machines.
Additional Experiments
IT6711.1 Apply data mining techniques and methods to large data sets.
IT6711.2 Use data mining tools.
IT6711.3 Compare and contrast the various classifiers
IT6711 PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3 PSO4
IT6711.1 1 1 2 2 1 - - 1 - 1 - 2 1 2 2 1
IT6711.2 1 1 2 2 3 1 1 1 2 1 - 2 1 2 2 1
IT6711.3 1 1 2 2 3 1 1 1 2 1 - 2 1 1 2 1
S.No Knowledge Level Experiment Course Outcomes
CYCLE-I
1 L2 & L5 Study of Creation of a Data Warehouse. IT6711.1
2 L2 & L5 Creation of DataSet IT6711.1
3 L2 & L3 Apriori Algorithm. IT6711.1
4 L2 & L3 FP-Growth Algorithm. IT6711.1
CYCLE-II
5 L5,L3,L4 K-means clustering. IT6711.2
6 L5,L3,L4 One Hierarchical clustering algorithm. IT6711.2
7 L5,L3,L4 Bayesian Classification using WEKA IT6711.2
8 L5,L3,L4 Decision Tree. IT6711.2
9 L5,L3,L4 Support Vector Machines. IT6711.2
CYCLE-III
10. L5,L3,L4 Case study on Banking Application IT6711.3
11. L5,L3,L4 Case study on Text Mining IT6711.3
1. By acquiring a strong foundation in basic and advanced engineering concepts, students gain expertise in formulating, analyzing and solving engineering problems.
2. By enhancing logical reasoning skills, students are made capable of designing optimized technological solutions in industry and academia alike.
3. By moulding students to be active team players possessing strong interpersonal skills and leadership qualities with entrepreneurial ability.
4. By encouraging continuous self-learning, students are trained to meet the current demands of industry and to carry out research in cutting-edge technologies.
5. By infusing a professionally ethical approach to solving critical engineering problems, students are encouraged to derive solutions considering economic, environmental, ethical, and societal issues.
1. Be able to use and apply mathematical foundations, algorithmic principles and computer science theory in the modeling and design of computer-based systems for providing competent technological solutions.
2. Be able to identify and analyze user needs and take them into account when selecting, creating and evaluating IT-based solutions, thereby effectively integrating intelligent information tools for the benefit of society.
3. Be able to apply design, development and management ideologies in the creation of effective information systems of varying complexity.
4. Understand best practices and ethical standards and replicate them in the design and development of IT solutions.
STUDY OF CREATION OF A DATA WAREHOUSE
Ex.No.1
AIM
To study the creation of a data warehouse using the star, snowflake and galaxy (fact constellation) schemas.
PROCEDURE
Design the data warehouse using the star, snowflake and galaxy schemas.
Design a data cube that contains one fact table; design item, time, supplier, location and customer dimension tables, and identify the measures for sales. Insert a minimum of 4 items (e.g. bikes, small cars, mid-segment cars, car consumables). For the region/location dimension, enter at least 10-12 records, with a minimum of 2 states and at least 2 cities from each state. Keep track of sales quarter-wise.
Implement the above fact and dimension tables in Oracle 10g as ordinary relational database tables, and analyze them with the help of a SQL tool.
Use OLAP operations such as slice, dice, roll-up and drill-down.
Dimensional modeling (DM) is the name of a logical design technique often used for data warehouses. Dimensional modeling always uses the concepts of facts, measures, and dimensions. Facts are typically (but not always) numeric values that can be aggregated. Dimensions are groups of hierarchies and descriptors that define the facts. For example, sales amount is a fact; timestamp, product, register#, store#, etc. are elements of dimensions.
Dimensional models are built by business process area, e.g. store sales, inventory, claims, etc.
Fact table
The fact table is not a typical relational database table, as it is de-normalized on purpose to enhance query response times. The fact table typically contains records that are ready to explore, usually with ad hoc queries. Records in the fact table are often referred to as events, due to the time-variant nature of a data warehouse environment. The primary key for the fact table is a composite of all the columns except numeric values/scores (like QUANTITY, TURNOVER, exact invoice date and time).
Typical fact tables in a global enterprise data warehouse include (usually there may be additional company- or business-specific fact tables):
Dimension table
Nearly all of the information in a typical fact table is also present in one or more
dimension tables. The main purpose of maintaining Dimension Tables is to allow browsing the
categories quickly and easily. The primary keys of each of the dimension tables are linked
together to form the composite primary key of the fact table. In a star schema design, there is
only one de‐normalized table for a given dimension.
Typical dimension tables in a data warehouse are:
Time dimension table
Customers dimension table
Products dimension table
Key account managers (KAM) dimension table
Sales office dimension table
The problem is that the more normalized the dimension tables are, the more complicated the SQL joins that must be issued to query them. This is because, for a query to be answered, many tables need to be joined and aggregates generated.
Fact constellation/Galaxy schema Architecture
For each star or snowflake schema it is possible to construct a fact constellation schema. This schema is more complex than the star or snowflake architecture because it contains multiple fact tables, which allows dimension tables to be shared among them.
In a fact constellation schema, different fact tables are explicitly assigned to the dimensions that are relevant for the given facts. This may be useful when some facts are associated with a given dimension level and other facts with a deeper dimension level.
Use of this model is reasonable when, for example, there is a sales fact table (with details down to the exact date and invoice header id) and a sales forecast fact table calculated from month, client id and product id.
RESULT
CREATION OF DATASET
Ex.No.2
AIM
To create a simple data set that can be opened in WEKA.
PROCEDURE
1. Data can be imported from various file formats such as CSV and ARFF.
2. To create the CSV file, create a table of student mark details.
3. Save it in CSV format from Excel.
4. Open the file in WEKA.
5. Save it as an ARFF file by selecting the Save button in the Preprocess panel.
An ARFF file contains two sections: a HEADER (the @relation name plus one @attribute declaration per column) and a DATA section.
@relation marks
@data
100,'ABHISHEKRAM M',97,97.0,57.0,48,48,pass
102,'AISWARYA PL',100,100.0,27.0,90,90,fail
104,'AKASH G',100,100.0,88.0,100,100,pass
106,'AKSHAY KUMAR S',62,A,68.0,100,100,fail
108,'AKSHAY RAMANUJAM RANGANATHAN',57,57.0,52.0,89,89,pass
110,'ANISHA JULIET E',82,82.0,9.0,34,34,fail
112,'ANUGRAHA S',100,100.0,78.0,100,100,pass
114,'ARAVIND BHARATHY S',100,100.0,87.0,100,100,pass
116,'ARAVIND KUMARAN R',100,100.0,83.0,100,100,pass
118,'ARCHANA V K',42,42.0,59.0,100,100,fail
120,'AROKIA JOYCE A',60,60.0,49.0,78,78,pass
122,'ASHWIN SHANMUGAM I',25,30.0,A,21,30,fail
?,?,?,?,?,?,?,?
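For reference, the same conversion can be scripted with WEKA's Java converter classes instead of the Explorer GUI. A minimal sketch, assuming the WEKA jar is on the classpath; marks.csv and marks.arff are placeholder file names:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Read the CSV file exported from Excel (placeholder path).
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("marks.csv"));
        Instances data = loader.getDataSet();

        // Write the same instances out in ARFF format.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("marks.arff"));
        saver.writeBatch();
    }
}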
Result:
APRIORI ALGORITHM
Ex.No.3
AIM
To generate association rules from a transaction dataset using the Apriori algorithm in WEKA.
Dataset: fruit transactions (T = item present in the transaction, ? = missing)
?,?,T,?,?,T,?
?,?,?,?,?,T,?
?,?,?,T,?,T,?
T,T,T,?,?,?,T
?,?,?,?,T,?,T
T,T,?,?,?,?,?
Load the Fruit dataset
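The Explorer run shown below can also be reproduced programmatically. A minimal sketch, assuming the fruit transactions are saved as fruit.arff (a placeholder name); the support and confidence thresholds mirror the output that follows:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        // Load the market-basket style dataset (placeholder path).
        Instances data = DataSource.read("fruit.arff");

        // Mirror the Explorer run: minimum support 0.3, minimum confidence 0.9.
        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.3);
        apriori.setMinMetric(0.9);

        apriori.buildAssociations(data);
        System.out.println(apriori); // prints the best rules found
    }
}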
OUTPUT
=== Run information ===
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.3 (3 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 14
RESULT
FP-GROWTH ALGORITHM
Ex.No.4
AIM
To generate frequent itemsets and association rules from a transaction dataset using the FP-Growth algorithm in WEKA.
Dataset: fruit transactions (T = item present in the transaction, ? = missing)
T,?,?,T,?,?,?
?,?,?,T,T,?,?
T,T,T,?,?,?,?
?,?,T,?,?,T,?
?,?,?,?,?,T,?
?,?,?,T,?,T,?
T,T,T,?,?,?,T
?,?,?,?,T,?,T
T,T,?,?,?,?,?
Load the Fruit dataset
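As with Apriori, the FP-Growth associator can be driven from WEKA's Java API. A minimal sketch, assuming the same fruit.arff placeholder file and comparable thresholds:

import weka.associations.FPGrowth;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FPGrowthDemo {
    public static void main(String[] args) throws Exception {
        // Load the same fruit transactions used above (placeholder path).
        Instances data = DataSource.read("fruit.arff");

        // FP-Growth mines frequent itemsets without candidate generation.
        FPGrowth fp = new FPGrowth();
        fp.setLowerBoundMinSupport(0.3); // assumed thresholds, as in the Apriori run
        fp.setMinMetric(0.9);

        fp.buildAssociations(data);
        System.out.println(fp); // prints the large itemsets and rules
    }
}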
OUTPUT:
=== Run information ===
RESULT:
K-MEANS CLUSTERING
Ex.No.5
AIM
This experiment illustrates the use of simple k-means clustering with the WEKA explorer. The sample data set used for this example is the iris.arff data set. This document assumes that appropriate pre-processing has been performed.
K-MEANS CLUSTERING:
K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to group a given data set into a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centers, one for each cluster. These centers should be placed carefully, because different locations cause different results. So the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest center. When no point is pending, the first step is completed and an early grouping is done.
PROCEDURE:
1. Run the WEKA explorer and load the data file iris.arff in the preprocessing interface.
2. In order to perform clustering, select the 'Cluster' tab in the explorer and click on the 'Choose' button. This step results in a dropdown list of available clustering algorithms.
3. Select the 'SimpleKMeans' algorithm from the list.
4. Next click on the text box to the right of the 'Choose' button to get the popup window shown in the screenshots. In this window we enter the number of clusters and leave the seed value as it is. The seed value is used in generating a random number, which in turn is used for making the initial assignment of instances to clusters.
5. Once the options have been specified, we run the clustering algorithm. In the 'Cluster mode' panel we make sure the 'Use training set' option is selected, and then
we click the 'Start' button. The process and the resulting window are shown in the following screenshots.
6. The result window shows the centroid of each cluster as well as statistics on the number and percentage of instances assigned to the different clusters. The cluster centroids are the mean vectors for each cluster, and they can be used to characterize the clusters. For example, the centroid of cluster 1 shows that for the class Iris-versicolor the mean value of sepal length is 5.4706, sepal width 2.4765, petal width 1.1294 and petal length 3.7941.
7. Another way of understanding the characteristics of each cluster is through visualization. We can do this by right-clicking the result set in the result list panel and selecting 'Visualize cluster assignments'.
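The Explorer steps above can also be expressed with WEKA's clustering API. A minimal sketch, assuming iris.arff is on disk (placeholder path); it drops the class attribute the way the Explorer ignores it, then clusters into two groups as in the output below:

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // placeholder path

        // Drop the class attribute so only the four measurements are clustered.
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances input = Filter.useFilter(data, remove);

        // Two clusters and a fixed seed, matching the run shown in the output.
        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(2);
        kmeans.setSeed(10);
        kmeans.buildClusterer(input);

        // Evaluate on the training data and print centroids and assignments.
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(kmeans);
        eval.evaluateClusterer(input);
        System.out.println(eval.clusterResultsToString());
    }
}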
Dataset iris.arff
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
Load the iris.arff dataset
OUTPUT:
kMeans
Number of iterations: 7
Within cluster sum of squared errors: 62.1436882815797
Missing values globally replaced with mean/mode
Cluster centroids:
                       Cluster#
Attribute    Full Data        0        1
                 (150)     (100)     (50)
==================================================================
sepallength 5.8433 6.262 5.006
sepalwidth 3.054 2.872 3.418
petallength 3.7587 4.906 1.464
petalwidth 1.1987 1.676 0.244
class Iris-setosa Iris-versicolor Iris-setosa
Clustered Instances
0 100 ( 67%)
1 50 ( 33%)
RESULT:
HIERARCHICAL CLUSTERING
Ex.No.6
AIM
This experiment illustrates the use of a hierarchical clustering algorithm with the WEKA explorer. The sample data set used for this example is the weather.arff data set. This document assumes that appropriate pre-processing has been performed.
HIERARCHICAL CLUSTERING
PROCEDURE:
1. Open the data file in WEKA Explorer. It is presumed that the required data fields have been discretized.
2. Clicking on the 'Cluster' tab will bring up the interface for the clustering algorithms.
3. Choose the 'HierarchicalClusterer' algorithm.
4. In order to change the parameters for the run (e.g. the Euclidean, Manhattan or Minkowski distance function), we click on the text box immediately to the right of the 'Choose' button.
5. Visualize the resulting dendrogram.
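A minimal Java sketch of the same run, assuming weather.arff is available locally (placeholder path); the two-cluster cut-off and Euclidean distance are assumptions matching this experiment's output:

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.HierarchicalClusterer;
import weka.core.EuclideanDistance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class HierarchicalDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff"); // placeholder path

        // Agglomerative clustering cut off at two clusters, as in the output below.
        HierarchicalClusterer hc = new HierarchicalClusterer();
        hc.setNumClusters(2);
        hc.setDistanceFunction(new EuclideanDistance()); // Manhattan etc. also possible
        hc.buildClusterer(data);

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(hc);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}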
Dataset weather.arff
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
Number of Instances: 14
Number of Attributes: 5
Number of Class: 2
The following screenshot shows the clustering output that was generated when the hierarchical clustering algorithm was applied to the given dataset.
OUTPUT:
Model and evaluation on training set
Cluster 0
((1.0:1,1.0:1):0,1.0:1)
Cluster 1
(((((0.0:1,0.0:1):0.41421,((((0.0:1,0.0:1):0,(0.0:1,0.0:1):0):0.41421,1.0:1.41421):0,0.0:1.41421):0):0,0.0:1.41421):0,0.0:1.41421):0,1.0:1.41421)
Clustered Instances
0 3 ( 21%)
1 11 ( 79%)
RESULT:
BAYESIAN CLASSIFICATION
Ex.No.7
AIM
This experiment illustrates the use of a Bayesian classifier with the WEKA explorer. The sample data set used for this example is the soybean.arff data set. This document assumes that appropriate pre-processing has been performed.
BAYESIAN CLASSIFICATION
Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical classifiers; they can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
PROCEDURE:
1. Open the data file in WEKA Explorer. It is presumed that the required data fields have been discretized.
2. Next we select the 'Classify' tab and click the 'Choose' button to select the 'NaiveBayes' classifier.
3. Now we specify the various parameters. These can be specified by clicking in the text box to the right of the 'Choose' button. In this example, we accept the default values.
4. We select 10-fold cross-validation as our evaluation approach. Since we do not have a separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated model.
5. We now click 'Start' to generate the model. The text version of the model as well as the evaluation statistics will appear in the right panel when the model construction is complete.
6. Note the classification accuracy of the model (about 93% here). A low accuracy would indicate that more work is needed, either in preprocessing or in selecting the current parameters for the classification.
7. WEKA also lets us view the model output in detail, including the per-class accuracy and the confusion matrix.
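The same evaluation can be scripted with the WEKA API. A minimal sketch, assuming soybean.arff is available locally (placeholder path) and the class is the last attribute:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("soybean.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);     // class is the last attribute

        NaiveBayes nb = new NaiveBayes();

        // 10-fold cross-validation, as selected in the Explorer.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toMatrixString());
    }
}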
Dataset soybean.arff
@RELATION soybean
@DATA
october, normal, gt-norm, norm, yes, same-lst-yr, low-areas, pot-severe, none, 90-100, abnorm,
abnorm, absent, dna, dna, absent, absent, absent, abnorm, no, above-sec-nde, brown, present,
firm-and-dry, absent, none, absent, norm, dna, norm, absent, absent, norm, absent, norm,
diaporthe-stem-canker
august, normal, gt-norm, norm, yes, same-lst-two-yrs, scattered, severe, fungicide, 80-89,
abnorm, abnorm, absent, dna, dna, absent, absent, absent, abnorm, yes, above-sec-nde, brown,
present, firm-and-dry, absent, none, absent, norm, dna, norm, absent, absent, norm, absent, norm,
diaporthe-stem-canker
july, normal, gt-norm, norm, yes, same-lst-yr, scattered, severe, fungicide, lt-80, abnorm, abnorm, absent, dna, dna, absent, absent, absent, abnorm, yes, above-sec-nde, dna, present, firm-and-dry, absent, none, absent, norm, dna, norm, absent, absent, norm, absent, norm, diaporthe-stem-canker
OUTPUT:
Correctly Classified Instances 635 92.9722 %
Incorrectly Classified Instances 48 7.0278 %
Kappa statistic 0.923
Mean absolute error 0.0096
Root mean squared error 0.0817
Relative absolute error 9.9344 %
Root relative squared error 37.2742 %
Coverage of cases (0.95 level) 95.1684 %
Mean rel. region size (0.95 level) 6.5501 %
Total Number of Instances 683
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 0 1 1 1 1 diaporthe-stem-canker
1 0 1 1 1 1 charcoal-rot
1 0 1 1 1 1 rhizoctonia-root-rot
1 0.003 0.978 1 0.989 1 phytophthora-rot
1 0 1 1 1 1 brown-stem-rot
1 0 1 1 1 1 powdery-mildew
1 0 1 1 1 1 downy-mildew
0.837 0.008 0.939 0.837 0.885 0.989 brown-spot
1 0.003 0.909 1 0.952 1 bacterial-blight
0.9 0 1 0.9 0.947 1 bacterial-pustule
1 0 1 1 1 1 purple-seed-stain
1 0 1 1 1 1 anthracnose
0.85 0.008 0.773 0.85 0.81 0.994 phyllosticta-leaf-spot
1 0.049 0.758 1 0.863 0.991 alternarialeaf-spot
0.714 0.007 0.942 0.714 0.813 0.98 frog-eye-leaf-spot
1 0.001 0.938 1 0.968 1 diaporthe-pod-&-stem-blight
1 0 1 1 1 1 cyst-nematode
a b c d e f g h i j k l m n o p q r s <-- classified as
20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | a = diaporthe-stem-canker
0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | b = charcoal-rot
0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | c = rhizoctonia-root-rot
0 0 0 88 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | d = phytophthora-rot
0 0 0 0 44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | e = brown-stem-rot
0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 0 | f = powdery-mildew
0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 | g = downy-mildew
0 0 0 0 0 0 0 77 0 0 0 0 5 6 4 0 0 0 0 | h = brown-spot
0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 | i = bacterial-blight
0 0 0 0 0 0 0 0 2 18 0 0 0 0 0 0 0 0 0 | j = bacterial-pustule
0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 | k = purple-seed-stain
0 0 0 0 0 0 0 0 0 0 0 44 0 0 0 0 0 0 0 | l = anthracnose
0 0 0 0 0 0 0 2 0 0 0 0 17 1 0 0 0 0 0 | m = phyllosticta-leaf-spot
0 0 0 0 0 0 0 0 0 0 0 0 0 91 0 0 0 0 0 | n = alternarialeaf-spot
0 0 0 0 0 0 0 3 0 0 0 0 0 22 65 1 0 0 0 | o = frog-eye-leaf-spot
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 | p = diaporthe-pod-&-stem-blight
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 0 | q = cyst-nematode
0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 14 0 | r = 2-4-d-injury
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 | s = herbicide-injury
RESULT
DECISION TREE
Ex.No.8
AIM
This experiment illustrates the use of the J48 classifier in WEKA. The sample data set used in this experiment is the weather data set, available in ARFF format. This document assumes that appropriate data pre-processing has been performed.
PROCEDURE:
1. Open the data file weather.arff in WEKA Explorer.
2. Next we select the 'Classify' tab and click the 'Choose' button to select the 'J48' classifier.
3. Now we specify the various parameters. These can be specified by clicking in the text box to the right of the 'Choose' button. In this example, we accept the default values; the default version does perform some pruning but does not perform error pruning.
4. Under the 'Test options' in the main panel, we select 10-fold cross-validation as our evaluation approach. Since we do not have a separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated model.
5. We now click 'Start' to generate the model. The ASCII version of the tree as well as the evaluation statistics will appear in the right panel when the model construction is complete.
6. Note the classification accuracy of the model (about 64% here). This indicates that more work may be needed, either in preprocessing or in selecting the current parameters for the classification.
7. WEKA also lets us view a graphical version of the classification tree. This can be done by right-clicking the last result set and selecting 'Visualize tree' from the pop-up menu.
9. In the main panel, under 'Test options', click the 'Supplied test set' radio button and then click the 'Set' button. This will pop up a window which allows you to open the file containing the test instances.
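A minimal Java sketch of the same J48 run, under the same assumptions (weather.arff as a placeholder path, class attribute last):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);     // 'play' is the class

        J48 tree = new J48(); // default settings: pruned C4.5 decision tree

        // 10-fold cross-validation, as in the Explorer run.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));

        // Build on the full data once more to print the tree itself.
        tree.buildClassifier(data);
        System.out.println(tree);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}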
Dataset weather.arff
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
Number of Instances: 14
Number of Attributes: 5
Number of Class: 2
The following screenshot shows the decision tree that was generated when the J48 algorithm was applied to the given dataset.
OUTPUT:
Correctly Classified Instances 9 64.2857 %
Incorrectly Classified Instances 5 35.7143 %
Kappa statistic 0.186
Mean absolute error 0.2857
Root mean squared error 0.4818
Relative absolute error 60 %
Root relative squared error 97.6586 %
Coverage of cases (0.95 level) 92.8571 %
Mean rel. region size (0.95 level) 64.2857 %
Total Number of Instances 14
a b <-- classified as
7 2 | a = yes
3 2 | b = no
RESULT:
SUPPORT VECTOR MACHINES
Ex.No.9
AIM
This experiment illustrates the use of a support vector classifier in WEKA. The sample data set used in this experiment is the vote data set, available in ARFF format. This document assumes that appropriate data pre-processing has been performed.
PROCEDURE:
1. Open the data file vote.arff in WEKA Explorer.
2. Next we select the 'Classify' tab and click the 'Choose' button, then select the support vector machine classifier (SMO, under 'functions').
3. Now we specify the various parameters. These can be specified by clicking in the text box to the right of the 'Choose' button.
4. Under the 'Test options' in the main panel, we select 10-fold cross-validation as our evaluation approach. Since we do not have a separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated model.
5. We now click 'Start' to generate the model. The model description as well as the evaluation statistics will appear in the right panel when the model construction is complete.
6. Note the classification accuracy of the model (about 96% here). A low accuracy would indicate that more work is needed, either in preprocessing or in selecting the current parameters for the classification.
7. The run information of the support vector classifier will be displayed with the correctly and incorrectly classified instances.
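A minimal Java sketch of the same experiment, assuming vote.arff is available locally (placeholder path); WEKA's SMO class is its support vector machine implementation:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SVMDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("vote.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);  // party affiliation is the class

        // Sequential minimal optimization; uses a linear polynomial kernel by default.
        SMO svm = new SMO();

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(svm, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toMatrixString());
    }
}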
Dataset vote.arff
@relation vote
@attribute 'handicapped-infants' { 'n', 'y'}
@attribute 'water-project-cost-sharing' { 'n', 'y'}
@attribute 'adoption-of-the-budget-resolution' { 'n', 'y'}
@attribute 'physician-fee-freeze' { 'n', 'y'}
@attribute 'el-salvador-aid' { 'n', 'y'}
@attribute 'religious-groups-in-schools' { 'n', 'y'}
@attribute 'anti-satellite-test-ban' { 'n', 'y'}
@attribute 'aid-to-nicaraguan-contras' { 'n', 'y'}
@attribute 'mx-missile' { 'n', 'y'}
@attribute 'immigration' { 'n', 'y'}
@attribute 'synfuels-corporation-cutback' { 'n', 'y'}
@attribute 'education-spending' { 'n', 'y'}
@attribute 'superfund-right-to-sue' { 'n', 'y'}
@attribute 'crime' { 'n', 'y'}
@attribute 'duty-free-exports' { 'n', 'y'}
OUTPUT:
Correctly Classified Instances 418 96.092 %
Incorrectly Classified Instances 17 3.908 %
Kappa statistic 0.9178
Mean absolute error 0.0391
Root mean squared error 0.1977
Relative absolute error 8.2405 %
Root relative squared error 40.6018 %
Coverage of cases (0.95 level) 96.092 %
Mean rel. region size (0.95 level) 50 %
Total Number of Instances 435
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.963 0.042 0.973 0.963 0.968 0.96 democrat
0.958 0.037 0.942 0.958 0.95 0.96 republican
Weighted Avg. 0.961 0.04 0.961 0.961 0.961 0.96
=== Confusion Matrix ===
a b <-- classified as
257 10 | a = democrat
7 161 | b = republican
RESULT:
BANK APPLICATION
Ex.No.10
AIM
To analyze a banking application using the Naive Bayes classification method.
PROCEDURE
Loading the Data
In addition to its native ARFF data file format, WEKA has the capability to read ".csv" format files. This is fortunate, since many databases and spreadsheet applications can save or export data into flat files in this format. As can be seen in the sample data file, the first row contains the attribute names (separated by commas) followed by each data row with attribute values listed in the same order (also separated by commas). Once loaded into WEKA, the data set can be saved into ARFF format.
Load the data set into WEKA and perform a series of operations using WEKA's preprocessing filters. Initially (in the Preprocess tab) click "Open" and navigate to the directory containing the data file (.csv or .arff).
bank_data.csv:
The data contains the following fields:
id -- a unique identification number
age -- age of customer in years (numeric)
sex -- MALE / FEMALE
region -- inner_city / rural / suburban / town
income -- income of customer (numeric)
married -- is the customer married (YES/NO)
children -- number of children (numeric)
car -- does the customer own a car (YES/NO)
save_act -- does the customer have a savings account (YES/NO)
current_act -- does the customer have a current account (YES/NO)
mortgage -- does the customer have a mortgage (YES/NO)
eligibility -- whether the customer is eligible for a loan (YES/NO)
Evaluator: weka.attributeSelection.CfsSubsetEval
Search:weka.attributeSelection.LinearForwardSelection -D 0 -N 5 -I -K 50 -T 0
Relation: bank-data-weka.filters.unsupervised.attribute.Remove-R1
Instances: 600
Attributes: 11
age
sex
region
income
married
children
car
save_act
current_act
mortgage
Eligibility
Evaluation mode:evaluate on all training data
=== Attribute Selection on all input data ===
Search Method:
Linear Forward Selection.
Start set: no attributes
Forward selection method: forward selection
Stale search after 5 node expansions
Linear Forward Selection Type: fixed-set
Number of top-ranked attributes that are used: 11
Total number of subsets evaluated: 63
Merit of best subset found: 0.099
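The attribute selection step shown above can be reproduced in code. A minimal sketch using the CFS subset evaluator; GreedyStepwise forward search is used here as a stand-in, since the LinearForwardSelection search from the output may not be present in every WEKA build (bank-data.arff is a placeholder path):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AttrSelDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank-data.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);       // Eligibility is the class

        // CFS rates attribute subsets by merit; a greedy forward search explores them.
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());
        GreedyStepwise search = new GreedyStepwise();
        search.setSearchBackwards(false); // forward selection
        selector.setSearch(search);

        selector.SelectAttributes(data);
        System.out.println(selector.toResultsString());
    }
}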
Scheme:weka.classifiers.bayes.NaiveBayes
Relation: bank-data-weka.filters.unsupervised.attribute.Remove-R1
Instances: 600
Attributes: 11
age
sex
region
income
married
children
car
save_act
current_act
mortgage
Eligibility
Test mode:evaluate on training data
Class
Attribute YES NO
(0.46) (0.54)
=====================================
age
mean 45.1277 40.0982
std. dev. 14.3018 14.1018
weight sum 274 326
precision 1 1
sex
FEMALE 131.0 171.0
MALE 145.0 157.0
[total] 276.0 328.0
region
INNER_CITY 124.0 147.0
TOWN 72.0 103.0
RURAL 47.0 51.0
SUBURBAN 35.0 29.0
[total] 278.0 330.0
income
mean 30644.8069 24902.2958
std. dev. 13585.1095 11640.5073
weight sum 274 326
precision 97.1838 97.1838
married
NO 121.0 85.0
YES 155.0 243.0
[total] 276.0 328.0
children
mean 0.9453 1.0675
car
NO 137.0 169.0
YES 139.0 159.0
[total] 276.0 328.0
save_act
NO 96.0 92.0
YES 180.0 236.0
[total] 276.0 328.0
current_act
NO 64.0 83.0
YES 212.0 245.0
[total] 276.0 328.0
mortgage
NO 183.0 210.0
YES 93.0 118.0
[total] 276.0 328.0
Time taken to build model: 0.01 seconds
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 393 65.5 %
Incorrectly Classified Instances 207 34.5 %
Kappa statistic 0.2956
Mean absolute error 0.4154
Root mean squared error 0.4613
Relative absolute error 83.7093 %
Root relative squared error 92.6161 %
Total Number of Instances 600
TEXT CLASSIFICATION
Ex.No.11
AIM
To classify text documents based on movie reviews.
Algorithm
1. Create text documents and store them in a folder.
2. Open the text documents in WEKA using the TextDirectoryLoader option.
3. Assign the attribute values for converting the text documents into ARFF format.
4. Information gain feature selection and the NB classifier are selected.
5. Accuracy is measured after executing the classifier with information gain on the text documents converted to ARFF data.
Procedure
Assign IDFTransform = True,
TFTransform = True,
lowerCaseTokens = True,
Stemmer = IteratedLovinsStemmer,
useStoplist = True,
Tokenizer = WordTokenizer (unigrams),
then Apply - OK.
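These filter settings correspond to the following API calls. A minimal sketch, assuming the reviews live in a txt_sentoken directory with one sub-folder per class (placeholder path):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.core.stemmers.IteratedLovinsStemmer;
import weka.core.tokenizers.WordTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TextToVectors {
    public static void main(String[] args) throws Exception {
        // Each sub-directory (pos/neg) becomes a class label (placeholder path).
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("txt_sentoken"));
        Instances raw = loader.getDataSet();

        // Convert the review text into TF-IDF word vectors, as configured above.
        StringToWordVector filter = new StringToWordVector();
        filter.setIDFTransform(true);
        filter.setTFTransform(true);
        filter.setLowerCaseTokens(true);
        filter.setStemmer(new IteratedLovinsStemmer());
        filter.setTokenizer(new WordTokenizer()); // unigram tokenizer
        filter.setWordsToKeep(1000);
        filter.setInputFormat(raw);

        Instances vectors = Filter.useFilter(raw, filter);
        System.out.println("Attributes created: " + vectors.numAttributes());
    }
}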
Preprocess -> Filter -> supervised -> attribute -> AttributeSelection -> OK -> OK -> Apply
Evaluator: weka.attributeSelection.InfoGainAttributeEval
Search:weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation: C__Users_INTEL_Desktop_txt_sentoken-weka.filters.unsupervised.attribute.StringToWordVector-R1-W1000-prune-rate-1.0-T-I-N0-L-S-stemmerweka.core.stemmers.IteratedLovinsStemmer-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\'\"()?!"
Instances: 2000
Attributes: 1172
[list of attributes omitted]
Evaluation mode:evaluate on all training data
Search Method:
Attribute ranking.
Instances: 2000
Attributes: 10
bad
wast
worst
stupid
bor
perfect
ridicl
portr
outstand
@@class@@
Test mode:10-fold cross-validation
Class
Attribute        neg       pos
                (0.5)     (0.5)
===============================
bad
mean 0.3387 0.1693
wast
mean 0.2992 0.0581
std. dev. 0.5875 0.2846
weight sum 1000 1000
precision 1.4525 1.4525
worst
mean 0.2868 0.0636
std. dev. 0.5846 0.2999
weight sum 1000 1000
precision 1.4784 1.4784
stupid
mean 0.2788 0.0594
std. dev. 0.5892 0.295
weight sum 1000 1000
precision 1.5237 1.5237
bor
mean 0.2988 0.1089
std. dev. 0.5375 0.3549
weight sum 1000 1000
precision 1.2659 1.2659
perfect
mean 0.1379 0.3071
std. dev. 0.3681 0.4999
weight sum 1000 1000
precision 1.1208 1.1208
ridicl
mean 0.2296 0.0629
std. dev. 0.5811 0.321
weight sum 1000 1000
precision 1.7006 1.7006
portr
mean 0.0896 0.2568
std. dev. 0.3546 0.5635
weight sum 1000 1000
precision 1.4932 1.4932
outstand
mean 0.0139 0.1504
RESULT
DISCRETIZATION
Ex.No.12
AIM
To perform the task of data discretization on a student dataset.
Procedure
Association rule mining can only be performed on categorical data. This requires performing discretization on numeric or continuous attributes.
1. Divide the values of the age attribute into three bins (intervals), as sketched below.
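A minimal sketch of this binning step via the WEKA API; the file name student.arff and the position of the age attribute (first) are assumptions for illustration:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("student.arff"); // placeholder path

        // Equal-width binning of the age attribute into three intervals.
        Discretize discretize = new Discretize();
        discretize.setAttributeIndices("first"); // assumes age is the first attribute
        discretize.setBins(3);
        discretize.setInputFormat(data);

        Instances binned = Filter.useFilter(data, discretize);
        System.out.println(binned.attribute(0)); // shows the generated intervals
    }
}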
Output:
Result
Algorithm:
1. Load the dataset vote.arff in WEKA.
2. Set the class label for the dataset.
3. Set the cross-validation property to 10 folds.
4. Evaluate the dataset with the Naive Bayes classifier.
5. Store the classification and prediction values.
6. Display the accuracy for the dataset with the NB classifier.
// Train the classifier on the training split and evaluate it on the testing split.
public static Evaluation classify(Classifier model, Instances trainingSet, Instances testingSet) throws Exception {
    Evaluation evaluation = new Evaluation(trainingSet);
    model.buildClassifier(trainingSet);
    evaluation.evaluateModel(model, testingSet);
    return evaluation;
}
// Fraction of cross-validation predictions whose predicted class matches the actual class.
public static double calculateAccuracy(FastVector predictions) {
    double correct = 0;
    for (int i = 0; i < predictions.size(); i++) {
        NominalPrediction np = (NominalPrediction) predictions.elementAt(i);
        if (np.predicted() == np.actual()) {
            correct++;
        }
    }
    return 100 * correct / predictions.size();
}
public static Instances[][] crossValidationSplit(Instances data, int numberOfFolds) {
    Instances[][] split = new Instances[2][numberOfFolds];
    // Row 0 holds the training folds, row 1 the corresponding testing folds.
    for (int i = 0; i < numberOfFolds; i++) {
        split[0][i] = data.trainCV(numberOfFolds, i);
        split[1][i] = data.testCV(numberOfFolds, i);
    }
    return split;
}
Classifier[] models = {
    new NaiveBayes()
    // new J48(), // a decision tree
    // new PART(),
    // new DecisionTable(), // decision table majority classifier
    // new DecisionStump() // one-level decision tree
};
// Run for each model
for (Classifier model : models) {
// Collect every group of predictions for current model in a FastVector
FastVector predictions = new FastVector();
// For each training-testing split pair, train and test the classifier
for (int i = 0; i<trainingSplits.length; i++) {
Evaluation validation = classify(model, trainingSplits[i], testingSplits[i]);
predictions.appendElements(validation.predictions());
// Uncomment to see the summary for each training-testing pair.
//System.out.println(models[j].toString());
}
// Calculate overall accuracy of current classifier on all splits
double accuracy = calculateAccuracy(predictions);
// Print current classifier's name and accuracy in a complicated,
// but nice-looking way.
System.out.println("Accuracy of " + model.getClass().getSimpleName() + ": " +
String.format("%.2f%%", accuracy) + "\n---------------------------------");
}
}
// private static void wekaattrsel(ASEvaluation eval, ASSearch search) {
//     throw new UnsupportedOperationException("Not supported yet.");
// }
}
KNOWLEDGE FLOW
Output:
=== Evaluation result ===
Scheme: NaiveBayes
Relation: vote
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.891 0.083 0.944 0.891 0.917 0.797 0.973 0.984 democrat
0.917 0.109 0.842 0.917 0.877 0.797 0.973 0.957 republican
Weighted Avg. 0.901 0.093 0.905 0.901 0.902 0.797 0.973 0.973
a b <-- classified as
238 29 | a = democrat
14 154 | b = republican
RESULT
A clustering algorithm is used to group sets of data with similar characteristics, also called clusters. These clusters help in making faster decisions and in exploring data.
20. Explain the Association algorithm in data mining.
The Association algorithm is used for recommendation engines that are based on market basket analysis. Such an engine suggests products to customers based on what they bought earlier. The model is built on a dataset containing identifiers.
21. What are the goals of data mining?
Prediction, identification, classification and optimization
22. Is data mining an independent subject?
No, it is an interdisciplinary subject. It includes database technology, visualization, machine learning, pattern recognition, algorithms etc.
23. What are the different types of databases?
Relational databases, data warehouses and transactional databases.
24. What are the data mining functionalities?
Mining frequent patterns, association rules, classification and prediction, clustering, evolution analysis and outlier analysis.
25. What are the issues in data mining?
Issues in mining methodology, performance issues, user interaction issues, and issues arising from different source data types.
26. List some applications of data mining.
Agriculture, biological data analysis, call record analysis, DSS, business intelligence systems etc.
27. What do you mean by an interesting pattern?
A pattern is said to be interesting if it is (1) easily understood by humans, (2) valid, (3) potentially useful and (4) novel.
28. Why do we pre-process the data?
To ensure data quality (accuracy, completeness, consistency, timeliness, believability, interpretability).
29. What are the steps involved in data pre-processing?
Data cleaning, data integration, data reduction, data transformation.
30. What is a distributed data warehouse?
A distributed data warehouse shares data across multiple data repositories for the purpose of OLAP operations.
31. Define virtual data warehouse.
A virtual data warehouse provides a compact view of the data inventory. It contains metadata and uses middleware to establish connections between different data sources.
32. What are the different data warehouse models?
Enterprise data warehouse, data marts and virtual data warehouse.
33. List a few roles of the data warehouse manager.
Creation of data marts, handling users, concurrency control, updates etc.
34. What are the different types of cuboids?
The 0-D cuboid is called the apex cuboid, the n-D cuboid is called the base cuboid, and the cuboids in between are middle cuboids.
35. What are the forms of the multidimensional model?
Star schema
Snowflake schema
Fact constellation schema
36. What are frequent patterns?
Sets of items that appear frequently together in a transaction data set, e.g. milk, bread, sugar.
37. What are the issues regarding classification and prediction?
Preparing data for classification and prediction, and comparing classification and prediction methods.
38. Define model overfitting.
A model that fits the training data well can still have high generalization error. Such a situation is called model overfitting.
39. What are the methods to avoid model overfitting?
Pruning (pre-pruning and post-pruning), constraining the size of the decision tree, and making the stopping criteria more flexible.
40. What is regression?
Regression can be used to model the relationship between one or more independent variables and a dependent variable. Types: linear regression and non-linear regression.
41. Compare the K-means and K-medoids algorithms.
K-medoids is more robust than K-means in the presence of noise and outliers, but K-medoids can be computationally costly.
42. What is the K-nearest neighbor algorithm?
It is one of the lazy learner algorithms used in classification. It finds the k nearest neighbors of the point of interest.
76. If there are 3 dimensions, how many cuboids are there in cube?
2^3 = 8 cuboids
77. Differentiate between star schema and snowflake schema.
A star schema is a multi-dimensional model where each disjoint dimension is represented in a single table. A snowflake schema is a normalized multi-dimensional schema where each disjoint dimension is represented in multiple tables.
78. List the advantages of the star schema.
A star schema is very easy to understand, even for non-technical business managers, and it provides better performance and smaller query times.
79. What are the characteristics of a data warehouse?
Integrated, non-volatile, subject-oriented and time-variant.
80. Define support and confidence.
The support of a rule is the fraction of transactions that contain all the items appearing in the rule. The confidence of a rule A => B is the fraction of transactions containing A that also contain B.
81. What are the criteria on the basis of which classification and prediction can be compared?
VLDB is an abbreviation for Very Large Database; its size is generally taken to be more than one terabyte. These are decision support systems used to serve a large number of users.
91. What is real-time data warehousing?
Real-time data warehousing captures business data whenever it occurs. As soon as a business activity is completed, that data becomes available in the flow and can be used instantly.
92. What are aggregate tables?
Aggregate tables are tables which contain the existing warehouse data grouped to a certain level of dimensions.
93. What is a factless fact table?
A factless fact table is a fact table which does not contain any numeric fact columns.
94. How can we load the time dimension?
Time dimensions are usually loaded with all possible dates in a year, and this can be done through a program. Here, 100 years can be represented with one row per day.
95. What are non-additive facts?
Non-additive facts are facts that cannot be summed up over any of the dimensions present in the fact table. If there are changes in the dimensions, the same facts can be useful.
A data warehouse is a place where the whole data is stored for analysis, whereas OLAP is used for analyzing the data, managing aggregations, and partitioning information into minor-level detail.
100. What are the key columns in fact and dimension tables?
The foreign keys of dimension tables are the primary keys of the entity tables. The foreign keys of fact tables are the primary keys of the dimension tables.