Introduction To Weka-A Toolkit For Machine Learning
Winter School on "Data Mining Techniques and Tools for Knowledge Discovery in Agricultural Datasets"
1. Introduction
Weka is open-source software released under the GNU General Public License. The system is developed at the University of Waikato in New Zealand; "Weka" stands for the Waikato Environment for Knowledge Analysis. The software is freely available at http://www.cs.waikato.ac.nz/ml/weka and is written in the object-oriented language Java. Weka provides implementations of state-of-the-art data mining and machine learning algorithms, and contains modules for data preprocessing, classification, clustering and association rule extraction. It can be used at several different levels, through four main interfaces:
• Explorer
– preprocessing, attribute selection, learning, visualisation
• Experimenter
– testing and evaluating machine learning algorithms
• Knowledge Flow
– visual design of the KDD process
• Simple Command-line
– a simple interface for typing commands
The Attribute-Relation File Format (ARFF) is the default file type for data analysis in Weka, but data can also be imported from various other formats. The ARFF version of the weather dataset from Weka's sample data is presented below. Attribute types are specified in the header: a nominal attribute lists its distinct values in curly brackets along with the attribute name, and a numeric attribute is specified by the keyword real along with the attribute name.
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
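The header conventions described above can be illustrated with a few lines of Python. This is a minimal sketch of parsing @attribute declarations, not Weka's own ARFF loader, and the helper name is mine:

```python
def parse_arff_header(text):
    """Collect (name, type) pairs from an ARFF header: nominal attributes
    yield their value list, numeric ones yield the keyword (e.g. 'real')."""
    attributes = []
    for line in text.splitlines():
        line = line.strip()
        if line.lower().startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):  # nominal: distinct values in curly brackets
                attributes.append((name, [v.strip() for v in spec.strip("{}").split(",")]))
            else:                     # numeric: keyword after the attribute name
                attributes.append((name, spec))
        elif line.lower().startswith("@data"):
            break
    return attributes

header = """@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
"""
attrs = parse_arff_header(header)
```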
2. WEKA Explorer
Weka expects the data file to be in ARFF format, because it needs type information about each attribute that cannot be automatically deduced from the attribute values. Before you can apply any algorithm to your data, it must therefore be converted to ARFF form. This can be done very easily. Most spreadsheet and database programs allow you to export your data into a file in comma-separated format, as a list of records where the items are separated by commas. Once this has been done, you need only load the file into a text editor or word processor, add the dataset's name using the @relation tag, the attribute information using @attribute, and a @data line, and save the file as raw text. The following example converts data from a Microsoft Excel spreadsheet: from the spreadsheet, save the data in .CSV format. In Weka, on the Preprocess tab, select Open file..., then select the Dataset.csv file (Fig. 2). Make sure that you have selected files of type CSV, or you won't see the dataset we want to open.
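The manual CSV-to-ARFF conversion described above can also be sketched in Python. This is an illustration only (Weka performs the equivalent conversion itself when a CSV file is opened); the function name is hypothetical, and it assumes the caller supplies the value sets for nominal columns, declaring every other column numeric:

```python
import csv
import io

def csv_to_arff(csv_text, relation, nominal_values):
    """Build ARFF text from CSV text: @relation, one @attribute line per
    column, then @data followed by the records."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for col in header:
        if col in nominal_values:   # nominal: distinct values in curly brackets
            lines.append("@attribute %s {%s}" % (col, ",".join(nominal_values[col])))
        else:                       # numeric: keyword real
            lines.append("@attribute %s real" % col)
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines)
```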
4. Data Preprocessing
Some attributes may not be required in the analysis; such attributes can be removed from the dataset before analysis. For example, the instance number attribute of the iris dataset is not needed. It can be removed by selecting it in the Attributes check box and clicking Remove (Fig. 3). The resulting dataset can then be stored in ARFF format.
If some attributes need to be removed before the data mining step, this can also be done using the attribute filters in Weka. In the Filter panel, click the Choose button. This shows a popup window with a list of available filters. Scroll down the list and select the weka.filters.unsupervised.attribute.Remove filter as shown in Figure 4. Next, click the text box immediately to the right of the Choose button. In the resulting dialog box, enter the index of the attribute to be filtered out (this can be a range or a comma-separated list). In this case we enter 1, which is the index of the "id" attribute (see the left panel). Make sure that the invertSelection option is set to false (otherwise everything except attribute 1 will be filtered) (Fig. 5). Then click OK.
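The effect of the Remove filter can be sketched in plain Python. This is only an illustrative analogue of weka.filters.unsupervised.attribute.Remove with its 1-based index list and invertSelection option, not Weka's implementation:

```python
def remove_attributes(instances, indices, invert_selection=False):
    """Drop the 1-based attribute positions in `indices` from each instance.
    With invert_selection=True, keep only those positions instead
    (mirroring Weka's invertSelection behaviour)."""
    chosen = {i - 1 for i in indices}
    keep = (lambda j: j in chosen) if invert_selection else (lambda j: j not in chosen)
    return [[v for j, v in enumerate(row) if keep(j)] for row in instances]
```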
4.2 Discretization
You can observe that Weka has assigned its own labels to each of the value ranges of the discretized attribute. For example, the lowest range of the "age" attribute is labeled "(-inf-34.333333]" (enclosed in single quotes and escape characters), the middle range is labeled "(34.333333-50.666667]", and so on. These labels now also appear in the data records wherever the original age value fell in the corresponding range.
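The labels above come from equal-width binning. The following sketch shows how such cut points and Weka-style interval labels might be produced; the exact label formatting is an assumption for illustration, not Weka's code:

```python
def equal_width_bins(values, n_bins):
    """Equal-width discretisation: compute n_bins-1 cut points over the
    value range and build interval labels such as '(-inf-34.333333]'."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    cuts = [lo + width * i for i in range(1, n_bins)]
    labels = []
    for i in range(n_bins):
        left = "-inf" if i == 0 else "%f" % cuts[i - 1]
        right = "inf" if i == n_bins - 1 else "%f" % cuts[i]
        labels.append("(%s-%s]" % (left, right))
    return cuts, labels
```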
5. Classification
A decision tree classifies instances by testing attribute values at its internal nodes. Usually the test at a node compares an attribute value with a constant, but some trees compare two attributes with each other, or use some function of one or more attributes. Leaf nodes give a classification that applies to all instances reaching the leaf, or a set of classifications, or a probability distribution over all possible classifications. To classify an unknown instance, it is routed down the tree according to the values of the attributes tested at successive nodes; when a leaf is reached, the instance is classified according to the class assigned to that leaf. ID3 is the basic decision tree classifier. The following is an example of ID3 on the weather data from Weka's sample datasets (Fig. 8).
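ID3 chooses, at each node, the attribute whose test yields the highest information gain. The sketch below computes entropy and information gain for the outlook attribute of the weather data; it is a minimal illustration of the criterion, not Weka's Id3 class:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index, labels):
    """Entropy of the class minus the weighted entropy after splitting
    on the attribute at attr_index: the quantity ID3 maximises."""
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder

outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
           "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"]
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]
gain = information_gain([[v] for v in outlook], 0, play)  # about 0.247 bits
```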
The first two columns are the TP Rate (true positive rate) and the FP Rate (false positive rate). For the first row, where play=yes, the TP Rate is the ratio of play=yes cases predicted correctly to the total number of positive cases (e.g., 8 out of 9 predicted correctly: 8/9 = 0.889).
The FP Rate is the ratio of play=no cases incorrectly predicted as play=yes to the total number of play=no cases. One play=no case was wrongly predicted as play=yes, so the FP Rate is 1/5 = 0.2.
The next two columns are terms from information retrieval theory. When one is searching for relevant documents, it is often not possible to get to them easily or directly. In many cases a search will yield many results, a large part of which are irrelevant, and it is often impractical to examine all results at once rather than a portion at a time. In such cases the terms recall and precision are important to consider.
Recall is the ratio of relevant documents found in the search result to the total number of relevant documents; higher recall values mean that more of the relevant documents have been returned. A recall of 30% at 10% means that 30% of the relevant documents were found with only 10% of the results examined. Precision is the proportion of relevant documents among the results returned: a precision of 0.75 means that 75% of the returned documents were relevant.
In our example, such measures are not very applicable: the recall just corresponds to the TP Rate, as we are always looking at 100% of the test sample, and precision is just the proportion of actual play=yes (or play=no) cases among those predicted as such.
The F-measure is a way of combining recall and precision scores into a single measure of performance. The formula for it is:

$F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
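Using the counts quoted above (8 of 9 play=yes cases correct, 1 of 5 play=no cases wrongly predicted as yes), the per-class measures can be reproduced with a short calculation; the function name is mine:

```python
def class_metrics(tp, fn, fp, tn):
    """TP rate, FP rate, precision, recall and F-measure for one class,
    in the style of Weka's per-class accuracy output."""
    tp_rate = recall = tp / (tp + fn)   # e.g. 8/9 = 0.889 for play=yes
    fp_rate = fp / (fp + tn)            # e.g. 1/5 = 0.2
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return tp_rate, fp_rate, precision, recall, f_measure
```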
The confusion matrix specifies the classes of the obtained results. For example, class a has the majority of its objects (8) from the yes category, hence a is treated as the class of the "yes" group. Similarly, b has the majority of its objects (4) from the no category, hence b is treated as the class of the "no" group. One object from each class is misclassified, which gives 2 misclassified instances. The user can also view a plot of the tree.
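The misclassified-instance count is simply the off-diagonal sum of the confusion matrix, as a quick check shows (an illustrative helper, not part of Weka):

```python
def misclassified(confusion):
    """Sum of the off-diagonal entries of a confusion matrix, i.e. the
    number of instances assigned to the wrong class."""
    return sum(v for i, row in enumerate(confusion)
                 for j, v in enumerate(row) if i != j)

# Matrix from the text: rows are actual classes (yes, no), columns are a, b.
errors = misclassified([[8, 1], [1, 4]])  # 2 misclassified instances
```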
6. Clustering
K-means is the most widely used clustering algorithm. The user needs to specify the number of clusters (k) in advance. The algorithm randomly selects k objects as initial cluster means (centers) and then works to minimise the squared-error criterion

$E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - m_i \rVert^2$,

where $m_i$ is the mean of cluster $C_i$.
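This scheme (random initial means, then alternating assignment and mean-update steps that reduce the criterion) can be sketched in a few lines of Python. This is illustrative only; Weka's SimpleKMeans adds its own seeding and options:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: pick k of the points as initial means, then
    repeatedly assign each point to its nearest mean and recompute means,
    lowering the squared-error criterion at each step."""
    rng = random.Random(seed)
    means = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, m)) for m in means]
            clusters[dists.index(min(dists))].append(p)
        for i, c in enumerate(clusters):
            if c:  # keep the old mean if a cluster becomes empty
                means[i] = [sum(dim) / len(c) for dim in zip(*c)]
    return means, clusters
```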
The following is an example of k-means on the weather data from Weka's sample datasets (Fig. 9).
Figure 10 shows the results of k-means on the weather data. The confusion matrix specifies the classes of the obtained results, since we have selected classes-to-clusters evaluation. For example, cluster0 has 9 objects in total, of which the majority (6) are from the yes category, hence this cluster is treated as the cluster of "yes". Similarly, cluster1 has 5 objects in total, of which 3 are from the no category, hence it is considered the cluster of the no category.
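The classes-to-clusters evaluation described above can be sketched as follows: each cluster is labelled with its majority class, and every other member counts as an error. This is an illustrative reading of Weka's output with a hypothetical helper name:

```python
from collections import Counter

def classes_to_clusters(cluster_labels):
    """Map each cluster to its majority class and count the remaining
    members of each cluster as incorrectly clustered instances."""
    assignment, errors = {}, 0
    for cluster, labels in cluster_labels.items():
        majority, count = Counter(labels).most_common(1)[0]
        assignment[cluster] = majority
        errors += len(labels) - count
    return assignment, errors

# Counts from the text: cluster0 has 6 yes / 3 no, cluster1 has 2 yes / 3 no.
assignment, errors = classes_to_clusters({
    "cluster0": ["yes"] * 6 + ["no"] * 3,
    "cluster1": ["yes"] * 2 + ["no"] * 3,
})
```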