Weka Regression LinearRegression
In this exercise, you will use the cholesterol dataset to build a linear model that expresses the
relationship between cholesterol level and an individual’s identification and vitals data, including age,
sex, presence of heart disease, chest pain type, and blood pressure. The model is built from the
training data, where the cholesterol level is known. Then the model is used to estimate the cholesterol
level for new data, and the model accuracy is evaluated.
Table of Contents
Introduction
1.0 The Data File Content
2.0 Run the Algorithm
    2.1 Loading the Data File
    2.2 Setting Test Options
    2.3 Setting Evaluation Options
    2.4 Algorithm Parameters
3.0 Analyzing Result
    3.1 Run Information
    3.2 Model
    3.3 Predictions on Test Split
    3.4 Evaluation on Test Split
4.0 Results Visualization
Introduction
To protect the individual’s identity, the social security numbers have been removed. The values of the
categorical attributes were recoded as numeric classes. For example, sex=0 stands for sex=female.
The relationship between cholesterol level and the predictors is expressed as a multiple regression
equation.
𝑦 = 𝛼 + 𝛽1 ∗ 𝑥1 + 𝛽2 ∗ 𝑥2 + ⋯ + 𝛽𝑛 ∗ 𝑥𝑛
x1 through xn are the values of the predictors (independent variables), where n is the number of predictors.
The goal of the algorithm is to find the values of α and β1 through βn for which the average squared
difference between the estimated and actual cholesterol levels is minimal.
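To make the fitting goal concrete, the following is a minimal sketch of ordinary least squares for a single predictor, which picks the intercept and coefficient that minimize the sum of squared differences between estimated and actual values. It is not Weka's implementation; the function name fit_simple_ols and the toy ages and cholesterol values are assumptions chosen for illustration and are not taken from the cholesterol dataset.

```python
from statistics import mean


def fit_simple_ols(xs, ys):
    """Return (alpha, beta) minimizing the sum of squared differences
    between alpha + beta * x and y for a single predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Closed-form least-squares solution for one predictor.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    return alpha, beta


# Hypothetical toy data, not taken from the cholesterol dataset.
ages = [40, 50, 60, 70]
chol = [200, 210, 220, 230]
alpha, beta = fit_simple_ols(ages, chol)
print(alpha, beta)
```

With several predictors the same idea applies, but the coefficients are found by solving a system of linear equations rather than with this one-variable formula.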
The algorithm requires that the dependent variable y be numeric. Otherwise, the LinearRegression option
is disabled in the Weka menu. The independent attributes can be numeric, binary, or nominal.
As you go through the exercise, notice that some menu options that we used for classification algorithms
are disabled for prediction algorithms, including linear regression. The algorithm output does not have
a confusion matrix or a detailed accuracy by class section.
Depending on the Weka version you use, your results might be slightly different.
1.0 The Data File Content
• Nominal (categorical) – an attribute definition includes the @attribute tag, an attribute name in
quotes, and a list of valid values in braces. The quotes around an attribute name are optional if the
name does not contain whitespace.
• Real (continuous numbers) – an attribute definition includes the @attribute tag, an attribute
name, and the keyword real. The keyword real is case insensitive.
[Figure callouts: tag; attribute name; valid values.]
Missing values are represented as ?. For example, the last instance is missing a value for the ca attribute.
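The attribute definitions and the missing-value marker described above can be illustrated with a small parser. This is a minimal sketch, not Weka's ARFF reader: the function names parse_attribute and parse_row are hypothetical, the sample lines are modeled on the attributes described in this exercise rather than copied from cholesterol.arff, and quoted names containing commas or escaped quotes are not handled.

```python
def parse_attribute(line):
    """Parse one '@attribute' line into (name, spec).

    spec is a list of valid values for a nominal attribute,
    or a lowercase type keyword such as 'real'.
    """
    tag, rest = line.strip().split(None, 1)
    if tag.lower() != "@attribute":
        raise ValueError("not an @attribute line")
    rest = rest.strip()
    if rest[0] in "'\"":
        # Quoted attribute name: read up to the matching quote.
        quote = rest[0]
        end = rest.index(quote, 1)
        name, spec = rest[1:end], rest[end + 1:].strip()
    else:
        name, spec = rest.split(None, 1)
        spec = spec.strip()
    if spec.startswith("{") and spec.endswith("}"):
        return name, [v.strip() for v in spec[1:-1].split(",")]
    return name, spec.lower()


def parse_row(line):
    """Split one data line; a '?' becomes None (missing value)."""
    return [None if v.strip() == "?" else v.strip() for v in line.split(",")]


# Illustrative lines, assumed for this sketch.
print(parse_attribute("@attribute 'cp' {1,2,3,4}"))
print(parse_attribute("@attribute trestbps REAL"))
print(parse_row("57,0,4,?"))
```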
[Figure callouts: relation; header; attributes; data (instances).]
cp: chest pain type (nominal)
    1 = typical angina: 23 instances
    2 = atypical angina: 144 instances
    3 = non-anginal pain: 86 instances
    4 = asymptomatic: 50 instances
trestbps: resting blood pressure (real)
    mean = 131.69
    StdDev = 17.6
fbs: is fasting blood sugar > 120 mg/dl (nominal)
    1 = true: 45 instances
    0 = false: 258 instances
restecg: resting electrocardiographic results (nominal)
    0 = normal: 151 instances
    1 = ST-T wave abnormality: 4 instances
    2 = left ventricular hypertrophy: 148 instances
thalach: maximum heart rate achieved (real)
    mean = 149.607
    StdDev = 22.875
oldpeak: depression induced by exercise (real)
    mean = 1.04
    StdDev = 1.161
slope: nominal
    1 = upsloping: 142 instances
    2 = flat: 140 instances
    3 = downsloping: 21 instances
ca: number of major vessels (real)
    mean = 0.672
    StdDev = 0.937
    4 missing values
thal: nominal
    3 = normal: 166 instances
    6 = fixed defect: 18 instances
    7 = reversible defect: 117 instances
    2 missing values
num: presence of heart disease (real)
    mean = 0.937
    StdDev = 1.229
chol: cholesterol level (real)
    mean = 246.693
    StdDev = 51.777
2.0 Run the Algorithm
Click the Explorer button to open the Weka Explorer interface shown in Figure 3. By default, the Preprocess
tab is active. Since we have not loaded a data file, the attribute list and the Selected attribute panel are
empty. The remaining tabs are greyed out.
[Figure 3 callouts: since we have not opened the data file, only the Preprocess tab is active; the rest of the tabs are greyed out.]
Click the Open file… button shown in Figure 4 and browse to open the cholesterol.arff data file.
[Figure 4 callout: click to open the data file.]
Once the data file is loaded, all tabs become available. The Current relation panel in Figure 5 displays
the relation name, the number of instances, and the number of attributes. The Selected attribute panel shows
the statistics for the first attribute, age, which is selected in the attribute list by default. Since the age
attribute is numeric, the statistics are the minimum, maximum, mean, and StdDev.
The drop-down under the Selected attribute panel specifies the dependent variable, that is, the variable
to be predicted. The chol attribute is selected by default because it is the last attribute. Under the
drop-down is the histogram of the age attribute values.
[Figure 5 callouts: the Current relation panel shows that the cholesterol dataset has 303 instances with 14 attributes; the selected attribute age is numeric; statistics for the selected attribute age; click Visualize All to view the histograms for all attributes; attributes list; age is the selected attribute; by default, the last attribute chol is selected as the dependent variable; histogram for the age attribute.]
Click the Classify tab to open the interface shown in Figure 6. By default, the ZeroR algorithm is selected.
The chol attribute is selected as the dependent variable in the drop-down under the More options… button.
[Figure 6 callouts: clicking the Choose button opens the hierarchical menu with data mining algorithms; the ZeroR algorithm is selected by default; test options; the attribute to predict (dependent variable), where the last attribute chol is selected by default.]
Click the Choose button to expand the hierarchical menu shown in Figure 7. Expand the
classifiers folder, expand the functions folder, and select LinearRegression from the functions list.
[Figure 7 callouts: expand the classifiers folder; expand the functions folder; select LinearRegression.]
Percentage split: The value in the ‘%’ field specifies the percentage of data to be used for building the
model (training data). By default, the value is set to 66%. After the model is built, the
remaining 34% of the data (test data) is used to test the accuracy of the model.
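The effect of the percentage split on the 303-instance dataset can be sketched as follows. This is an illustration only: the function name percentage_split, the shuffling, and the rounding rule are assumptions, and Weka's own shuffling and rounding may differ in detail.

```python
import random


def percentage_split(instances, percent=66, seed=1):
    """Shuffle the instances and split them into (training, test) lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable results
    train_size = round(len(shuffled) * percent / 100)
    return shuffled[:train_size], shuffled[train_size:]


# Instance indices stand in for the 303 rows of the dataset.
train, test = percentage_split(range(303))
print(len(train), len(test))  # prints "200 103"
```

Under these assumptions, about 200 instances are used to build the model and the remaining 103 are held back to test it.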
[Figure callout: algorithm name.]
Click OK to continue.
[Figure 10 callouts: click to open the generic object editor; click More to read about the algorithm parameters; click to read about the algorithm's attribute type requirements; choose the M5 attribute selection method; select true to eliminate collinear attributes; the ridge parameter shrinks the coefficient values to minimize overfitting (keep the default); click to continue.]
Figure 10: Specify LinearRegression parameters in Generic Object Editor
Click Start to run the algorithm. The algorithm results are displayed in the Classifier output panel shown in
Figure 11. The Result list panel has a new entry.
[Figure 11 callouts: algorithm name and specified parameters; percentage split is the selected test option; the attribute to predict is chol; click to run the algorithm; new result list entry; attributes; test option.]
Right-click the last result list entry in the bottom left panel to open the pop-up menu, select Save result
buffer, and save the file as resultbuffer1.txt.
[Figure callouts: right-click the result list entry; select Save result buffer.]
Scheme: weka.classifiers.functions.LinearRegression
Relation name: cholesterol
Number of instances: 303
Number of attributes: 14
Attributes: the attribute list, including the independent and dependent attributes
Test mode: percentage split, with 66% of the data used for training
[Figure callouts: algorithm name and specified options; relation name; number of instances and number of attributes; 66% of the dataset is used as training data, and the remaining 34% is used as test data.]
3.2 Model
The algorithm generates a linear function, which is the weighted sum of the independent attributes.
𝑦 = 𝛼 + 𝛽1 ∗ 𝑥1 + 𝛽2 ∗ 𝑥2 + ⋯ + 𝛽𝑛 ∗ 𝑥𝑛
where y is the dependent variable, α is the intercept, x1 through xn are the independent variables, and
β1 through βn are the coefficients.
Each coefficient β is the change in cholesterol level when the value of the corresponding independent
variable increases by 1 while the values of the other variables remain constant. For example, the coefficient
1.0949 for the age attribute means that increasing age by one year adds 1.0949 to the predicted cholesterol level.
Although the dataset has 13 independent attributes, only the age, sex, restecg, thalach, and thal attributes
are used in the model in Figure 15 because we chose the M5 attribute selection method. This suggests that
the omitted attributes cp, trestbps, fbs, exang, oldpeak, slope, ca, and num do not significantly affect the
cholesterol level.
For example (the final term, 131.4592, is the intercept):
chol = 1.0949 * age + 24.0828 * sex=0 + 16.0357 * restecg=2,1 + 0.2328 * thalach + 12.9793 * thal=7 + 131.4592
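The example equation above can be evaluated by hand or with a short function. The sketch below assumes that a term such as sex=0 is an indicator that equals 1 when the attribute has that value and 0 otherwise, and that restecg=2,1 equals 1 when restecg is either 2 or 1. The function name predict_chol and the sample patient values are hypothetical.

```python
def predict_chol(age, sex, restecg, thalach, thal):
    """Evaluate the example model; nominal terms are 0/1 indicators."""
    return (1.0949 * age
            + 24.0828 * (sex == 0)            # indicator: sex=0
            + 16.0357 * (restecg in (1, 2))   # indicator: restecg=2,1 (assumed meaning)
            + 0.2328 * thalach
            + 12.9793 * (thal == 7)           # indicator: thal=7
            + 131.4592)                       # intercept


# Hypothetical patient, not an instance from the dataset.
base = predict_chol(age=50, sex=0, restecg=2, thalach=150, thal=7)
older = predict_chol(age=51, sex=0, restecg=2, thalach=150, thal=7)
print(round(base, 4))          # 274.222
print(round(older - base, 4))  # 1.0949, the age coefficient
```

The second printed value shows the interpretation of a coefficient: with every other attribute held constant, one additional year of age raises the predicted cholesterol level by 1.0949.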
The algorithm output includes the predicted and actual cholesterol levels for each instance in the test
set. Figure 16 shows the first 13 instances. The error is the predicted value minus the actual value. If we
graph the model and the data point corresponding to an actual cholesterol level, the prediction error is the
vertical distance between the line and the point.
The error is positive when the predicted value is higher than the actual value. The error is negative when
the predicted value is below the actual value.
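The sign convention for the error can be checked with a few lines of code. The first pair of values below (254.196 predicted, 185 actual) is the one called out in Figure 16; the second pair is hypothetical and is included only to show a negative error. The function name prediction_error is an assumption.

```python
def prediction_error(predicted, actual):
    """Error as defined in the text: predicted value minus actual value."""
    return predicted - actual


pairs = [
    (254.196, 185),  # from Figure 16: prediction is too high
    (240.0, 260.0),  # hypothetical: prediction is too low
]
errors = [prediction_error(p, a) for p, a in pairs]
for (p, a), e in zip(pairs, errors):
    sign = "positive" if e > 0 else "negative" if e < 0 else "zero"
    print(f"predicted={p} actual={a} error={e:.3f} ({sign})")
```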
[Figure 16 callouts: error = predicted value - actual value; error = vertical distance; predicted cholesterol level value; 254.196 - 185 = 69.196; negative error; positive error.]
Figure 16: Predictions on test split
The correlation coefficient is the correlation between the estimated and actual cholesterol levels. The
value ranges between -1 and 1. The magnitude of the correlation indicates the strength of the linear
relationship between the predictors and the dependent variable, cholesterol level. Hence, the relationship
is strongest as the correlation approaches -1 or 1. In our case, the correlation is 0.2202, which indicates a
weak linear relationship.
Although a value of 0 indicates the absence of a linear relationship between the predictors and the dependent
variable, it does not indicate the absence of a relationship in general. To capture a non-linear relationship,
we would need to consider non-linear models, such as quadratic regression and/or logarithmic regression.
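The following minimal sketch computes the Pearson correlation coefficient and shows why a value of 0 does not rule out a relationship. The function name pearson and all data values are hypothetical. In the last example y is exactly the square of x, yet the correlation is 0.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical values chosen for illustration.
r_up = pearson([1, 2, 3], [2, 4, 6])                    # perfect increasing line
r_down = pearson([1, 2, 3], [6, 4, 2])                  # perfect decreasing line
r_quad = pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])    # y = x squared
print(r_up, r_down, r_quad)
```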
Right-click the result list entry to open the pop-up menu, and select Visualize classifier errors, as shown in
Figure 18.
[Figure 18 callouts: right-click the result list entry to open the pop-up menu; select Visualize classifier errors from the pop-up menu.]
Make sure that predictedchol is selected for the Y-axis and chol is selected for the X-axis. Each instance
is represented as an X mark. The size of the X indicates the magnitude of the difference between the predicted
and actual cholesterol levels.
The cholesterol level value at the origin is the intercept α in the model above. The instances represented
by the smallest X marks form a line. The X marks are larger when they are further away from the line.
Click an X mark to view the corresponding instance information. The instance information in Figure 20
includes the algorithm name, the instance number, the independent attribute values, and the predicted and
actual cholesterol levels.
[Figure 20 callouts: algorithm name; instance number; independent attributes; click an X mark to open the corresponding instance info; predicted cholesterol; actual cholesterol.]