Weka Regression LinearRegression

The document describes a dataset containing cholesterol and patient information used to build a linear regression model. The dataset has 303 records with 14 attributes, including age, sex, blood pressure readings, and other medical details. Some attributes are numeric while others are categorical. The goal is to predict cholesterol levels based on the other attribute values.

Uploaded by

Hazwan

UMUC

Prediction via Linear Regression


Prediction Exercise

DBST 667 Data Mining

In this exercise, you will use the cholesterol dataset to build a linear model that expresses the
relationship between cholesterol level and an individual’s identification and vitals data, including age,
sex, presence of heart disease, chest pain type, and blood pressure. The model is built from the
training data, where the cholesterol level is known. Then, the model is used to estimate the cholesterol
level for new data, and the model accuracy is evaluated.

Table of Contents
Introduction
1.0 The Data File Content
2.0 Run the Algorithm
2.1 Loading the Data File
2.2 Setting Test Options
2.3 Setting Evaluation Options
2.4 Algorithm Parameters
3.0 Analyzing Result
3.1 Run Information
3.2 Model
3.3 Predictions on Test Split
3.4 Evaluation on Test Split
4.0 Results Visualization




Introduction
The purpose of this exercise is to build a linear model for estimating the cholesterol level when an
individual’s age, sex, and vitals are known. The exercise illustrates the steps for building the model and
for using the model to estimate the cholesterol level for new data. The analyses include evaluating the
accuracy of the model and visualizing the predictions.

The modified version of the original dataset was taken from

http://tunedit.org/repo/UCI/numeric/cholesterol.arff

To protect the individual’s identity, the social security numbers have been removed. The values of the
categorical attributes were recoded as numeric classes. For example, sex=0 stands for sex=female.
The relationship between cholesterol level and predictors is expressed as a multiple regression
equation.
$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$$

y is the cholesterol level (dependent variable).

x1 – xn are the values of the predictors (independent variables), where n is the number of predictors.

β1 – βn are the regression coefficients, and α is the intercept.

The goal of the algorithm is to find the values of β1 – βn and α for which the difference between the
estimated and actual cholesterol levels is, on average, minimal (in linear regression, the sum of squared
differences is minimized).
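
To make the fitting goal concrete, here is a minimal Python sketch that fits a model with a single predictor by ordinary least squares. It assumes the usual criterion of minimizing the sum of squared differences. The data points and the function name `fit_simple_linear_regression` are made up for illustration and are not taken from the cholesterol dataset; Weka's LinearRegression handles many predictors and has additional options.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (alpha, beta) minimizing sum((alpha + beta*x - y)**2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares solution for one predictor.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Hypothetical toy data lying exactly on the line y = 1 + 2x.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [3.0, 5.0, 7.0, 9.0]
alpha, beta = fit_simple_linear_regression(toy_x, toy_y)
print(f"intercept={alpha:.4f} slope={beta:.4f}")
```

With several predictors, the same criterion applies, but the coefficients are found by solving a system of equations rather than by these two formulas.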

The algorithm requires that the dependent variable y is numeric. Otherwise, the LinearRegression option will be
disabled in the Weka menu. The independent attributes can be numeric, binary, or nominal.

As you go through the exercise, notice that some menu options that we used for classification algorithms
are disabled for prediction algorithms, including linear regression. The algorithm output does not have
a confusion matrix or a detailed accuracy by class section.

Depending on the Weka version you use, your results might be slightly different.

1.0 The Data File Content



Figure 1 shows the partial content of the cholesterol.arff file. The file header consists of
• Relation name - the data file name is specified after the @relation tag.
• Attributes list - each attribute definition follows the @attribute tag.
The dataset has 14 attributes. The attribute types are

• Nominal (categorical) – an attribute definition includes the @attribute tag, an attribute name in
quotes, and a list of valid values in braces. The quotes around an attribute name are optional if the
name does not contain whitespace.

[Illustration: a nominal attribute definition labeled with its tag, attribute name, and valid values]

• Real (continuous numbers) - an attribute definition includes the @attribute tag, an attribute
name, and the keyword real. The keyword real is case insensitive.

[Illustration: a real attribute definition labeled with its tag, attribute name, and the keyword real]



The @data token indicates the beginning of the data section. Each row in the data section (an instance)
corresponds to a specific cholesterol record, and there are 303 records. The order in which the attributes
are declared determines the column position in the data section. For example, if sex is the second attribute
in the list, the sex value for each cholesterol record is in the second column of the data row.

[Illustration: the first data row, with its leading values labeled age, sex, cp, trestbps, fbs, restecg, and thalach]

Missing values are represented as ?. For example, the last instance is missing a value for the ca attribute.

Figure 1: Partial cholesterol data file content (the relation name and the attribute definitions form the header; the data section holds the instances)
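
The file layout described above can be sketched in a few lines of Python. This is an illustrative reader only, not Weka's own parser: it assumes comma-separated data rows, ignores quoted names that contain spaces, and the function name `parse_arff` and the sample content are hypothetical (the sample values are made up and are not rows from cholesterol.arff).

```python
def parse_arff(text):
    """Return (relation, attributes, rows) for a simple ARFF document."""
    relation = None
    attributes = []  # (name, "real") or (name, [list of nominal values])
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            # "?" marks a missing value; store it as None.
            cells = [cell.strip() for cell in line.split(",")]
            rows.append([None if cell == "?" else cell for cell in cells])
            continue
        lower = line.lower()  # the tags are matched case-insensitively
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1].strip("'\"")
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            name = name.strip("'\"")
            if spec.startswith("{"):
                values = [v.strip() for v in spec.strip("{}").split(",")]
                attributes.append((name, values))
            else:
                attributes.append((name, spec.lower()))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# Hypothetical miniature file for illustration only.
SAMPLE = """@relation toy
@attribute age real
@attribute sex {1,0}
@attribute ca real
@data
50,1,2
61,0,?
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, attributes, rows)
```

The position of each attribute in `attributes` matches the column position of its value in every row, which is the ordering rule described in the text.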


Table 1 – Dataset Attributes Summary

age         Age in years – real
            Mean=54.439
            StdDev=9.039

sex         Sex – nominal
            1=male - 206 instances
            0=female - 97 instances

cp          Chest pain type – nominal
            1=typical angina - 23 instances
            2=atypical angina - 144 instances
            3=non-anginal pain - 86 instances
            4=asymptomatic - 50 instances

trestbps    Resting blood pressure – real
            Mean=131.69
            StdDev=17.6

fbs         Is fasting blood sugar > 120 mg/dl – nominal
            1=true - 45 instances
            0=false - 258 instances

restecg     Resting electrocardiographic results – nominal
            0=normal - 151 instances
            1=ST-T wave abnormality - 4 instances
            2=left ventricular hypertrophy - 148 instances

thalach     Maximum heart rate achieved – real
            Mean=149.607
            StdDev=22.875

exang       Exercise induced angina – nominal
            0=no - 204 instances
            1=yes - 99 instances

oldpeak     Depression induced by exercise – real
            Mean=1.04
            StdDev=1.161

slope       Nominal
            1=upsloping - 142 instances
            2=flat - 140 instances
            3=downsloping - 21 instances

ca          Number of major vessels – real
            Mean=0.672
            StdDev=0.937
            4 missing values

thal        Nominal
            3=normal - 166 instances
            6=fixed defect - 18 instances
            7=reversible defect - 117 instances
            2 missing values

num         Presence of heart disease – real
            Mean=0.937
            StdDev=1.229

chol        Cholesterol level – real
            Mean=246.693
            StdDev=51.777



2.0 Run the Algorithm


2.1 Loading the Data File



From the Windows desktop, click Start, choose “All Programs”, and select “Weka 3.6” to open the GUI
Chooser interface shown in Figure 2.

For this exercise, we will use the Explorer application.

Figure 2: GUI Chooser Interface

Click the Explorer button to open the Weka Explorer interface shown in Figure 3. By default, the Preprocess tab is
active. Since we have not loaded the data file, the attributes list and the selected attribute panel are empty.
The remaining tabs are greyed out.

Figure 3: Preprocess tab before opening the data file (callouts: only the Preprocess tab is active and the rest of the tabs are greyed out; the attributes list is empty; the status bar shows a welcome message; the number 0 next to the X means that no processes are currently running)

Click the Open file… button shown in Figure 4 and browse to the cholesterol.arff data file to open it.

Figure 4: Open the cholesterol.arff file

Once the data file is loaded, all tabs become available. The Current relation panel in Figure 5 displays
the relation name, number of instances, and number of attributes. The Selected attribute panel shows the
statistics for the first attribute, age, which is selected in the attribute list by default. Since the age attribute is
numeric, the statistics are minimum, maximum, mean, and StdDev.

The drop-down under the Selected attribute panel enables specifying the dependent variable, that is, the variable
to be predicted. Chol is selected by default because it is the last attribute. Under the drop-down is the
histogram showing the distribution of the age attribute values.
Figure 5: Preprocess tab after opening the data file (callouts: the Current relation panel shows that the cholesterol dataset has 303 instances with 14 attributes; the selected attribute age is numeric; click Visualize All to view the histograms for all attributes; the last attribute chol is selected as the dependent variable by default)

Click the Classify tab to open the interface shown in Figure 6. By default, the ZeroR algorithm is selected.
The chol attribute is selected as the dependent variable in the drop-down under the More options button.
Figure 6: Classify tab (callouts: clicking the Choose button opens the hierarchical menu of data mining algorithms; the ZeroR algorithm is selected by default; the Test options panel; the attribute to predict, with the last attribute chol selected by default)

Click the Choose button to expand the hierarchical menu shown in Figure 7. Expand the
classifiers folder, expand the functions folder, and select LinearRegression from the functions list.

Figure 7: Select LinearRegression function algorithm

2.2 Setting Test Options



Select Percentage split in the Test options panel shown in Figure 8, and keep the default 66% in the adjacent text
field.

Percentage split: The value in the ‘%’ field specifies the percentage of data to be used for building the
model (training data). By default, the value is set to 66%. After the model is built, the
remaining 34% of the data (test data) is used to test the accuracy of the model.
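
The effect of the percentage split on the 303 instances can be sketched as follows. This is an illustration under stated assumptions, not Weka's implementation: it assumes the data is shuffled before splitting and that the training size is rounded to the nearest whole instance. The function name `percentage_split` and the seed value are hypothetical.

```python
import math
import random


def percentage_split(instances, train_percent=66, seed=1):
    """Shuffle the instances and split them into training and test parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # assumed: randomize the order first
    # Assumed: round the training size to the nearest whole instance.
    train_size = math.floor(len(shuffled) * train_percent / 100 + 0.5)
    return shuffled[:train_size], shuffled[train_size:]


# Instance indices stand in for the 303 cholesterol records.
train, test = percentage_split(range(303))
print(len(train), len(test))
```

Under these assumptions, 66% of 303 instances gives 200 training instances and 103 test instances.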

Figure 8: Select percentage split test option (callouts: 66% of the dataset will be used as training data, and the remaining 34% will be used as test data; the algorithm output will be displayed in the classifier output area; a new entry will be added to the result list area after each algorithm run)

2.3 Setting Evaluation Options



Click More options under the Test options panel to open the interface shown in Figure 9. Make sure that the
check boxes next to the Output model, Store predictions for visualization, and Output predictions options
are checked.

Output model – if checked, the algorithm results will include the regression model.

Store predictions for visualization – if checked, the predicted values are saved so that they can be visualized.

Output predictions – if checked, the algorithm output will include the predicted values.

Notice that the Output per-class stats and Output confusion matrix options (the second and fourth
check boxes) are greyed out for a prediction algorithm.

Click OK to continue.

Figure 9: Classifier evaluation options

2.4 Algorithm Parameters



Click the textbox to the right of the Choose button to open the GenericObjectEditor dialog box, and make
sure the values match Figure 10.
1. Select M5 method for the attribute selection option.
2. Make sure that eliminate collinear attributes is set to true.
The available attribute selection methods are:
• No attribute selection – all attributes are used to build the model, regardless of statistical
significance.
• M5 method – during the initial iteration, all attributes are used to construct the model. The
attributes with the lowest-ranking coefficients are iteratively removed until the change in error rate is
insignificant. The final model includes the attributes that affect the accuracy of the model (the statistically
significant ones).
• Greedy method – unlike the M5 method, the first iteration starts from an empty subset. As different
combinations of attributes are examined, an attribute can be added or removed at each iteration.
Collinearity is a high correlation among predictors. Setting eliminateColinearAttributes=true enables the
algorithm to eliminate the collinear attributes.
Figure 10: Specify LinearRegression parameters in Generic Object Editor (callouts: click More to read about the algorithm parameters; choose the M5 attribute selection method; select true to eliminate collinear attributes; keep the default for the parameter that shrinks the coefficient values to reduce overfitting; then click to continue)

Click Start to run the algorithm. The algorithm results will be displayed in the classifier output panel shown in
Figure 11. The result list panel has a new entry.

Figure 11: Classifier output panel after running the algorithm.

Right-click the last entry in the result list in the bottom left panel to open the popup menu, select Save result
buffer, and save the file as resultbuffer1.txt.

Figure 12: Save result buffer

3.0 Analyzing Result

3.1 Run Information



Run information in Figure 14 includes

• Scheme – weka.classifiers.functions.LinearRegression
• Relation name – cholesterol
• Number of instances – 303
• Number of attributes – 14
• Attributes list, including independent and dependent attributes
• Test mode – percentage split with 66% of data used for training

Figure 14: Run Information

3.2 Model

The algorithm generates a linear function, which is the weighted sum of the independent attributes.

$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$$

where y is the dependent variable, α is the intercept, x1 – xn are the independent variables, and β1 – βn are the
coefficients.

Each coefficient β is the change in cholesterol level when the value of the corresponding independent variable
is increased by 1 while the values of the other variables remain constant. For example, the coefficient 1.0949 for
the age attribute means that increasing the age by a year adds 1.0949 to the cholesterol level.

Although the dataset has 13 independent attributes, only the age, sex, restecg, thalach, and thal attributes
are used in the model in Figure 15 because we chose the M5 attribute selection method. It means that the
omitted attributes cp, trestbps, fbs, exang, oldpeak, slope, ca, and num do not significantly affect the
cholesterol level.
Figure 15: Linear regression model. The annotations on the figure read:
• The dependent variable (y) is chol.
• Each year of age adds 1.0949 to the cholesterol level.
• The cholesterol level is 24.0828 higher if a person is female (sex=0).
• restecg=2 or 1 adds 16.0357 to the cholesterol level.
• Increasing thalach by 1 adds 0.2328 to the cholesterol level.
• The cholesterol level is 12.9793 higher when thal=7.
• The constant term 131.4592 is the intercept.

3.3 Predictions on Test Split



After constructing the “best fit” model from the training data, we use the model to estimate the
cholesterol level for the instances in the test data. To get the predicted cholesterol for an instance, we
can substitute the attribute values into the model expression. A term such as sex=0 equals 1 when the
condition holds for the instance and 0 otherwise.

chol = 1.0949 * age + 24.0828 * (sex=0) + 16.0357 * (restecg=2,1) + 0.2328 * thalach + 12.9793 * (thal=7) + 131.4592

For example, for an instance with age=60 and thalach=155, where sex=0 is false, restecg=2,1 is true, and thal=7 is false:

chol = 1.0949 * 60 + 24.0828 * 0 + 16.0357 * 1 + 0.2328 * 155 + 12.9793 * 0 + 131.4592 = 249.2729
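
The substitution can also be written as a short Python function. This is a sketch that hard-codes the coefficients quoted above from Figure 15; the function name `predict_chol` and the example attribute values for sex, restecg, and thal are assumptions, chosen so that the indicator terms evaluate to 0, 1, and 0 as in the worked example.

```python
INTERCEPT = 131.4592


def predict_chol(age, sex, restecg, thalach, thal):
    """Evaluate the regression expression for one instance."""
    is_female = 1 if sex == 0 else 0                # indicator for sex=0
    restecg_flag = 1 if restecg in (1, 2) else 0    # indicator for restecg=2,1
    thal_flag = 1 if thal == 7 else 0               # indicator for thal=7
    return (1.0949 * age
            + 24.0828 * is_female
            + 16.0357 * restecg_flag
            + 0.2328 * thalach
            + 12.9793 * thal_flag
            + INTERCEPT)


# Assumed example instance: a 60-year-old male with restecg=2, thalach=155, thal=3.
estimate = predict_chol(age=60, sex=1, restecg=2, thalach=155, thal=3)
print(round(estimate, 4))
```

Each nominal term contributes either its full coefficient or nothing, while the numeric attributes age and thalach contribute in proportion to their values.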

The algorithm output includes the predicted and actual cholesterol level for each instance in the test
set. Figure 16 shows the first 13 instances. The error is the predicted value minus the actual value. If we graph the
model and the data point corresponding to an actual cholesterol level, the prediction error is the vertical
distance between the line and the point.

The error is positive when the predicted value is higher than the actual value. The error is negative when
the predicted value is below the actual value.
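
The sign convention can be checked with a few lines of Python. The first pair of values below is the one annotated in Figure 16 (predicted 254.196, actual 185); the second pair and the function name `prediction_error` are hypothetical.

```python
def prediction_error(predicted, actual):
    """Signed error as defined above: predicted value minus actual value."""
    return predicted - actual


pairs = [(254.196, 185.0), (230.0, 250.0)]  # (predicted, actual)
for predicted, actual in pairs:
    error = prediction_error(predicted, actual)
    sign = "positive" if error > 0 else "negative" if error < 0 else "zero"
    print(f"predicted={predicted} actual={actual} error={error:.3f} ({sign})")
```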
Figure 16: Predictions on test split (callouts: the columns show the actual cholesterol level, the predicted cholesterol level, and the error; for example, a predicted value of 254.196 and an actual value of 185 give a positive error of 254.196 - 185 = 69.196)

3.4 Evaluation on Test Split



The evaluation on test split section of the algorithm output in Figure 17 includes the error measures based on
the difference between the actual and estimated cholesterol levels. The goal is to minimize the errors by
building the model with the minimum average difference between predicted and actual values.

The correlation coefficient is the correlation between the estimated and actual cholesterol levels. The
value ranges between -1 and 1. The magnitude of the correlation indicates the strength of the linear relationship
between the predictors and the cholesterol level dependent variable. Hence, the relationship is the
strongest as the correlation approaches -1 or 1. In our case, the correlation is 0.2202, which indicates a weak
linear relationship.

Although the value 0 indicates the absence of a linear relationship between the predictors and the dependent
variable, it does not indicate the absence of a relationship in general. We would need to consider
non-linear models, such as quadratic regression and/or logarithmic regression.

The number of instances in the test data is 103, which is 34% of the 303 instances in the dataset.

Figure 17: Error metrics, correlation, and number of instances
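
As a rough illustration of how such measures are computed from predicted and actual values, the Python sketch below calculates the mean absolute error, the root mean squared error, and the correlation coefficient. These are common regression measures and are assumed here to correspond to the ones shown in Figure 17. The sample values and function names are hypothetical and do not reproduce the results of this exercise.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between predicted and actual."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    squared = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def correlation_coefficient(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical actual and predicted cholesterol levels.
actual = [200.0, 220.0, 240.0, 260.0]
predicted = [210.0, 215.0, 250.0, 255.0]
print(mean_absolute_error(actual, predicted))
print(root_mean_squared_error(actual, predicted))
print(correlation_coefficient(actual, predicted))
```

A correlation close to 1 together with small error values would indicate predictions that track the actual values closely.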



4.0 Results Visualization


Right-click the result list entry to open the popup menu, and select Visualize classifier errors, as shown in
Figure 18.

Figure 18: Select Visualize classifier errors

Make sure that predictedchol is selected for the Y-axis and chol is selected for the X-axis. Each instance
is represented as an X. The size of the X indicates the magnitude of the difference between the predicted and actual
cholesterol level.

The cholesterol level value at the origin is the intercept α in the model above. The instances represented
as the smallest X marks form a line. The X marks are larger when they are further away from the line.

Figure 19: Predicted cholesterol vs. actual cholesterol (callouts: chol is selected for the X-axis and predictedchol for the Y-axis; a larger X indicates a greater vertical distance from the line; the instances represented as the smallest X form a line; each strip shows the distribution of values for the corresponding attribute; the intercept is α=131.4592)

Click an X to view the corresponding instance information. The instance information in Figure 20
includes the algorithm name, instance number, independent attribute values, predicted cholesterol level, and
actual cholesterol level.

Figure 20: Instance info
