Decision Tree in R Programming Language
R is a programming language for statistical computing and data
visualization. It has been adopted in the fields of data
mining, bioinformatics and data analysis.
The core R language is augmented by a large number of extension
packages, containing reusable code, documentation, and sample
data.
R is free and open-source software, developed as part of the GNU
Project and available under the GNU General Public License. It is
written primarily in C, Fortran, and R
itself. Precompiled executables are provided for various operating
systems.
Why use R Programming?
There are several tools available in the market to perform data
analysis, and learning a new language takes time.
Data scientists have two excellent tools at their disposal: R and Python.
A data scientist's job is to understand the data, manipulate it, and
identify the best approach.
For machine learning, the best algorithms can be implemented
with R.
R communicates with other languages and can call Python,
Java, and C++. The big data world is also accessible to R:
we can connect R to big data frameworks such as Spark and Hadoop.
EXAMPLE
Let us consider a scenario where a medical company wants to
predict whether a person will die when exposed to a virus.
The important factor determining this outcome is the strength of the
person's immune system, but the company does not have this information.
Since this is an important variable, a decision tree can be
constructed to predict the immune strength based on factors like
sleep cycles, cortisol levels, supplement intake, and nutrients derived
from food, all of which are continuous variables.
Working of a Decision Tree in R
Partitioning:
It refers to the process of splitting the data set into subsets.
Deciding where to make strategic splits greatly affects the accuracy of
the tree.
Several criteria can be used to split a node into sub-nodes, each
aiming to increase the purity of the resulting nodes with
respect to the target variable.
Various splitting criteria, such as the chi-square statistic and the Gini
index, are used for this purpose, and the split that scores best is chosen.
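To make the idea concrete, here is a minimal R sketch of the Gini index computation for a candidate split (the function names gini and split_gini are my own, for illustration only):

    # Gini impurity of a vector of class labels: 1 - sum(p_i^2),
    # where p_i is the proportion of class i in the node
    gini <- function(labels) {
      p <- table(labels) / length(labels)
      1 - sum(p^2)
    }

    # Weighted Gini impurity of the two child nodes produced by a split
    split_gini <- function(labels, condition) {
      left  <- labels[condition]
      right <- labels[!condition]
      w <- length(left) / length(labels)
      w * gini(left) + (1 - w) * gini(right)
    }

    # Example: evaluate the split Petal.Length <= 2.45 on the built-in iris data
    split_gini(iris$Species, iris$Petal.Length <= 2.45)

The candidate split with the lowest weighted impurity (equivalently, the largest impurity decrease) is preferred.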
Pruning:
This refers to the process wherein branch nodes are turned into
leaf nodes, shortening the branches of the tree.
The idea is that simpler trees avoid overfitting: a very complex
classification tree may fit the training data
well but do an underwhelming job of classifying new values.
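As a concrete illustration, pruning is easy to demonstrate with the rpart package (a different tree implementation from the ctree used later in this example); this is only a sketch, assuming rpart is installed:

    library(rpart)

    # Grow a full classification tree on the built-in iris data
    fit <- rpart(Species ~ ., data = iris, method = "class")

    # Inspect the cross-validated error for each complexity parameter (cp)
    printcp(fit)

    # Prune back to the cp value with the lowest cross-validated error,
    # yielding a smaller tree that should generalize better
    best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
    pruned  <- prune(fit, cp = best_cp)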
Selection of the tree:
The main goal of this process is to select the smallest tree that fits
the data, for the reasons discussed in the pruning section.
Important factors to consider while
selecting the tree in R
Entropy:
Entropy is mainly used to measure the homogeneity of a given sample.
If the sample is completely homogeneous, the entropy is 0; if it is
evenly split between two classes, the entropy is 1.
The higher the entropy, the more difficult it becomes to draw conclusions
from that information.
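For a sample with class proportions p_1, ..., p_k, entropy is computed as -sum(p_i * log2(p_i)). A minimal R sketch (the function name entropy is my own):

    # Entropy of a vector of class labels, in bits
    entropy <- function(labels) {
      p <- table(labels) / length(labels)
      p <- p[p > 0]            # drop empty classes to avoid log2(0)
      -sum(p * log2(p))
    }

    entropy(c("yes", "yes", "yes", "yes"))   # completely homogeneous: 0
    entropy(c("yes", "yes", "no", "no"))     # evenly split: 1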
Information Gain:
A statistical property which measures how well a given attribute
separates the training examples with respect to the target classification.
The main idea behind constructing a decision tree is to find, at each
split, the attribute that yields the smallest entropy and hence the highest
information gain.
Information gain measures the decrease in total entropy: it is
calculated as the difference between the entropy
before the split and the weighted average entropy after splitting the
dataset on the given attribute's values.
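In symbols: Gain(S, A) = Entropy(S) - sum over each value v of A of (|S_v| / |S|) * Entropy(S_v). A sketch in R, reusing the entropy function from above (the helper name info_gain is my own):

    # Entropy of a vector of class labels, in bits
    entropy <- function(labels) {
      p <- table(labels) / length(labels)
      p <- p[p > 0]
      -sum(p * log2(p))
    }

    # Information gain from splitting `labels` on a categorical `attribute`
    info_gain <- function(labels, attribute) {
      groups  <- split(labels, attribute)                 # subsets S_v
      weights <- sapply(groups, length) / length(labels)  # |S_v| / |S|
      entropy(labels) - sum(weights * sapply(groups, entropy))
    }

    # Toy example: how well does `outlook` separate `play`?
    outlook <- c("sunny", "sunny", "rain", "rain", "overcast")
    play    <- c("no",    "no",    "yes",  "yes",  "yes")
    info_gain(play, outlook)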
R – Decision Tree Example
Let us now examine this concept with the help of an example, using
the well-known "readingSkills" dataset (shipped with the party package):
we will build a decision tree for it, visualize it, and examine its accuracy.
Installing the required libraries
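Assuming the packages used below are not yet installed (party provides ctree and the readingSkills dataset; caTools provides sample.split for the train/test split):

    install.packages("party")     # ctree() and the readingSkills dataset
    install.packages("caTools")   # sample.split() for the train/test split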
Import the required libraries, load the dataset
readingSkills, and execute head(readingSkills)
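A minimal sketch of this step:

    library(party)
    library(caTools)

    data("readingSkills")     # dataset shipped with the party package
    head(readingSkills)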
As you can see, there are 4 columns: nativeSpeaker, age, shoeSize,
and score. We are going to predict whether a person is
a native speaker using the other variables, and then assess the accuracy
of the resulting decision tree model.
Splitting the dataset in a 4:1 ratio into train and test data
Separating data into training and testing sets is an important part of
evaluating data mining models. After a model has been fitted using the
training set, you test it by making predictions against the test set.
Because the testing set already contains known values for the attribute
you want to predict, it is easy to determine whether the model's guesses
are correct.
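A sketch of a 4:1 (80/20) split using caTools::sample.split; the variable names train_data and test_data are my own and are reused in the steps below:

    set.seed(42)   # fix the random seed so the split is reproducible

    split <- sample.split(readingSkills$nativeSpeaker, SplitRatio = 0.8)
    train_data <- subset(readingSkills, split == TRUE)
    test_data  <- subset(readingSkills, split == FALSE)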
Create the decision tree model using
ctree and plot the model
The basic syntax for creating a decision
tree in R is:
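    ctree(formula, data)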
where, formula describes the predictor and response variables and data
is the data set used.
In this case, nativeSpeaker is the response variable and the remaining
columns are the predictor variables, represented by "." in the formula.
When we plot the model, we get the following output.
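A sketch of this step, continuing with train_data from the split above:

    model <- ctree(nativeSpeaker ~ ., data = train_data)
    plot(model)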
OUTPUT

From the tree, it is clear that people with a score less than or equal
to 31.08 and an age less than or equal to 6 are not native speakers,
while those with a score greater than 31.08 under the same criteria
are found to be native speakers.
Making a prediction
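A sketch of this step, using the model and test_data from above; table() builds the confusion matrix of actual versus predicted classes:

    predictions <- predict(model, test_data)

    # Confusion matrix: rows are actual classes, columns are predictions
    conf_mat <- table(test_data$nativeSpeaker, predictions)
    conf_mat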
OUTPUT
The model has correctly predicted 13 people to be non-native speakers but
classified an additional 13 as non-native, and it has
misclassified none of the people as native speakers when actually
they are not.
Determining the accuracy of the model
developed
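A sketch of the calculation, using the confusion matrix conf_mat from the previous step (correct predictions lie on the diagonal):

    accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
    print(paste("Accuracy of the model on the test set:", accuracy))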
Here the accuracy is calculated from the confusion matrix and is
found to be 0.74. Hence this model is found to predict with an
accuracy of 74%.
Inference
Thus, decision trees are very useful algorithms: they are used not only
to choose between alternatives based on expected values, but also for
classification and for making predictions.
It is up to us to verify that such models are accurate enough for the
application at hand.
Advantages of Decision Trees
Easy to understand and interpret
Does not require data normalization
Does not require scaling of data
The pre-processing stage requires less effort compared to other
major algorithms, which simplifies the overall workflow
Disadvantages of Decision Trees
Requires more time to train the model
Has considerably high complexity and takes more time to process
the data
Tree growth terminates as soon as the improvement from a further split
falls below a small user-specified threshold
Calculations can get very complex at times