Instructions For Using R To Create Predictive Models v5
Download R
Main project page: https://cran.r-project.org/
Windows download: https://cran.r-project.org/bin/windows/base/
Mac download: https://cran.r-project.org/bin/macosx/
RStudio download: https://www.rstudio.com/products/rstudio/download/
R Commander: http://www.rcommander.com/
MRAN (Revolution Analytics): http://mran.revolutionanalytics.com/download/
Read the download notes to install the correct version of the software for your system (i.e., Windows,
Mac, or Linux; 32- or 64-bit operating system). Before installing any software, it is advisable to scan the
downloaded package for viruses and malware using your anti-virus software and other tools such
as Malwarebytes to ensure the installation package has not been spoofed.
Navigating RStudio
The RStudio interface has four panes (see image below):
Source Pane = R code is written, saved, and run from this pane.
Console Pane = Commands and output are shown in this pane. R commands can also be written directly
in this pane and it has all the functionality of the basic R terminal.
Workspace Pane = Shows datasets and objects created during an analysis. This workspace can be saved
separately from the code in the Source Pane and allows previously imported data, subsets, parameters,
and models to be reloaded readily without rerunning source code.
View Pane = Shows graphical output and allows it to be exported; also provides access to help
documents, loading of packages from the library without code, and other information.
Note these are not official RStudio designations and some tutorials use different names for these panes.
Across the top of the panes is a menu bar that allows many commands to be accessed without utilizing
code. For example, under “Plots” are options to save graphical output and under “Session” are options
to save the workspace. RStudio is similar to other software in that there are often multiple ways to
execute the same command or function. Some RStudio tutorial resources include:
https://support.rstudio.com/hc/en-us/sections/200271437-Getting-Started
http://dss.princeton.edu/training/RStudio101.pdf
http://web.cs.ucla.edu/~gulzar/rstudio/basic-tutorial.html
https://www.youtube.com/watch?v=5YmcEYTSN7k
https://github.com/rstudio/webinars
Step-by-Step Instructions
1. Launch RStudio
2. Upload the data into the workspace.
The command to read in the data depends on where the working directory is set.
a. Set the working directory so R will point to the correct file location:
EX: setwd("C:/Users/Me/Documents/MMAP")
Use the following command to upload the data file into the workspace and name the
data set in the workspace. You can also specify the full path to the file if you have not set a
working directory.
Example:
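Since the original example is not reproduced in this excerpt, the following is a minimal sketch
consistent with the command breakdown below; the file name "MMAP_math.csv" is illustrative and
should be replaced with your actual file name:
# minimal sketch; replace "MMAP_math.csv" with the name of your MMAP data file
MMAPMath <- read.csv("MMAP_math.csv",
                     row.names = NULL,
                     header = T,
                     stringsAsFactors = FALSE)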
Command breakdown
o MMAPMath <- assigns the dataset read in on the right to the name MMAPMath
o row.names = NULL avoids issues in svm with duplicated row names
o header = T identifies the first row of the file as the header/variable
names
o stringsAsFactors = FALSE keeps character variables as they are instead
of converting them into factors with specific levels. Variables can be
converted to factors later if needed.
o Note you can also use read.delim to load tab-delimited files.
3. Try Basic Commands
Now that you have your data file loaded, let's review some basic commands for gaining familiarity with R and
your data set. Most commands have optional arguments, such as adding labels to plots. To
obtain help on a command, type ? before the command. For example, ?plot will open the help page
for the plot command. Web searches can also be helpful for examples and additional command options.
These commands work for command line R and many have menu driven alternatives in overlay
interfaces such as RStudio. Examples use actual variable names from the latest data set available but
note that variable names may have changed subsequent to the posting of this guide and should be
adjusted accordingly.
R has UNIX roots and is case-sensitive: commands and variable names must match case exactly. This
tutorial will show primarily lower case variable names to reflect the latest version of MMAP
analysis files. However, if an error is encountered, check the case of the variable name.
Comments begin with a hashmark (#)
R operates on objects that are assigned using <-
o x <- 2 assigns the value of 2 to an object called x that can be used in subsequent
commands. Data sets, tables, formulae, model fits, and more can be assigned to an
object in this way.
Individual variables within a data set are referenced with the dataset$variable syntax.
Example:
MMAPMath$hs_12_gpa
To view the variable names and types, use the following command:
str(MMAPMath)
edit(MMAPMath) # opens the data set in a spreadsheet-style editor
summary(MMAPMath$hs_12_gpa) # summary statistics for a single variable
hist(MMAPMath$hs_12_gpa) # histogram of a single variable
plot(MMAPMath$hs_12_gpa,MMAPMath$cc_first_course_grade_points) # scatterplot of two variables
table(MMAPMath$cc_first_level_rank) # frequency table of a single variable
A table can also be assigned to an object. First a two-way table of high school last course rank by
first community college course level rank will be loaded into the R workspace as an object called
"table1" that can be called in subsequent commands:
table1 <- table(MMAPMath$hs_last_course_rank,MMAPMath$cc_first_level_rank)
Entering 'table1' into the command line will call the table. Any number, table, plot, data set, etc. can be
loaded into an object for later use. Loading a new value into an existing object will overwrite the
previous item without warning.
Writing the table to a tab-delimited text file that can be opened with Excel:
write.table(object,"~/sub-directory/file.txt",sep="\t")
Example:
write.table(table1,"~/MathTables/MathFirstCCRank.txt",sep="\t")
A bar plot of the table can be produced with:
barplot(table1)
For more advanced graphics, the ggplot2 package can be used. Some ggplot2 resources:
http://www.statmethods.net/advgraphs/ggplot2.html
https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf
https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
To install rCharts, follow these instructions in the R console:
install.packages('devtools')
require('devtools')
install_github('ramnathv/rCharts')
4. Install and load the packages used for the MMAP models:
a. rpart: Classification and Regression Tree (CART) modeling
b. rpart.plot: Tool to create visual graphics of the CART
c. dplyr: filtering and data management tool
install.packages("rpart")
install.packages("rpart.plot")
install.packages("dplyr")
You will likely be asked to select a download mirror. Select the mirror closest to you (there are
two in Berkeley, CA). Once the packages are installed, you must load them from your library:
library(rpart)
library(rpart.plot)
library(dplyr)
In RStudio, you can also select the "Packages" tab in the View pane and click the checkbox next to a
package to load it.
hs_gpa_present_by_grade
This is a four-position indicator variable where each position indicates whether a GPA is present for
a given grade level (9th through 12th).
Examples:
If hs_gpa_present_by_grade equals 1010, the student only has GPAs for 9th and 11th grade.
If hs_gpa_present_by_grade equals 1110, the student has GPAs for 9th, 10th, and 11th grade but not
their senior year.
If hs_gpa_present_by_grade equals 1111, the student has GPAs available for all four years.
cc_first_level_rank
This variable provides the number of levels below transfer of the first attempted course
at community college.
cc_first_level_rank
Value  Description                   CB21
0      Transfer level or above       Y
1      One level below transfer      A
2      Two levels below transfer     B
3      Three levels below transfer   C
4      Four levels below transfer    D
5      Five levels below transfer    E
b. General Subsetting
To create a data subset of students who have four years of high school GPAs and attempted a
specific level of college course in a discipline, use the following command (note there
are several ways to subset data):
Command Breakdown
o == a double equal sign must be used to indicate that the value must be
exact
o The filter command in the dplyr package was used in this case but other
approaches exist
Example:
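Since the original example is not reproduced in this excerpt, the following is a minimal sketch using
the filter command from dplyr; the subset name m0.Statistics matches later sections, and any additional
conditions (such as the specific course attempted) would use variables from your actual file:
# minimal sketch, assuming the resulting subset is named m0.Statistics;
# hs_gpa_present_by_grade may be stored as character ("1111") in some files
m0.Statistics <- filter(MMAPMath,
                        hs_gpa_present_by_grade == 1111,
                        cc_first_level_rank == 0)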
c. Recursive Subsetting
A key difference between Phase I and Phase II of the MMAP project was the
introduction of recursive subsetting of data for lower levels that account for placement
rules applied at the higher levels. For example, for non-STEM (Science, Technology,
Engineering, and Math) math, the Phase II rules found that, for direct matriculants (using data up
through 11th grade), a high school GPA of 3.0 or higher, or of 2.3 or higher with a C or better in
Pre-Calculus, corresponded with being highly likely to succeed in Statistics.
Therefore, when building a subset to model for one level below transfer or Intermediate
Algebra, students meeting either Statistics placement rule are filtered out as they would
already have been placed. Then the remaining students would be modeled for
placement into Intermediate Algebra or equivalent. The rules for one level below
transfer level placement would then be applied to the filter for modeling two levels
below transfer and so on.
Example:
#filter for predicting success in one level below transfer math with filters for placements into STEM and
non-STEM transfer level math in two steps
#step 1: create a filter object (m1) containing students at the target level
#step 2: use the m1 filter object as the first argument then add exclusions for higher level placement
#rules using the not (!) expression
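As the original code block is not reproduced here, below is a minimal sketch under stated assumptions:
the object names m1 and m1.IntAlgebra are illustrative, and only the non-STEM (Statistics) rules
described above are excluded; the actual MMAP filters also excluded students meeting the STEM
transfer-level rules.
# step 1 (illustrative): students with four years of high school GPA whose first
# math course was one level below transfer
m1 <- filter(MMAPMath,
             hs_gpa_present_by_grade == 1111,
             cc_first_level_rank == 1)
# step 2: start from m1 and exclude students already placed by the transfer-level
# Statistics rules (GPA >= 3.0, or GPA >= 2.3 with a C or better in Pre-Calculus)
m1.IntAlgebra <- filter(m1,
                        !(hs_11_gpa >= 3.0),
                        !(hs_11_gpa >= 2.3 & pre_calc_any_c == 1))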
A typical set of control parameters used as a starting place for MMAP modelling has been:
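The specific values are not reproduced in this excerpt; a sketch of the general form follows, using the
object name ctrl0015 referenced later in this guide. The cp value of 0.0015 is inferred from that object
name, and the other values are placeholders (the rpart defaults) rather than the values used by the
MMAP team:
# placeholder values; adjust to your own modeling needs
ctrl0015 <- rpart.control(minsplit = 20,
                          minbucket = 7,
                          cp = 0.0015,
                          xval = 10)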
Larger values for cp result in simpler trees with fewer branches while smaller values for cp
result in more complex trees with more branches. It is recommended to run several models
with different cp values.
Below is an excerpt from Therneau & Atkinson (1997) describing the control parameters
shown above in detail. The document below shows all possible parameters.
https://cran.r-project.org/web/packages/rpart/rpart.pdf
minsplit: The minimum number of observations in a node for which the routine
will even try to compute a split. The default is 20. This parameter can save
computation time, since smaller nodes are almost always pruned away by cross-
validation.
minbucket: The minimum number of observations in a terminal node. This
defaults to minsplit/3.
xval: The number of cross-validations to be done. Usually set to zero during
exploratory phases of the analysis. A value of 10, for instance, increases the
compute time to 11-fold over a value of 0.
cp: The threshold complexity parameter. The complexity parameter cp is, like
minsplit, an advisory parameter, but is considerably more useful. It is specified
according to the formula:
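The formula itself is not reproduced in this excerpt. As given in the rpart documentation by Therneau &
Atkinson, it takes approximately the form
R_cp(T) = R(T) + cp * |T| * R(T1)
where R(T) is the risk (error) of tree T, |T| is the number of splits, and T1 is the tree with no splits; a
split is retained only if it improves the fit by at least a factor of cp.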
The code below shows the basic structure of running a regression tree model:
Example:
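As the original code block is not reproduced here, the following sketch is modeled on the grade-points
example later in this guide. The dependent variable name cc_first_course_success_ind is an assumption
and should be replaced with the actual binary success indicator in your file; the control object is the
one sketched above (referred to generically as ctrl in the breakdown below):
# sketch only; cc_first_course_success_ind is an assumed name for the success indicator
fit.m0.Statistics.DM <- rpart(formula = cc_first_course_success_ind ~
  hs_11_gpa + pre_alg_any_c + alg_i_any_c + alg_ii_any_c + geo_any_c + trig_any_c +
  pre_calc_any_c + calc_any_c + stat_any_c + math_eap_ind + ap_any_c
  ,data = m0.Statistics
  ,method = "poisson"
  ,control = ctrl0015)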
Command Breakdown
o rpart() calls the rpart function from the rpart package to run the analysis
o ~ separates the Dependent variable from the Predictor/Independent Variables in the
model
o + Identifying the additional Predictor/Independent Variables to include in the model
o data = m0.Statistics identifies the specific data set the model should be modeled on.
Replace this value for each data set that will be modeled using the regression tree
analysis.
o method = "poisson" identifies the model type and depends on the data
structure of the dependent variable. The string passed to rpart is lower case:
"class" = used for categorical dependent variables
"anova" = used for continuous dependent variables (used by the MMAP team in
Phase I when grade points were predicted)
"poisson" = used for counts of events in a time frame, such as survival data (used by
the MMAP team in Phase II with a binary success indicator, which resulted in more
interpretable trees than using "class")
"exp" = exponential scaling, which can also be used for survival data with different
distributional assumptions
o control = ctrl identifies the control parameters for the model. See #7 for the specific
parameters.
For more information about the control parameters in rpart see:
https://stat.ethz.ch/R-manual/R-devel/library/rpart/html/rpart.control.html
An alternative approach is to save the formula as an object and then reference that object in the
rpart model:
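The original example is not reproduced here; a simplified sketch follows, reusing the illustrative
success indicator from above and an assumed name (hs_last_course_grade_points) for the grade in the
last high school math course:
# sketch: save a simplified formula as an object, then pass it to rpart()
stat.formula <- cc_first_course_success_ind ~ hs_11_gpa + hs_last_course_grade_points +
  alg_i_any_c + alg_ii_any_c + pre_calc_any_c + ap_any_c + math_eap_ind
fit.m0.Statistics.DM <- rpart(stat.formula,
                              data = m0.Statistics,
                              method = "poisson",
                              control = ctrl0015)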
Note the formula shown here is a simplified version of the formulae used in the MMAP project.
The formula above includes high school GPA, grade in the last math course, a series of indicators for
completing various levels of math with a C or better in high school, an indicator for completing
any advanced placement (AP) math with a C or better, and an indicator of achieving "college
ready" on the Early Assessment Program (EAP) test. The MMAP team also included variables for other
grades earned, such as B or better or any grade, in addition to C or better.
View the output using the following commands:
printcp(ModelName)
print(ModelName)
rsq.rpart(ModelName)
Example:
printcp(fit.m0.Statistics.DM)
print(fit.m0.Statistics.DM)
rsq.rpart(fit.m0.Statistics.DM)
Command Breakdown
o printcp() prints the complexity parameter (cp) table for the fitted model
o print() prints the output of the analysis and provides a summary of the
branches
o rsq.rpart() prints plots showing the R-square change for each split
d. Plot the tree
To obtain a visual of the model (decision tree), use the following command:
prp(ModelName, options)
Example:
prp(fit.m0.Statistics.DM,main="Statistics DM",extra=100,varlen=0,left=FALSE)
prp() is a command to plot the decision tree based on a specific analysis
extra = 100 is one of many options you can include to customize the tree display;
in this case it shows the predicted value and the percent of the sample in the
node.
varlen = 0 forces the entire variable name to be printed
left = FALSE makes rules flow to the right.
http://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf
http://www.milbo.org/rpart-plot/prp.pdf
f. Interpreting the Tree
The tree depicted above includes all students in the California Community College (CCC) System
who had four full years of high school data available in CalPASS Plus and who enrolled at a
community college directly out of high school (i.e. direct matriculants or DM) and whose first
math class at a CCC was Statistics. The decision tree algorithm sorts students into distinct
groups, or nodes, based on what their high school performance can tell us about their
likelihood of succeeding in Statistics at college. In this case, the algorithm creates distinct
groups by parsing students out according to their cumulative high school GPA as well as Pre-
Calculus and Algebra II enrollments.
The root node (node 1) splits on high school 11th grade cumulative GPA with the right hand
path including students having a GPA equal to or greater than 3.0 with higher predicted rates of
success in college statistics. To the left are students with a cumulative high school GPA less than
3.0. Cumulative GPA appearing in the root node at the highest level of the tree suggests GPA is
the best predictor of success in college statistics among the variables in the data set.
Reading down the right side of the tree includes students who had an 11th grade cumulative
GPA of 3.0 or higher in high school and enrolled in statistics in college. The first internal node
(node 2) again splits on GPA, with students with a 3.3 GPA or higher on the right branch (note
that node numbering was set to go from highest to lowest success rates and differs from the
node numbers assigned within R). At the end of the branch is a terminal node or leaf. This leaf
contains 30% of the student population and has a predicted likelihood of success in college
statistics of 0.88 or 88%. This pathway results in a rule or interpretation that could be phrased
as “students with a cumulative high school GPA of B+ or better are highly likely to succeed in
college statistics.”
If we follow this same internal node to the left to node 4, we are looking at students with a GPA
of 3.0 or more but less than 3.3 and considering whether or not they enrolled in Pre-Calculus.
This is an indicator variable with 0 indicating the student did not attempt Pre-Calculus or 1
indicating the student did enroll in Pre-Calculus. The node 4 rule asks if the value for a student’s
PRE_CALC_UP11 variable is greater than or equal to 0.5, so a "yes" means the
student's value is 1 and the student did enroll in Pre-Calculus, which leads down the right branch to node 5
(note this tree was run with an older data set that contained all-caps variable names). This node
contains 8% of the sample with an estimated 0.81 or 81% probability of success in college
statistics. If the student’s PRE_CALC_UP11 is less than 0.5, then the value is 0 and indicates the
student did not attempt Pre-Calculus and leads down the left branch to node 6. This node
contains 16% of the sample with an estimated 0.7 or 70% probability of success in college
statistics.
If you start back up at the root and travel down the left side of the tree which forks at less than
a 3.0 GPA you will follow the same process of interpreting the nodes and leaves. All leaves
combined represent 100% of the sample that was used to generate the decision tree output
and the rule set for Statistics for direct matriculants. It should be noted that math courses also
have suggested minimum course requirements. For Statistics, this is passing at least Algebra I or
a higher course prior to enrolling in college level Statistics. The data file from CalPASS Plus
checks the minimum course requirement for all math courses; however, if your college deviates
from these models, you will want to check students' transcripts to confirm this recommendation
is met.
The rule sets for this tree are based only on those nodes which meet or exceed the criterion of
0.70, or a 70% success rate or higher. This rate is often higher than most colleges’ success rate
in this course. If a college would like to relax this criterion to, say, 0.55 or 55%, it could
also include nodes 10 and 11 by allowing students into Statistics with a 2.3 GPA or greater as
well as those completing Algebra II with a C or better.
Advanced Tips and Tricks
Run model predicting grade points
In Phase 1, the MMAP team ran models that predicted grade points rather than a binary success
indicator. In Phase 2, binary success indicators with the Poisson method were used as they provided
more interpretable trees. To run a tree predicting grade points, you must use the "anova" method to
handle a continuous dependent variable, which in this example is the grade points earned in the
first community college statistics course attempted. The parts of the code that change are the
dependent variable, the method, and the name of the fitted model object.
Run Model
fit.m0.Statistics.DM.gp <- rpart(formula = cc_first_course_grade_points ~
hs_11_gpa + pre_alg_any_c + alg_i_any_c + alg_ii_any_c + geo_any_c + trig_any_c +
pre_calc_any_c + calc_any_c + stat_any_c + math_eap_ind + hs_exit_subj_to_cc_entry_subj + ap_any_c
,data = m0.Statistics
,method="anova"
,control=ctrl0015)
View Output
printcp(fit.m0.Statistics.DM.gp)
print(fit.m0.Statistics.DM.gp)
rsq.rpart(fit.m0.Statistics.DM.gp)
prp(fit.m0.Statistics.DM.gp,main="Statistics with Grade Points, DM",extra=100,varlen=0,left=FALSE)
Other model types can be explored with the caret package and its companion packages:
install.packages("caret") #core package for caret
install.packages("e1071") #additional package that fixes errors that arise with some models
Tutorials on caret:
http://www.edii.uclm.es/~useR-2013/Tutorials/kuhn/user_caret_2up.pdf
https://www.youtube.com/watch?v=7Jbb2ItbTC4
List of the methods in the caret package:
http://topepo.github.io/caret/modelList.html
caret training site:
http://topepo.github.io/caret/training.html
Additional information on training parameters:
http://www.inside-r.org/packages/cran/caret/docs/train
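This guide does not reproduce a caret example, so here is a minimal sketch under stated assumptions
(the formula, variables, and cross-validation settings are illustrative only); it fits a CART model
through caret's common train() interface with 10-fold cross-validation:
library(caret)
library(rpart)
set.seed(1) # reproducible cross-validation folds
tc <- trainControl(method = "cv", number = 10) # 10-fold cross-validation
fit.caret <- train(cc_first_course_grade_points ~ hs_11_gpa + alg_ii_any_c + pre_calc_any_c,
                   data = m0.Statistics,
                   method = "rpart",
                   trControl = tc,
                   na.action = na.omit)
print(fit.caret)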
Correlation Matrix
Create a list of variables to use for correlation matrix. Note we are using up through 12th grade data and
adding the delay variable for semesters between high school and college math for non-direct
matriculants.
MMAPMathVars <-
c("cc_first_course_grade_points","hs_12_gpa","hs_12_course_grade_points","ap_any_c",
"hs_exit_subj_to_cc_entry_subj")
Create a subset of the statistics prediction data set containing only the variables to be used in the
correlation matrix, as defined in the above step. This subsetting method uses square brackets [] with the
general format dataset[rows, columns]. To keep all rows but select only the columns defined in the set
above, the row position is left blank and the column position references the variable name object just
created.
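A minimal sketch, assuming the subset object is named m0.Statistics.subset as in the commands below:
# keep all rows; keep only the columns named in MMAPMathVars
m0.Statistics.subset <- m0.Statistics[, MMAPMathVars]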
cor(m0.Statistics.subset)
install.packages("Hmisc")
library(Hmisc)
rcorr(as.matrix(m0.Statistics.subset), type="pearson") # type can be pearson or spearman
install.packages("corrgram")
library(corrgram)
corrgram(m0.Statistics.subset, order=FALSE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="High School Achievement to College Statistics Grades Correlations")
Logistic Regression
Create a formula. Note we are again using up through 12th grade data with a delay variable.
Run the logistic model using the glm function. By changing the family, other models can be tested.
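The original code is not reproduced; below is a minimal sketch under stated assumptions (the success
indicator name cc_first_course_success_ind and the formula object name are illustrative), producing
the LogRegStat object used in the commands that follow:
# illustrative formula using up-through-12th-grade variables and the delay variable
StatFormula12 <- cc_first_course_success_ind ~ hs_12_gpa + hs_12_course_grade_points +
  ap_any_c + hs_exit_subj_to_cc_entry_subj
# logistic regression; changing the family allows other model families to be tested
LogRegStat <- glm(StatFormula12, data = m0.Statistics, family = binomial(link = "logit"))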
summary(LogRegStat) # model summary with coefficient estimates
exp(coef(LogRegStat)) # exponentiated coefficients (odds ratios)
Data Security
TrueCrypt was a preferred open source encryption option for many years but is no longer supported.
However, it is still used by some legacy users. Search TrueCrypt on the web for more information.
In addition, it is important to securely wipe data when no longer used for analysis. Deleted files can still
be recovered and standard deletion is not sufficient to protect data. To more securely erase files, they
must be overwritten numerous times in a process called “shredding” or “wiping”. Again, your campus IT
department can advise on the best solution for file wiping. Some options to be aware of include:
Note that the most secure methods of wiping files involve highly destructive techniques such as
degaussing (demagnetizing) hard drives, drilling holes in the hard drive, hammering hard drives into
small pieces, dissolving in solvents, and/or incineration. These methods are unlikely to be necessary and
involve some risk to the user.
Additional R Resources
http://www.inside-r.org/
http://cran.r-project.org/doc/manuals/R-intro.pdf
http://www.ats.ucla.edu/stat/r/seminars/intro.htm
http://www.statmethods.net/interface/packages.html
For assistance with obtaining a MMAP analytical file for your college, contact:
For assistance with helping your feeder high schools submit data to CalPASS+, contact: