
Instructions For Using R To Create Predictive Models v5

R is an open source statistical software package used for creating predictive models. It has classification and regression tree (CART) methods that were used for this project. R consumes more memory than other software and computers with at least 8GB RAM are recommended. RStudio provides a graphical user interface for R and was used for this tutorial. Basic commands in R are demonstrated including importing data, examining variables, and creating plots and tables. The rpart, rpart.plot, and dplyr packages used for modeling are installed and loaded.

Using R for Creating Predictive Models

Multiple Measures Assessment Project Phase II Models

Getting Started With R


R is an open source statistical package developed and maintained by the user community. It is free to
use and has many basic and advanced methods available including Classification and Regression Trees
(CART), which were primary factors leading to the decision to use R for this project. R consumes more
memory than other statistical software and computers with at least 8 GB of RAM are recommended.

Download R
Main project page:

https://cran.r-project.org/

Download link for Windows users:

https://cran.r-project.org/bin/windows/base/

Download link for Mac users:

https://cran.r-project.org/bin/macosx/

Download RStudio (optional)


R uses a command line interface. While not necessary, some users like to use an overlay GUI to
work with R. The MMAP team uses RStudio and this tutorial will reference that interface:

https://www.rstudio.com/products/rstudio/download/

Other common interfaces include:

http://www.rcommander.com/

http://mran.revolutionanalytics.com/download/

Read the download notes to install the correct version of software for your system (e.g., Windows,
Mac, or Linux; 32- or 64-bit operating system). Before installing any software, it is advisable to scan the
downloaded package for viruses and malware using your anti-virus software and other software such
as Malwarebytes to ensure the installation package has not been spoofed.

Navigating RStudio
The RStudio interface has four panes (see image below):

Source Pane = R code is written, saved, and run from this pane.

Console Pane = Commands and output are shown in this pane. R commands can also be written directly
in this pane and it has all the functionality of the basic R terminal.

Workspace Pane = Shows datasets and objects created during an analysis. This workspace can be saved
separately from the code in the Source Pane and allows previously imported data, subsets, parameters,
and models to be reloaded readily without rerunning source code.

View Pane = Shows graphical output and allows it to be exported; also allows loading of packages from
the library without code, and displays help documents and other information.

Note these are not official RStudio designations and some tutorials use different names for these panes.
Across the top of the panes is a menu bar that allows many commands to be accessed without utilizing
code. For example, under “Plots” are options to save graphical output and under “Session” are options
to save the workspace. RStudio is similar to other software in that there are often multiple ways to
execute the same command or function. Some RStudio tutorial resources include:

https://support.rstudio.com/hc/en-us/sections/200271437-Getting-Started

RStudio cheat sheets: https://www.rstudio.com/resources/cheatsheets/

http://dss.princeton.edu/training/RStudio101.pdf

http://web.cs.ucla.edu/~gulzar/rstudio/basic-tutorial.html

https://www.youtube.com/watch?v=5YmcEYTSN7k

https://github.com/rstudio/webinars

Google is, of course, an additional help to find answers to specific questions.

Step-by-Step Instructions
1. Launch RStudio
2. Load the data into the workspace.
The command to read in the data depends on the current working directory.
a. Set the working directory so R will point to the correct file location.

Use the following command to set the directory path:

setwd("insert file directory path here")

EX: setwd("C:/Users/Me/Documents/MMAP")

b. Load your data. It is recommended that csv files be used.


This tutorial will use the math file as a running example. Adjust file and variable
names accordingly to use the English, Reading, or ESL data files.

Use the following command to upload the data file into the workspace and name the
data set in the workspace:

dataset <- read.delim("File Name", header = T, row.names = NULL, stringsAsFactors = FALSE)

Example:

MMAPMath <- read.delim("000_mmap_retrospective_math.txt", header = T, row.names = NULL, stringsAsFactors = FALSE)

You can also specify the full path to the file if you have not set a directory.

 Command breakdown
o MMAPMath <- assigns the data read on the right to the object name on the left
o row.names = NULL avoids issues in svm with duplicated row names
o header = T identifies the first row of the file as the header/variable
names
o stringsAsFactors = FALSE keeps character variables as they are instead
of converting them into factors with specific levels. Variables can be
converted to factors later if needed.
o Note you can also use read.csv() to load comma-separated files.
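After loading, a few quick checks help confirm the file was read correctly. A minimal sketch using the MMAPMath object from the example above:

```r
# Confirm the data loaded as expected
dim(MMAPMath)      # number of rows and columns
names(MMAPMath)    # variable names
head(MMAPMath)     # first six rows
```

If the column count is 1, the delimiter was likely wrong for the file type.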

3. Try Basic Commands
Now that you have your data file loaded, let's review some basic commands for gaining familiarity with R and
your data set. Most commands have optional arguments, such as adding labels to plots. To
obtain help on a command, type ? before the command. For example, ?plot will invoke the help page
for the plot command. Web searches can also be helpful for examples and additional command options.
These commands work for command line R and many have menu driven alternatives in overlay
interfaces such as RStudio. Examples use actual variable names from the latest data set available but
note that variable names may have changed subsequent to the posting of this guide and should be
adjusted accordingly.

 R is case-sensitive: commands and variable names must match case exactly. This
tutorial will show primarily lower case variable names to reflect the latest version of MMAP
analysis files. However, if an error is encountered, check the case of the variable name.
 Comments begin with a hash mark (#)
 R operates on objects that are assigned using <-
o x <- 2 assigns the value of 2 to an object called x that can be used in subsequent
commands. Data sets, tables, formulae, model fits, and more can be assigned to an
object in this way.

Referencing variables within a data set

R uses $ to reference variables in the following format:

dataset$variable

Example:

MMAPMath$hs_12_gpa

To view the variables names and type, use the following command:

str(MMAPMath)

View the data set (in RStudio, View(MMAPMath) opens a read-only viewer):

edit(MMAPMath)

Provide basic summary statistics for a numeric variable:

summary(MMAPMath$hs_12_gpa)

Plot a simple histogram for a numeric variable:

hist(MMAPMath$hs_12_gpa)

Scatterplot of two numeric variables:

plot(MMAPMath$hs_12_gpa,MMAPMath$cc_first_course_grade_points)

One way table of frequencies:

table(MMAPMath$cc_first_level_rank)

The table can also be assigned to an object:

table1 <- table(MMAPMath$cc_first_level_rank)

Writing table to tab delimited text file that can be opened with Excel:

write.table(object,"~/sub-directory/file.txt",sep="\t")

The tilde (~) points to your home directory set previously.

Example:

write.table(table1,"~/MathTables/MathFirstCCRank.txt",sep="\t")

Two way table of frequencies:

table(MMAPMath$hs_last_course_rank,MMAPMath$cc_first_level_rank)

Bar plot based on table:

First the table will be loaded into the R workspace as an object called "table1" that can be called in
subsequent commands:

table1 <- table(MMAPMath$cc_first_level_rank)

Entering ‘table1’ into the command line will call the table. Any number, table, plot, data set, etc. can be
loaded into an object for later use. Loading a new value into an existing object will overwrite the
previous item without warning.

barplot(table1)

Add labels to barplot. Note “smart quotes” will generate an error.

barplot(table1, main="Level of First College Math Course", xlab="Level")

See the ggplot2 package for more advanced graphics:

http://www.statmethods.net/advgraphs/ggplot2.html

https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf

https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

Or the rCharts package:


http://rcharts.io/

To install rCharts, use the following commands in the R console:
install.packages('devtools')
require('devtools')
install_github('ramnathv/rCharts')

Tutorials and examples using rCharts:


https://sites.google.com/site/usingrcharts
https://github.com/ramnathv/rCharts

4. Install and load the packages used for the MMAP models:
a. rpart: Classification and Regression Tree (CART) modeling
b. rpart.plot: Tool to create visual graphics of the CART
c. dplyr: filtering and data management tool

Use the following commands in the console:

install.packages("rpart")
install.packages("rpart.plot")
install.packages("dplyr")

In RStudio, you can also use Tools -> Install Packages

You will likely be asked to select a download mirror. Select the mirror closest to you (there are
two in Berkeley, CA). Once the packages are installed, you must load them from your library:

library(rpart)
library(rpart.plot)
library(dplyr)

In RStudio, you can also select the "Packages" tab in the View Pane and check the box next to a
package to load it.
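If you rerun scripts on different machines, one common pattern (a sketch, not required for this tutorial) installs each package only when it is missing, then loads it:

```r
# Install any missing packages, then load all of them
pkgs <- c("rpart", "rpart.plot", "dplyr")
for (p in pkgs) {
  if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
  library(p, character.only = TRUE)
}
```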

5. Create subsets for analysis


Select students with 4 years of high school GPAs and a specific CB21 level for first attempted
college course.

a. Key Variable Descriptions


hs_gpa_present_by_grade

This is a 4 position indicator variable where each position indicates if GPA is present for
a given grade level.

Position 1 = 9th grade level GPA

Position 2 = 10th grade level GPA

Position 3 = 11th grade level GPA

Position 4 = 12th grade level GPA

Examples:

If hs_gpa_present_by_grade equals 1010, the student only has GPAs for 9th and 11th grade.

If hs_gpa_present_by_grade equals 1110, the student has GPAs for 9th, 10th, and 11th grade but not
their senior year.

If hs_gpa_present_by_grade equals 1111, the student has GPAs available for all four years.

cc_first_level_rank

This variable provides the number of levels below transfer of the first attempted course
at community college.

cc_first_level_rank

value   Description                   CB21
0       Transfer level or above       Y
1       One level below transfer      A
2       Two levels below transfer     B
3       Three levels below transfer   C
4       Four levels below transfer    D
5       Five levels below transfer    E

b. General Subsetting
To create a data subset of students who have 4 years of high school GPAs and attempted a
specific level of a college course in a discipline, use the following command (note there
are several ways to subset data):

subset <- filter(dataset, variable1 == value1, variable2 == value2,…, variableN == valueN)

 Command Breakdown
o == a double equal sign must be used to indicate that the value must be
exact
o The filter command in the dplyr package was used in this case but other
approaches exist

Example:

#attempted transfer level statistics

m0.Statistics <- filter(MMAPMath, hs_gpa_present_by_grade==1111, cc_first_level_rank==0, cc_statistics==1)
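As one of the other subsetting approaches mentioned above, the same subset can be built in base R with bracket indexing (a sketch assuming the same MMAPMath data set and variable values):

```r
# Base R equivalent of the filter() example: attempted transfer-level statistics
m0.Statistics <- MMAPMath[MMAPMath$hs_gpa_present_by_grade == 1111 &
                          MMAPMath$cc_first_level_rank == 0 &
                          MMAPMath$cc_statistics == 1, ]
```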

c. Recursive Subsetting
A key difference between Phase I and Phase II of the MMAP project was the
introduction of recursive subsetting of data for lower levels that account for placement
rules applied at the higher levels. For example, for non-STEM (Science, Technology,
Engineering, and Math) math the Phase II rules found that a high school GPA of 3.0 or
higher or 2.3 or higher with a C or better in Pre-Calculus for direct matriculants (up
through 11th grade) corresponded with being highly likely to succeed in Statistics.
Therefore, when building a subset to model for one level below transfer or Intermediate
Algebra, students meeting either Statistics placement rule are filtered out as they would
already have been placed. Then the remaining students would be modeled for

placement into Intermediate Algebra or equivalent. The rules for one level below
transfer level placement would then be applied to the filter for modeling two levels
below transfer and so on.

Example:

#filter for predicting success in one level below transfer math with filters for placements into STEM and
non-STEM transfer level math in two steps

#step 1: filter to one level below algebra type courses

m1 <- filter(MMAPMath, cc_first_course_level_id=='A', grepl("alg", cc_first_course_title, ignore.case = T))
# only include courses with "alg" in the title

#step 2: use the m1 filter object as the first argument then add exclusions for higher level placement
rules using the not (!) expression

m1.DM <- filter(m1, !(hs_11_gpa >= 3.4 |
    (hs_11_gpa >= 2.6 & calc_up11 == 1) |
    hs_11_gpa >= 3.0 |
    (hs_11_gpa >= 2.3 & pre_calc_up11_c == 1)))
6. Run the model
a. Set the control parameters for the modeling.

A typical set of control parameters used as a starting place for MMAP modeling has been:

ctrl0015 <- rpart.control(minsplit=100, cp=0.0015, xval=10)

Larger values for cp result in simpler trees with fewer branches while smaller values for cp
result in more complex trees with more branches. It is recommended to run several models
with different cp values.
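One way to follow this advice is to loop over candidate cp values and compare the resulting trees. This is a sketch: the one-predictor formula is simplified for illustration, and the full rpart() call is shown later in this step.

```r
library(rpart)

# Fit the same tree under several complexity parameters and compare
for (cp in c(0.01, 0.005, 0.0015)) {
  ctrl <- rpart.control(minsplit = 100, cp = cp, xval = 10)
  fit  <- rpart(cc_first_course_success_ind ~ hs_11_gpa,
                data = m0.Statistics, method = "poisson", control = ctrl)
  printcp(fit)   # compare size and cross-validated error across cp values
}
```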

Below is an excerpt from Therneau & Atkinson (1997) describing the control parameters
shown above in detail. The document below shows all possible parameters.
https://cran.r-project.org/web/packages/rpart/rpart.pdf

minsplit: The minimum number of observations in a node for which the routine
will even try to compute a split. The default is 20. This parameter can save
computation time, since smaller nodes are almost always pruned away by cross-
validation.
minbucket: The minimum number of observations in a terminal node. This
defaults to minsplit/3.
xval: The number of cross-validations to be done. Usually set to zero during
exploratory phases of the analysis. A value of 10, for instance, increases the
compute time to 11-fold over a value of 0.

cp: The threshold complexity parameter. The complexity parameter cp is, like
minsplit, an advisory parameter, but is considerably more useful. It is specified
according to the formula:

Rcp(T) ≡ R(T) + cp × |T| × R(T0)

where T0 is the tree with no splits.

b. Regression Tree Modeling

The code below shows the basic structure of running a regression tree model:

ModelName <- rpart(Dependent Variable ~ Predictor Variable 1 + Predictor Variable 2 + … +
    Predictor Variable N, data = dataset, method="poisson", control = ctrl)

Example:

fit.m0.Statistics.DM <- rpart(formula = cc_first_course_success_ind ~ hs_11_gpa +
    hs_11_course_grade_points + pre_alg_up11_c + alg_i_up11_c + alg_ii_up11_c + geo_up11_c +
    trig_up11_c + pre_calc_up11_c + calc_up11_c + stat_up11_c + ap_up11_c + math_eap_ind
    ,data = m0.Statistics
    ,method="poisson"
    ,control=ctrl0015)

 Command Breakdown
o rpart() calls the rpart function that runs the CART analysis
o ~ separates the Dependent Variable from the Predictor/Independent Variables in the
model
o + identifies the additional Predictor/Independent Variables to include in the model
o data = m0.Statistics identifies the specific data set the model should be fit on.
Replace this value for each data set that will be modeled using the regression tree
analysis.
o method = "poisson" identifies the model type and depends on the data
structure of the dependent variable.
 class = used for categorical dependent variables
 anova = used for continuous dependent variables (used by MMAP team in
Phase I when grade points were used)
 poisson = used for counts of events in a time frame, such as survival data (used
by MMAP team in Phase II with the success indicator, which resulted in more
interpretable trees than using class)
 exp = exponential, can also be used for survival data with different distributional
assumptions
o control = ctrl identifies the control parameters for the model. See step 6a for the
specific parameters.

For more information about the control parameters in rpart see:
https://stat.ethz.ch/R-manual/R-devel/library/rpart/html/rpart.control.html

An alternative approach is to save the formula as an object and then reference that object in the
rpart model:

formula11 <- cc_first_course_success_ind ~ hs_11_gpa + hs_11_course_grade_points + pre_alg_up11_c +
    alg_i_up11_c + alg_ii_up11_c + geo_up11_c + trig_up11_c + pre_calc_up11_c + calc_up11_c +
    stat_up11_c + ap_up11_c + math_eap_ind

fit.m0.Statistics.DM <- rpart(formula = formula11, data = m0.Statistics, method="poisson", control=ctrl0015)

Note the formula shown here is a simplified version of the formulae used in the MMAP project.
The formula above includes high school GPA, grade in the last math course, a series of indicators for
completing various levels of math with a C or better in high school, an indicator for completing
any advanced placement (AP) math with a C or better, and an indicator of achieving "college
ready" on the Early Assessment Program (EAP) test. The MMAP team also included versions using
other grades earned, such as B or better or any grade, in addition to C or better.

c. View the output

To view the output, enter the following command:

printcp(ModelName)
print(ModelName)
rsq.rpart(ModelName)

Example:

printcp(fit.m0.Statistics.DM)
print(fit.m0.Statistics.DM)
rsq.rpart(fit.m0.Statistics.DM)

 Command Breakdown
o printcp() prints the table of complexity parameter values and cross-
validated error for the fitted model
o print() prints the output of the analysis and provides a summary of the
branches
o rsq.rpart() plots the approximate R-square change for each split
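rpart also provides plotcp(), which plots cross-validated error against cp and can help choose a pruning point. This sketch goes beyond the original MMAP workflow; the cp value shown is for illustration only.

```r
# Visualize cross-validated error by cp, then prune at a chosen value
plotcp(fit.m0.Statistics.DM)
pruned <- prune(fit.m0.Statistics.DM, cp = 0.005)  # illustrative cp value
print(pruned)
```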

d. Plot the tree

To obtain a visual of the model (decision tree), use the following command:

prp(ModelName, options)

Example:

prp(fit.m0.Statistics.DM,main="Statistics DM",extra=100,varlen=0,left=FALSE)
 prp() is a command to plot the decision tree based on a specific analysis
 extra = 100 is one of many options you can include to customize the tree display
and in this case, show the predicted value and the percent of sample in the
node.
 varlen = 0 forces entire variable name to be printed
 left = FALSE makes rules flow to the right.

For more information about customizing the plots see:

http://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf

http://www.milbo.org/rpart-plot/prp.pdf

e. Export tree graphic


Within RStudio, from the View Pane in the Plots tab, select Export -> Save as Image and
an options window will appear. Select the file type (MMAP recommends png), set the
directory where the image should be saved, other desired options, and click “Save”.

f. Interpreting the Tree

The tree depicted above includes all students in the California Community College (CCC) System
who had four full years of high school data available in CalPASS Plus and who enrolled at a
community college directly out of high school (i.e. direct matriculants or DM) and whose first
math class at a CCC was Statistics. The decision tree algorithm sorts students into distinct
groups, or nodes, based on what their high school performance can tell us about their
likelihood of succeeding in Statistics at college. In this case, the algorithm creates distinct
groups by parsing students out according to their cumulative high school GPA as well as Pre-
Calculus and Algebra II enrollments.
The root node (node 1) splits on high school 11th grade cumulative GPA with the right hand
path including students having a GPA equal to or greater than 3.0 with higher predicted rates of
success in college statistics. To the left are students with a cumulative high school GPA less than
3.0. Cumulative GPA appearing in the root node at the highest level of the tree suggests GPA is
the best predictor of success in college statistics among the variables in the data set.
Reading down the right side of the tree includes students who had an 11th grade cumulative
GPA of 3.0 or higher in high school and enrolled in statistics in college. The first internal node
(node 2) again splits on GPA with students with a 3.3 GPA or higher on the right branch (note

that node numbering was set to go from highest to lowest success rates and differs from the
node numbers assigned within R). At the end of the branch is a terminal node or leaf. This leaf
contains 30% of the student population and has a predicted likelihood of success in college
statistics of 0.88 or 88%. This pathway results in a rule or interpretation that could be phrased
as “students with a cumulative high school GPA of B+ or better are highly likely to succeed in
college statistics.”
If we follow this same internal node to the left to node 4, we are looking at students with a GPA
of 3.0 or more but less than 3.3 and considering whether or not they enrolled in Pre-Calculus.
This is an indicator variable with 0 indicating the student did not attempt Pre-Calculus or 1
indicating the student did enroll in Pre-Calculus. The node 4 rule asks if the value for a student’s
PRE_CALC_UP11 variable is greater than or equal to 0.5, so a "yes" means the student's value
is 1, the student did enroll in Pre-Calculus, and the path leads down the right branch to node 5
(note this tree was run with an older data set that contained all-caps variable names). This node
contains 8% of the sample with an estimated 0.81 or 81% probability of success in college
statistics. If the student’s PRE_CALC_UP11 is less than 0.5, then the value is 0 and indicates the
student did not attempt Pre-Calculus and leads down the left branch to node 6. This node
contains 16% of the sample with an estimated 0.7 or 70% probability of success in college
statistics.
If you start back up at the root and travel down the left side of the tree which forks at less than
a 3.0 GPA you will follow the same process of interpreting the nodes and leaves. All leaves
combined represent 100% of the sample that was used to generate the decision tree output
and the rule set for Statistics for direct matriculants. It should be noted that math courses also
have suggested minimum course requirements. For Statistics, this is passing at least Algebra I or
a higher course prior to enrolling in college level Statistics. The data file from CalPASS Plus
checks for the minimum course requirement for all math courses; however, if your college is
deviating from these models, you will want to check students' transcripts against this
recommendation.
The rule sets for this tree are based only on those nodes which meet or exceed the criterion of
0.70, or a 70% success rate or higher. This rate is often higher than most colleges’ success rate
in this course. If a college would like to relax this criterion to, say, 0.55 or 55%, then it could
also include nodes 10 and 11 by allowing students into Statistics with a 2.3 GPA or greater as
well as those completing Algebra II with a C or better.
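The leaf likelihoods discussed above can also be obtained per student with predict(), which is useful for checking how many students meet a criterion. A minimal sketch assuming the fitted model and data set from step 6:

```r
# Predicted likelihood of success for each student in the modeling data set
pred <- predict(fit.m0.Statistics.DM, newdata = m0.Statistics)
summary(pred)         # distribution of predicted success rates
table(pred >= 0.70)   # how many students meet the 0.70 criterion
```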

Advanced Tips and Tricks
Run model predicting grade points
In Phase 1, the MMAP team ran models that predicted grade points rather than a binary success
indicator. In Phase 2, binary success indicators with the Poisson method were used as they provided
more interpretable trees. To run a tree predicting grade points, you must use the anova method to
handle a continuous dependent variable, which in this example is the grade points earned in the
first community college statistics course attempted. The parts of the code that have changed from
the earlier example are the dependent variable and the method.

Run Model
fit.m0.Statistics.DM.gp <- rpart(formula = cc_first_course_grade_points ~
hs_11_gpa + pre_alg_any_c + alg_i_any_c + alg_ii_any_c + geo_any_c + trig_any_c +
pre_calc_any_c + calc_any_c + stat_any_c + math_eap_ind + hs_exit_subj_to_cc_entry_subj + ap_any_c
,data = m0.Statistics
,method="anova"
,control=ctrl0015)

View Output
printcp(fit.m0.Statistics.DM.gp)
print(fit.m0.Statistics.DM.gp)
rsq.rpart(fit.m0.Statistics.DM.gp)
prp(fit.m0.Statistics.DM.gp,main="Statistics with Grade Points, DM",extra=100,varlen=0,left=FALSE)

Testing multiple models with caret


CART models were used for the decision rule sets, but in Phase 1, predictions were compared against
linear regression, support vector machines, and gradient boosting models using the caret package. This
package allows you to keep the same predictive equation and easily change the algorithm to readily test
a variety of different analytical approaches with a minimum of coding. Example code using caret for
MMAP is shown in the Phase 1 R Scripts document. Resources for the caret package are shown below.

caret packages:
install.packages("caret") #core package for caret
install.packages("e1071") #additional package that fixes errors that arise with some models
Tutorials on caret:
http://www.edii.uclm.es/~useR-2013/Tutorials/kuhn/user_caret_2up.pdf
https://www.youtube.com/watch?v=7Jbb2ItbTC4
List of the methods in the caret package:
http://topepo.github.io/caret/modelList.html
caret training site:
http://topepo.github.io/caret/training.html
Additional information on training parameters:
http://www.inside-r.org/packages/cran/caret/docs/train

Correlation Matrix

Create a list of variables to use for correlation matrix. Note we are using up through 12th grade data and
adding the delay variable for semesters between high school and college math for non-direct
matriculants.

MMAPMathVars <- c("cc_first_course_grade_points", "hs_12_gpa", "hs_12_course_grade_points",
    "ap_any_c", "hs_exit_subj_to_cc_entry_subj")

Create subset of the statistics prediction subset with only variables to be used in correlation matrix as
defined in the above step. This subsetting method uses square brackets [] with the general format:

dataset[row condition, column condition]

To leave all rows but select only columns defined in the set above, the row condition is left blank and
the column condition references the variable name object just created.

m0.Statistics.subset <- m0.Statistics[, MMAPMathVars]

Basic correlation function native to R that shows only coefficients:

cor(m0.Statistics.subset)

Correlations with significance levels using the Hmisc package.

install.packages("Hmisc")
library(Hmisc)
rcorr(as.matrix(m0.Statistics.subset), type="pearson") # type can be pearson or spearman

One of several packages to visually display a correlation matrix is corrgram.

install.packages("corrgram")
library(corrgram)
corrgram(m0.Statistics.subset, order=FALSE, lower.panel=panel.shade,
upper.panel=panel.pie, text.panel=panel.txt,
main="High School Achievement to College Statistics Grades Correlations")

Logistic Regression

Create a formula. Note we are again using up through 12th grade data with a delay variable.

formulaNDM <- cc_first_course_success_ind ~ hs_12_gpa + hs_12_course_grade_points +
    pre_alg_any_c + alg_i_any_c + alg_ii_any_c + geo_any_c + trig_any_c + pre_calc_any_c +
    calc_any_c + stat_any_c + ap_any_c + math_eap_ind + hs_exit_subj_to_cc_entry_subj

Run the logistic model using the glm function. By changing the family, other models can be tested.

LogRegStat <- glm(formulaNDM,family = "binomial",data= m0.Statistics)

Standard output with unstandardized coefficients:

summary(LogRegStat)

Show odds ratios (exponentiated coefficients):

exp(coef(LogRegStat))

Show odds ratios with confidence intervals:

exp(cbind(OR = coef(LogRegStat), confint(LogRegStat)))
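Predicted success probabilities from the logistic model can be compared with the tree's leaf estimates. A sketch assuming the LogRegStat fit above:

```r
# Predicted probability of success for each student in the modeling data set
probs <- predict(LogRegStat, type = "response")
summary(probs)         # distribution of predicted probabilities
table(probs >= 0.70)   # students meeting the 0.70 criterion under this model
```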

General Notes About Security


While the MMAP data files do not contain sensitive identifiers such as SSN, they are still student level
files that should be maintained securely. It is recommended that you store the files in a protected
location and/or use encryption. Your campus IT professionals can provide advice for the best security
solution for your campus. It is strongly recommended that you consult with your IT department
before installing encryption software on campus managed computers. Some encryption software to be
aware of includes:

 BitLocker for PC's (included with Windows Professional) http://windows.microsoft.com/en-us/windows-vista/bitlocker-drive-encryption-overview
 FileVault included on Macs https://support.apple.com/en-us/HT204837
 (Linux Unified Key Setup) LUKS included with Ubuntu
https://help.ubuntu.com/community/EncryptedFilesystems
 Pretty Good Privacy (PGP) by Symantec http://buy.symantec.com/estore/clp/category-encryption
 An extensive list of encryption software can be found here:
https://en.wikipedia.org/wiki/Comparison_of_disk_encryption_software

TrueCrypt was a preferred open source option for many years but is no longer supported. However, it is
still used by some legacy users. Search TrueCrypt on the web for more information.

In addition, it is important to securely wipe data when no longer used for analysis. Deleted files can still
be recovered and standard deletion is not sufficient to protect data. To more securely erase files, they
must be overwritten numerous times in a process called “shredding” or “wiping”. Again, your campus IT
department can advise on the best solution for file wiping. Some options to be aware of include:

 sdelete free utility for Windows https://technet.microsoft.com/en-us/sysinternals/sdelete.aspx


 CCleaner for Windows http://www.piriform.com/ccleaner
 “Secure empty trash” built in on Macs
https://support.apple.com/kb/PH18638?locale=en_US&viewlocale=en_US
 Shred and Wipe commands for Linux in Ubuntu http://askubuntu.com/questions/57572/how-to-delete-files-in-secure-manner

Note that the most secure methods of wiping files involve highly destructive techniques such as
degaussing (demagnetizing) hard drives, drilling holes in the hard drive, hammering hard drives into
small pieces, dissolving in solvents, and/or incineration. These methods are unlikely to be necessary and
involve some risk to the user.

Additional R Resources
http://www.inside-r.org/

http://cran.r-project.org/doc/manuals/R-intro.pdf

http://www.ats.ucla.edu/stat/r/seminars/intro.htm

http://www.statmethods.net/interface/packages.html

For assistance with this document, contact:

Terrence Willett, twillett@rpgroup.org

Craig Hayward, chayward@rpgroup.org

Loris Fagioli, lfagioli@ivc.edu

Mallory Newell, newellmallory@deanza.edu

For assistance with obtaining a MMAP analytical file for your college, contact:

Daniel Lamoree, dlamoree@edresults.org

For assistance with helping your feeder high schools submit data to CalPASS+, contact:

John Hetts, jhetts@edresults.org

