Intro 2 R
Aedin Culhane
Email: aedin@jimmy.harvard.edu
http://bcb.dfci.harvard.edu/~aedin
http://www.hsph.harvard.edu/research/aedin-culhane/
Contents
1 Introduction to R Software
1.1 Obtaining R and RStudio
1.2 The default R interface
1.2.1 Default R Console
1.2.2 Default R Editor
1.2.3 R Shortcuts
1.3 RStudio Interface
1.3.1 console
1.3.2 Editor
1.3.3 Workspace, history
1.3.4 Files, Plots, Packages, Help
1.3.5 Projects, SVN in RStudio
1.4 Starting out - setting a working directory
1.4.1 Managing and accessing folders in R
1.5 First R Encounter
1.6 Getting help with functions and features
1.7 R as a big calculator
1.8 A few important points on R
1.8.1 Use arrow keys to browse command history
1.9 Basic operators
1.9.1 Comparison operators
1.9.2 Logical operators
1.10 Data sets in R
1.11 R sessions (workspace) and saving session history
1.12 R Packages
1.13 Installing new R libraries
1.14 Customizing Startup
2 Objects in R
2.1 Using ls and rm to manage R Objects
2.2 Types of R objects
2.3 Attributes of R Objects
2.4 Creating and accessing objects
2.5 Modifying elements
2.5.1 Sorting and Ordering items
2.5.2 Missing Values
2.5.3 Creating empty vectors and matrices
2.6 Quick recap
2.7 Exercise 1
5 Introduction to graphics in R
5.1 The R function plot()
5.2 Exercise 7
5.2.1 Arguments to plot
5.2.2 Other useful basic graphics functions
5.3 Editing the default plot with low-level plotting commands
5.4 Default parameters - par
5.4.1 Interactive plots in RStudio - Effect of changing par
5.4.2 R Colors
5.5 Interacting with graphics
5.5.1 Exercise 8 - Plotting
5.6 Saving plots
5.6.1 RStudio
5.6.2 Devices
5.6.3 Difference between vector and pixel images
5.7 Useful Graphics Resources
7.8 Statistical models
7.8.1 Linear Regression: Weighted Models, Missing Values
7.8.2 Generalized linear modeling
7.8.3 Other packages
7.9 Survival modeling
7.9.1 Censored Data
7.9.2 Kaplan-Meier curve estimation
7.9.3 Cox proportional hazards model
7.10 Exercise 11: Survival Analysis
R set up script for this manual
lapply(pkgs, checkPkg)
lapply(Biocpkgs, checkBioCPkg)
search()
Chapter 1
Introduction to R Software
– File - load script, load/save session (workspace) or command history. Change Directory
– Edit - Cut/Paste. GUI preferences
– View
– Misc - stop computations, list/remove objects in session
– Packages - install new packages or update installed ones
– Windows
– Help - An essential resource!
Figure 1.1: R interface with windows installation of R
• If the prompt instead shows the continuation prompt '+', the command you typed is incomplete
and R is waiting for you to complete it (usually it is missing a closing quote or bracket)
• Use the up and down arrow keys to scroll through previous commands. This is useful if you would like
to repeat a previous command
• R also includes basic automatic completion for function names and filenames. Press the Tab key to
see a list of possible completions for a function name or filename.
• Highlight the commands and type Ctrl+R to submit the commands for R evaluation
• Evaluation of the commands can be stopped by pressing the Esc key or the Stop button
1.2.3 R Shortcuts
Keyboard Shortcuts for traditional R GUI
• Down Arrow: brings back the next command (Up and down basically scrolls up and down through
your history)
1.3 RStudio Interface
RStudio is a free and open source integrated development environment for R. Those familiar with MATLAB
will recognize the layout, as it is quite similar. RStudio provides a brief 2 minute guide to the interface
on its website http://rstudio.org/ which I recommend watching.
On startup RStudio brings up a window with 3 or 4 panels. If you only see 3 panels, click on File
-> New -> New R Script.
The first thing to notice is that the bottom left panel, "console", is exactly the same as the standard R
console. RStudio just loads your local version of R. You can specify a different version of R (if you have
multiple versions of R installed on your machine) by clicking on Tools -> Options and selecting R
version.
1.3.1 console
RStudio has some nice console features
• Start typing a command, for example fi, and press the TAB key. It will suggest functions that begin with
fi
fisher.test(
and then press the TAB key. You will notice it brings up help on each parameter; you can browse these
and it will help you get familiar with R.
Figure 1.3: Tab not only auto-completes, but also suggests parameters and input to the function. Note it says
press F1 for further help on this function
• Press F1 to bring up a help document about the function in the help panel (bottom right)
• Press F2 to show the source code for the function
There are many useful keyboard shortcuts in RStudio. For a full list of these see
http://rstudio.org/docs/using/keyboard_shortcuts
1.3.2 Editor
The top left panel is an editor which can be used to edit R scripts (.R), plain text (.txt), HTML files,
Sweave (.Rnw) or markdown (.md) files; the latter two can be converted to PDF files. There are several nice
features of this text editor which we will describe during the course. For now, note that it highlights R
code, and that the code is searchable (press Control-F to search).
In the Code menu, you can set preferences to highlight, indent or edit code.
• Workspace lists the objects in the current R session. You can load, save or "Clear All" objects in a
workspace
• Note that under the workspace panel there is the option to Import Dataset
• The history panel lists all of the commands that have been typed or input in the console. There are
options to load, save, search or delete history
• One can easily repeat a command by highlighting one or more line(s) and sending these To Console
• One can easily copy a command to a new R script or text file, by highlighting one or more line(s) and
sending these To Source
1.3.4 Files, Plots, Packages, Help
On the bottom right there is a tabbed panel with Files, Plots, Packages and Help.
• Files is a file browser, which allows you to create, rename or delete a folder.
Click on the triple dot icon (...) on the far right of the menu to browse folders. Under the More menu
you can set your current working directory (more about that below). If you double click on a text, .R,
Sweave or html file it will automatically open in the Editor.
• The Plots window displays plots generated in R. Simply type the following commands into the Console
window
plot(1:10)
plot(rnorm(10), 1:10)
This will create 2 plots. Use the arrows to browse plots, and click on zoom, export or delete to manage
plots.
• Packages lists all of the packages installed on your computer. The packages with tick marks are those
loaded in your current R session. Click on a package name to view help on that package. Note you
can install packages or check for updates. You can also search for a package or search package
descriptions using the search window.
• The Help tab provides extensive R help. The arrow buttons go forward or back through recent
help pages you have viewed. You can go home (house icon), print or open help in a new window. To
search help, use the search window. Help can also be browsed through the main menu bar at the top
of the page
1. In the classic R interface, use the File menu to change directory: File -> Change dir
2. If you start R by clicking on an R icon, you may wish to change the default start location by
right-clicking on the R icon on the desktop/start menu and changing the "Start In" property. For
example, make a folder 'C:/work', and set it as the "Start in" folder
3. In RStudio, use Tools -> Set Working Directory
4. Or in RStudio click on the Files tab (on the bottom right panel). Use the File browser window to view
the contents of a directory and navigate to the directory you wish to set as your working directory.
• Click on the triple dot icon on the top right. Navigate to the correct directory
• Once you are in the correct directory and see your data files, click on More (blue cogwheel),
and select "Set as Working Directory"
Figure 1.4: Note the triple dot icon on the far right and the blue cog wheel
To see folders or files in the working directory, use the command dir() (or browse the files using the
Files Browser panel in RStudio)
dir()
dir(pattern = ".txt")
You can also give a full or more complex directory path. A path can be relative to the current location;
in this case two dots mean "the directory above"
setwd("../../RWork/colonJan13")
Or you can specify a full directory path. For cross-platform compatibility, it is best to use file.path() to
create paths. For example
if (file.exists(wkdir)) dir.create(newdirPath)
dir(pattern = "My")
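As a minimal sketch of the file.path() approach, the following builds a path from its parts and creates the directory only if it does not already exist. The folder names and the use of tempdir() as the parent directory are assumptions, chosen so that the example is safe to run on any machine.

```r
# Build a platform-independent path; tempdir() is a stand-in parent directory
wkdir <- tempdir()
newdirPath <- file.path(wkdir, "RWork", "colonJan13")   # example folder names

# Create the directory (and any missing parent folders) only if it is not there yet
if (!file.exists(newdirPath)) dir.create(newdirPath, recursive = TRUE)

file.exists(newdirPath)          # check that the directory now exists
dir(file.path(wkdir, "RWork"))   # list the contents of the parent folder
```

Because file.path() inserts the separator itself, the same script runs unchanged on Windows, Mac and Linux.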
Important side note: R does not accept the single backslash (\) that Windows uses to separate folders in a
file path. Indeed it will return a rather cryptic error
> setwd("C:\Users\aedin\Documents\Rwork\colonJan13")
Error: "\U" used without hex digits
in character string starting "C:\U"
There are a couple of ways to prevent this: either replace each backslash (\) with a forward slash (/) or
with a double backslash (\\). The first backslash tells R to treat the next character literally. It is called
an escape character, which invokes an alternative interpretation of subsequent characters.
setwd("C:/Users/aedin/Documents/Rwork/colonJan13")
setwd("C:\\Users\\aedin\\Dropbox\\Talks\\CDC\\Notes")
This can be rather tedious. A nice way to make scripts work across platforms is to use the commands
path.expand and file.path. The tilde symbol (~) is a shortcut to your home directory (on any operating system)
path.expand("~")
myhome <- path.expand("~")
newdir <- file.path(path.expand("~"), "Rwork", "colonJan13")
setwd(newdir)
In my case, I work on several Windows machines, each with a fairly different directory structure. By
setting the home directory properly, and keeping a similar directory structure beneath it for my R projects,
I can sync code between computers and have it run properly on each one.
Advanced Side Note: What is your system home directory
Your system HOME (~) is set by your operating system. To view or change it, type
Sys.getenv("HOME")
## [1] "C:/Users/aedin/Documents"
# 'path' is the directory you wish to set as your new home directory
Sys.setenv(HOME = "path")
1.5 First R Encounter
When you start R, and see the prompt > , it may appear as if nothing is happening. But this prompt is
now awaiting commands from you. This can be daunting to the new user, however it is easy to learn a few
commands and get started
demo()
Note that we ALWAYS place round brackets after a function name when calling it. If the command is not
complete on one line (missing close bracket or quote), R will show the continuation prompt '+'.
If the brackets are empty (), the function runs with its default parameters. To specify parameters,
insert them into the brackets. For example, to run a demo on a specific topic we give that topic as a
parameter to demo
demo(graphics)
demo(nlm)
?matrix
The last command will run all the examples included with the help for a particular function. If we
want to run particular examples, we can highlight the commands in the help window and submit them
by typing Ctrl+V
• If you don't know the command and want to do a keyword search for it:
help.search("combination")
help.start()
help.start() will open an HTML help browser or an MS Windows help browser (depending on your
preferences) in which you can browse and search R documentation.
• Finally, there is a large R community who are incredibly helpful. There is a mailing list for R,
Bioconductor and almost every R project. It is useful to search the archives of these mailing lists.
Frequently you will find someone encountered the same problem as you, and previously asked the R
mailing list for help (and got a solution!).
• There are numerous useful resources for learning R on the web, including the R project
http://www.r-project.org and its mailing lists. I also recommend the following:
– In the December 2009 issue of the R Journal: Transitioning to R: Replicating SAS, Stata, and
SUDAAN Analysis Techniques in Health Policy Data. Anthony Damico
http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Damico.pdf
– SAS and R. A blog devoted to examples of tasks (and their code) replicated in SAS and R
http://sas-and-r.blogspot.com/
– R for SAS and SPSS Users. A free 80 page document, which contains over 30 programs written
in all three languages, can be downloaded from http://rforsasandspssusers.com/
2 + 2
## [1] 4
2 * 2
## [1] 4
2 * 100/4
## [1] 50
2 * 100/4 + 2
## [1] 52
2 * 100/(4 + 2)
## [1] 33.33
2^10
## [1] 1024
log(2)
## [1] 0.6931
Note that even in the simple use of R as a calculator, it is useful to store intermediate results. For example,
let's store the value of log(2) with (tmpVal=log(2)).
## [1] 0.6931
tmpVal
## [1] 0.6931
exp(tmpVal)
## [1] 2
In this case, we assigned a symbolic variable tmpVal. Note when you assign a value to such a variable,
there is no immediate visible result. We need to print(tmpVal) or just type tmpVal in order to see
what value was assigned to tmpVal
2 * 5^2
## [1] 50
x <- 2 * 5^2
print(x)
## [1] 50
y <- 2 * 5^2
z <- 2 * 5^2
print(y)
## [1] 50
x == y
## [1] TRUE
y == z
## [1] TRUE
Z <- 20
x == z
## [1] TRUE
x == Z
## [1] FALSE
• '==' and '=' have very different uses in R. == is a binary operator which tests for equality (A==B
determines if A 'is equal to' B), whereas = assigns a value or names an argument in a function call.
• Comments can be put anywhere. To comment text, insert a hash mark #. Everything following it to
the end of the line is commented out (ignored, not evaluated).
• Quotes: you can use either " double or ' single quotes, as long as they are matched.
• For names, normally all alphanumeric symbols are allowed plus . and _ Start names with a letter
[A-Za-z], not a numeric character [0-9]. Avoid using single characters or function names such as t, c, q,
diff, mean, plot etc.
• Arguments (parameters) to a function call f(x) are enclosed in round brackets. Even if no
arguments are passed to a function, the round brackets are required.
print(x)
getwd()
Bracket Use
() To set priorities 3*(2+4). Function calls f(x)
[] Indexing in vectors, matrices, data frames
{} Creating new functions. Grouping commands {mean(x); var(x)}
[[]] Indexing of lists
rnorm(5)
If you wish to generate the same set of random numbers each time, you could set.seed(10)
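To make the effect of set.seed() concrete, here is a small sketch. The seed value 10 is taken from the text; the object names a, b and d are only illustrative.

```r
set.seed(10)
a <- rnorm(5)      # five random numbers drawn after setting the seed

set.seed(10)
b <- rnorm(5)      # resetting the same seed reproduces the same numbers
identical(a, b)

d <- rnorm(5)      # without resetting the seed, the random stream simply continues
identical(a, d)
```

Setting the seed at the top of a script is a simple way to make an analysis that uses random numbers reproducible.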
You can view previous expressions entered into the R session (default 25) using the function
history()
(this is discussed in more detail later on). You can also view the history of R commands in the history
tab on the top right panel in RStudio
1.9 Basic operators
We already saw that == tests for equality or a match between 2 objects. Other operators are:
• not equal: !=
1 == 1
## [1] TRUE
x <- 1:10
y <- 10:1
x > y & x > 5
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
x == y | x != y
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
• NOT ! The ’!’ sign returns the negation (opposite) of a logical vector.
!x > y
## [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
These return a logical vector of TRUE or FALSE and can be useful for filtering (we will see this later)
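As a brief preview of filtering, the sketch below reuses the vectors x and y from above; the name keep is just an illustrative choice.

```r
x <- 1:10
y <- 10:1

keep <- x > y & x > 5   # logical vector: TRUE where both conditions hold
x[keep]                 # select only the elements of x where keep is TRUE
sum(keep)               # TRUE counts as 1, so this counts the matches
which(keep)             # positions of the TRUE values
```

Placing a logical vector inside square brackets keeps only the elements that correspond to TRUE.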
data()
To load a dataset, use data() with its name. For example, the dataset women gives the average heights and
weights for 15 American women aged 30-39.
data(women)
ls()
ls(pattern = "w")
Figure 1.5: The Workspace windows lists the object currently in the R workspace. You can click on each
item to view or edit it. Note women is a data table with dimensions 15 rows x 2 columns, you can click on
the table icon to view it. Z and y are a single value (50). x and y are integer vectors of length 10.
1.11 R sessions (workspace) and saving session history
To finish up today, we will save our R session and history
1. R session: One can save one or more R objects to a file using save(), or save the entire R
session (workspace) using save.image().
To load this into R, start a new R session and use load()
rm(women)
ls(pattern = "women")
load("women.RData")
ls(pattern = "women")
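The save and load round trip can be sketched as follows. The object myVec and the file location inside tempdir() are assumptions made so that the example does not write into your working directory.

```r
myVec <- 1:10
rdataFile <- file.path(tempdir(), "myVec.RData")   # hypothetical file location

save(myVec, file = rdataFile)       # save a single object to a file
rm(myVec)                           # remove it from the session
exists("myVec", inherits = FALSE)   # the object is gone

load(rdataFile)                     # restore it under its original name
myVec
```

Note that load() restores objects under the names they had when they were saved, silently overwriting any existing objects with the same names.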
2. R history: R records the command history of an R session. To view the most recent R commands in a
session
history()
help(history)
history(100)
history(pattern = "save")
savehistory(file = "L2.Rhistory")
3. Note you can also browse and search history very easily in the RStudio history window. A really
nice feature of this window is the ease of sending commands either to the console (to execute code
again) or to Source (to a text file or script you are writing in the editor)
Figure 1.6: You can easily save or search command history, send commands to the R console or a source
(script) file
1.12 R Packages
By default, R is packaged with a small number of essential packages; however, as we saw, there are many
contributed R packages.
1. Some packages are loaded by default with every R session. The libraries included in the Table are
loaded on the R startup.
search()
sessionInfo()
3. To see which packages are installed on your computer, issue the command
library()
Within RStudio, installed packages can be viewed in the Packages tab of the lower right panel. You
can tick or select a library to load it in R.
You will very likely want to install additional packages or libraries.
Figure 1.7: The list shows all installed packages. A tick mark indicates that the package is loaded into the
current R session. Clicking on the package name will open help for that package
2. Check you have write permission to the path where it will write the files (use the following R functions
to list your RStudio paths)
.rs.defaultUserLibraryPath()  # Path where RStudio installs packages
.rs.rpc.get_package_install_context()  # Other useful RStudio configuration info
3. Install packages using the basic R GUI, via the drop-down menu Packages, or from the command line
(install.packages). First click on "Packages" and "Set CRAN mirror" and choose an available mirror (choose
one close by; it will hopefully be faster). Then, if you know the name of the package you want to install, or if
you want to install all the available packages, click on "Install Packages"
4. Open a CMD shell and type R CMD INSTALL packageName.tar.gz. You can open a CMD shell from
RStudio via Tools -> Shell
Figure 1.9: Type the name of the package. Unfortunately RStudio does not display a list of all available
packages
Installation of all packages takes some time and space on your computer. If the name of the package is
not known, you could use Task Views, help or the archives of the mailing list to pinpoint one. Also look at the
Task Views descriptions of packages on the R website (see Additional Notes in Installation which I have provided).
To get information on a package, type
library(help = lme4)
Once you have installed a package, you do NOT need to re-install it. But to load the library in your
current R session use the commands
search()
detach(package:lme4)
search()
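The load and detach cycle can be sketched as below. The package splines is used only because it ships with R and so stands in for any installed package (such as lme4); the variable names are illustrative.

```r
pkg <- "splines"   # ships with R; stands in for any installed package

# requireNamespace() checks that a package is installed without attaching it
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)   # only reached if missing; needs a CRAN mirror
}

library(pkg, character.only = TRUE)                       # attach the package
attachedAfterLoad <- "package:splines" %in% search()

detach("package:splines", character.only = TRUE)          # remove it from the search path
attachedAfterDetach <- "package:splines" %in% search()

c(attachedAfterLoad, attachedAfterDetach)
```

Installing is done once per computer, whereas library() must be called in every new R session that uses the package.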
NOTE: Packages are often inter-dependent, and loading one may cause others to be automatically
loaded.
Figure 1.10: Within the options window you can change the version of R, your default directory and SVN
preferences
In addition you can specify preferences for either a site or local installation in Rprofile.site. On Windows,
this file is in the C:\Program Files\R\R-x.y.z\etc directory where x.y.z is the version of R. Or
you can also place a .Rprofile file in the directory that you are going to run R from or in the user home
directory.
At startup, R will source the Rprofile.site file. It will then look for a .Rprofile file in the current working
directory. If it doesn’t find it, it will look for one in the user’s home directory.
There are two special functions you can place in these files. .First( ) will be run at the start of the R
session and .Last( ) will be run at the end of the session. These can be used to load a set of libraries that you
use most.
# General options
options(tab.width = 2)
options(width = 100)
options(digits = 5)
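As an illustration of the two special functions, a .Rprofile might look like the following sketch. The messages and the package attached in .First are examples only, not the settings used for this manual.

```r
# Sketch of a .Rprofile file; the contents are examples only
options(width = 100)
options(digits = 5)

.First <- function() {
  # Runs at the start of the R session: attach the packages you use most
  library(stats4)   # example; any installed package could go here
  cat("Session started:", date(), "\n")
}

.Last <- function() {
  # Runs at the end of the R session
  cat("Session ended:", date(), "\n")
}
```

Keep such a file short: anything in it runs in every session, which can make scripts behave differently on another computer that lacks the same .Rprofile.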
Chapter 2
Objects in R
objects()
ls()
rm(x, y, z, junk)
• A vector is an ordered collection of numerical, character, complex or logical objects. Vectors are
atomic: all components must be of the same data type (mode). For example
# Numeric
vec1 <- 1:10
vec1
## [1] 1 2 3 4 5 6 7 8 9 10
# Character
vec2 <- LETTERS[1:10]
vec2
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
# logical
vec3 <- vec2 == "D"
vec3
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
In each case above, these vectors have 10 elements, and are of length=10.
• A matrix is a two-dimensional collection of data entries of the same type. It can have rownames
and colnames.
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
dim(mat1)
## [1] 5 2
## A B
## N1 1 6
## N2 2 7
## N3 3 8
## N4 4 9
## N5 5 10
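Output like that shown above can be produced along the following lines. This is a reconstruction under the assumption that mat1 was built from vec1; the original commands are not shown in this section.

```r
vec1 <- 1:10
mat1 <- matrix(vec1, nrow = 5, ncol = 2)   # values are filled column by column
mat1
dim(mat1)                                  # number of rows, number of columns

# Add column and row names
colnames(mat1) <- c("A", "B")
rownames(mat1) <- paste("N", 1:5, sep = "")
mat1
```

Use the argument byrow = TRUE in matrix() if you would rather fill the matrix row by row.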
• A list is an ordered collection of objects that can be of different modes (e.g. numeric vector, array,
etc.).
a <- 20
newList1 <- list(a, vec1, mat1)
print(newList1)
## [[1]]
## [1] 20
##
## [[2]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[3]]
## A B
## N1 1 6
## N2 2 7
## N3 3 8
## N4 4 9
## N5 5 10
##
## $a
## [1] 20
##
## $myVec
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $mat
## A B
## N1 1 6
## N2 2 7
## N3 3 8
## N4 4 9
## N5 5 10
##
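The second printout above, with $a, $myVec and $mat, is what a list with named elements looks like. A sketch of how such a list could be created follows; the name newList2 is an assumption.

```r
a <- 20
vec1 <- 1:10
mat1 <- matrix(vec1, ncol = 2,
               dimnames = list(paste("N", 1:5, sep = ""), c("A", "B")))

# Naming the elements when the list is created ('newList2' is an assumed name)
newList2 <- list(a = a, myVec = vec1, mat = mat1)
names(newList2)

newList2$a                  # access an element by name
newList2[["myVec"]]         # equivalent double square bracket form
newList2[[3]]["N2", "B"]    # index into the matrix stored in the list
```

Named elements make code easier to read than positional access such as newList2[[3]].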
• Though a data.frame is a restricted list with class data.frame, it may be regarded as a matrix
with columns that can be of different modes. It is displayed in matrix form, rows by columns. (It is like
an Excel spreadsheet)
## A B
## N1 1 6
## N2 2 7
## N3 3 8
## N4 4 9
## N5 5 10
## [1] "A" "B" "C" "A" "B" "C" "A" "B" "C" "A" "B" "C" "A" "B" "C" "A"
## [17] "B" "C" "A" "B" "C" "A" "B" "C" "A" "B" "C" "A" "B" "C"
## charVec
## A B C
## 10 10 10
## [1] A B C A B C A B C A B C A B C A B C A B C A B C A B C A B C
## Levels: A B C
attributes(fac1)
## $levels
## [1] "A" "B" "C"
##
## $class
## [1] "factor"
##
levels(fac1)
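The character vector and factor printed above can be recreated with a sketch like this one, assuming charVec was built by repeating the first three letters ten times.

```r
charVec <- rep(LETTERS[1:3], 10)   # "A" "B" "C" repeated 10 times
table(charVec)                     # counts of each value

fac1 <- factor(charVec)            # convert the character vector to a factor
levels(fac1)                       # the distinct categories
as.integer(fac1)[1:6]              # factors are stored as integer codes
```

A factor stores each category once as a level and records only an integer code per element, which is why it prints without quotes.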
• array: An array in R can have one, two or more dimensions. I find it useful for storing multiple related
data.frames (for example when I jack-knife or permute data). Note if there are insufficient objects to
fill the array, R recycles (see below)
array(1:24, dim = c(2, 4, 3))
## , , 1
##
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
##
## , , 2
##
## [,1] [,2] [,3] [,4]
## [1,] 9 11 13 15
## [2,] 10 12 14 16
##
## , , 3
##
## [,1] [,2] [,3] [,4]
## [1,] 17 19 21 23
## [2,] 18 20 22 24
##
## , , 1
##
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
##
## , , 2
##
## [,1] [,2] [,3] [,4]
## [1,] 9 11 13 15
## [2,] 10 12 14 16
##
## , , 3
##
## [,1] [,2] [,3] [,4]
## [1,] 17 19 21 23
## [2,] 18 20 22 1
##
## , , X
##
## A B C D
## Patient1 1 3 5 7
## Patient2 2 4 6 8
##
## , , Y
##
## A B C D
## Patient1 9 11 13 15
## Patient2 10 12 14 16
##
## , , Z
##
## A B C D
## Patient1 17 19 21 23
## Patient2 18 20 22 1
##
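The recycling seen in the last cell above (a 1 where 24 would be) can be reproduced with the following sketch. The calls are a reconstruction that assumes 23 values were supplied for 24 cells; arr1 and arr2 are illustrative names.

```r
# 23 values for 2 x 4 x 3 = 24 cells: R recycles from the start to fill the gap
arr1 <- array(1:23, dim = c(2, 4, 3))
arr1[2, 4, 3]    # the last cell holds the recycled first value

# The same array with names for each dimension
arr2 <- array(1:23, dim = c(2, 4, 3),
              dimnames = list(paste("Patient", 1:2, sep = ""),
                              LETTERS[1:4],
                              c("X", "Y", "Z")))
arr2["Patient1", "A", "Y"]   # access by name: row, column, third dimension
```

Recycling happens silently here, so it is worth checking that the length of the data matches prod(dim) when that is what you intend.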
# Numeric
x <- 3
mode(x)
## [1] "numeric"
# Character
x <- "apple"
mode(x)
## [1] "character"
x <- 3.145
x + 2 # 5.145
## [1] 5.145
# FALSE, logical
x == 2
## [1] FALSE
x <- x == 2
x
## [1] FALSE
mode(x)
## [1] "logical"
## [1] "numeric"
## [1] "numeric"
# Character
x <- LETTERS[1:5]
mode(x)
## [1] "character"
# logical
x <- x == "D"
mode(x)
## [1] "logical"
Quick Exercise
Repeat above, and find the length and class of x in each case.
attributes(x)
## $dim
## [1] 2 5
##
In summary
Object      Modes                                                            Allow >1 Modes*
vector      numeric, character, complex or logical                           No
matrix      numeric, character, complex or logical                           No
list        numeric, character, complex, logical, function, expression, ...  Yes
data frame  numeric, character, complex or logical                           Yes
factor      numeric or character                                             No
array       numeric, character, complex or logical                           No
*Whether the object allows elements of different modes. For example, all elements in a vector or array have
to be of the same mode, whereas a list can contain any type of object, including a list.
• Create vectors, matrices and data frames using seq, rep, rbind and cbind
# Vector
x.vec <- seq(1, 7, by = 2)
# The function seq is very useful, have a look at the help on seq
# (hint ?seq)
xMat <- cbind(x.vec, rnorm(4), rep(5, 4))
yMat <- rbind(1:3, rep(1, 3))
z.mat <- rbind(xMat, yMat)
# Data frame
x.df <- as.data.frame(xMat)
names(x.df) <- c("ind", "random", "score")
• Accessing elements
NOTE Use square brackets to access elements. The number of indices within the square brackets
must equal the number of dimensions of the object.
1. vector [1]
2. matrix [1,1]
3. array with 3 dimensions [1,1,1]
## a
## 1
## a
## 1
## [1] 5
##
## a -0.1267 5
## b -1.2919 5
## c 0.8820 5
## d -0.6514 5
# or
xMat[, -c(1)]
##
## a -0.1267 5
## b -1.2919 5
## c 0.8820 5
## d -0.6514 5
Here -1 means everything except for the first column.
xMat[xMat[, 1] > 3, ]
## x.vec
## c 5 0.8820 5
## d 7 -0.6514 5
If the object has class data.frame or list, you can use the dollar symbol $ to access elements by name.
For a data.frame, $ accesses columns only.
x.df$ind
## [1] 1 3 5 7
x.df[, 1]
## [1] 1 3 5 7
names(newList1)
newList1$a
## [1] 20
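The indexing rules above can be tried out with the sketch below. Fixed values are used in place of rnorm(4) so that the result is the same every time; that substitution and the column names random and score are assumptions.

```r
x.vec <- seq(1, 7, by = 2)
# Fixed numbers stand in for rnorm(4) so the example is reproducible
xMat <- cbind(x.vec, random = c(0.5, -1.2, 0.9, -0.6), score = rep(5, 4))
rownames(xMat) <- letters[1:4]

x.vec[1]                 # vector: one index
xMat[1, 1]               # matrix: row index, column index
xMat["c", ]              # a whole row, selected by name
xMat[, -1]               # everything except the first column
xMat[xMat[, 1] > 3, ]    # rows where the first column is greater than 3

x.df <- as.data.frame(xMat)
x.df$score               # $ selects a data.frame column by name
```

Leaving an index empty, as in xMat["c", ], means "take everything" along that dimension.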
# Change the element of 'xMat' in the third row and first column to
# '6'
xMat[3, 1] <- 6
# Replace the second column of 'z.mat' by 0's
z.mat[, 2] <- 0
2.5.1 Sorting and Ordering items
Frequently we need to re-order the rows/columns of a matrix, or see the rank order or a sorted set of the
elements of a vector.
The functions sort and order are designed to be applied to vectors. sort returns a sorted vector. order
returns an index which can be used to sort a vector or matrix.
# Simplest 'sort'
z.vec <- c(5, 3, 8, 2, 3.2)
sort(z.vec)
order(z.vec)
## [1] 4 2 5 1 3
Sorting the rows of a matrix. We will use an example dataset in R called ChickWeight. First have a look
at the ChickWeight documentation (help).
Let's take a subset of the data, say the first 36 rows.
# ?ChickWeight
ChickWeight[1:2, ]
## by just weight
chickOrd <- chick.short[order(chick.short$weight), ]
chickOrd[1:5, ]
## weight Time Chick Diet
## 13 40 0 2 1
## 1 42 0 1 1
## 25 43 0 3 1
## 26 39 2 3 1
## 14 49 2 2 1
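Ordering by one column and by two columns can be sketched as follows. The definition of chick.short is an assumption based on the text ("the first 36 rows"), and chickOrd2 is an illustrative name.

```r
chick.short <- ChickWeight[1:36, ]   # assumed definition: the first 36 rows

# Order the rows by weight only
chickOrd <- chick.short[order(chick.short$weight), ]
chickOrd[1:5, ]

# Order by Time first, then by weight within each Time
chickOrd2 <- chick.short[order(chick.short$Time, chick.short$weight), ]
chickOrd2[1:5, ]
```

order() accepts several vectors; later ones are only used to break ties in the earlier ones. Add decreasing = TRUE to reverse the order.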
2.5.2 Missing Values
Missing values are assigned the special value NA
## [1] 1 2 3 NA
print(z)
## [1] 1 2 3 NA
x <- z[!is.na(z)]
print(x)
## [1] 1 2 3
Check to see if a vector has all, any or a certain number of missing values. These create logical vectors
which can be used to filter a matrix or data.frame
all(is.na(z))
## [1] FALSE
any(is.na(z))
## [1] TRUE
sum(is.na(z))
## [1] 1
sum(is.na(z)) > 1
## [1] FALSE
x1 <- numeric()
x2 <- numeric(5)
x1.mat <- matrix(0, nrow = 10, ncol = 3)
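A common reason to create an empty vector or matrix is to fill it in later, for example inside a loop. The sketch below shows one way to do this; the sizes and values are arbitrary choices for illustration.

```r
x2 <- numeric(5)                  # a vector of five zeros
for (i in seq_along(x2)) {
  x2[i] <- i^2                    # fill each element in turn
}
x2

# Using NA instead of 0 makes unfilled cells easy to spot
x1.mat <- matrix(NA, nrow = 3, ncol = 2)
x1.mat[, 1] <- 1:3                # fill the first column only
sum(is.na(x1.mat))                # number of cells still empty
```

Creating the object at its final size first is usually faster than growing it one element at a time.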
2.6 Quick recap
• R Environment, interface, R help and R-project.org and Bioconductor.org website
• R Objects
Object      Modes                                                            Allow >1 Modes*
vector      numeric, character, complex or logical                           No
matrix      numeric, character, complex or logical                           No
list        numeric, character, complex, logical, function, expression, ...  Yes
data frame  numeric, character, complex or logical                           Yes
factor      numeric or character                                             No
array       numeric, character, complex or logical                           No
*Whether the object allows elements of different modes. For example, all elements in a vector or array have
to be of the same mode, whereas a list can contain any type of object, including a list.
There are other object types, including ts (time series), date-time, etc. See the R manual for more
information. All R objects have the attributes mode and length.
2.7 Exercise 1
For this exercise we will work on data from a study which examined the weight, height and age of women.
Data from the women study is available as an R dataset, and information about this study can be found
using R help (hint: ?women). We will read the data directly from the website URL
http://bcb.dfci.harvard.edu/~aedin/courses/Bioconductor/Women.txt into the object women.
Basic tools for reading and writing data are, respectively, read.table and write.table. We will go into
further detail about each later today, but first let's read in this file by typing these commands:
myURL <-
"http://bcb.dfci.harvard.edu/~aedin/courses/Bioconductor/Women.txt"
women <- read.table(myURL, sep = "\t", header = TRUE)
3. How many rows and columns are in the data? (hint: try using the functions str, dim, nrow and ncol)
4. Use summary() to view the mean height and weight of women.
7. What is the average height of women who weigh between 124 and 150 pounds (hint: need to select
the data, and find the mean).
Chapter 3
So far, we have only analyzed data that were already stored in R. Basic tools for reading and writing data are,
respectively, read.table and write.table. We will go into further detail about each. First we will talk about
reading in simple text documents (comma- and tab-delimited text files), then Excel and import/export
from other statistical software.
Figure 3.1: The RStudio data import dialog, which provides an easy way to read a text file from a local directory
or directly from a web URL
Enter a file location (either local or on the web), and RStudio will make a "best guess" at the file format.
There are a limited number of options, such as heading (yes or no) and separator (comma, space or tab), but these
should cover the most common data exchange formats. (The R interfaces RCommander and RExcel also
provide rich support for importing many different file formats into R.)
Figure 3.2: The top panel shows the plain text of the file, and the lower panel displays how R is interpreting
the data. Black rows are the column headings.
Women<-read.table("Women.txt")
In order to read files that are tab or comma delimited, the defaults must be changed. We also need to
specify that the table has a header row
# Tab Delimited
Women <- read.table("Women.txt", sep = "\t", header = TRUE)
Women[1:2, ]
summary(Women)
class(Women$age)
## [1] "integer"
Note that in R versions before 4.0.0, character vectors (strings) are read in as factors by default. To turn this
off, use the parameter as.is=TRUE. (Since R 4.0.0 the default is stringsAsFactors=FALSE, so strings stay as
character vectors.)
2. Important options:
header=TRUE     should be set to TRUE if your file contains the column names
as.is=TRUE      otherwise the character columns will be read as factors (R versions before 4.0.0)
sep=""          field separator character (often comma "," or tab "\t", e.g. sep=",")
na.strings      a vector of strings which are to be interpreted as NA values
row.names       the column which contains the row names
comment.char    by default, this is the pound (#) symbol; use "" to turn off interpretation of commented text
Note that the defaults for read.table(), read.csv() and read.delim() are different. For example, with read.table()
we must specify header=TRUE when the first line holds the column headings, whereas read.csv() and
read.delim() assume a header row by default.
3. read.csv() is a derivative of read.table(): it calls read.table() with defaults suited to a comma-separated
file (for example header=TRUE and sep=","):
# Comma Delimited
Women2 <- read.csv("Women.csv")
Women2[1:2, ]
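To make the relationship between read.csv() and read.table() concrete, here is a small sketch that reads the same made-up comma-separated text both ways. The data and object names are invented for illustration; the argument values passed to read.table() are the documented defaults of read.csv().

```r
# Made-up comma-separated data, supplied through the text= argument
csvText <- "height,weight,age\n58,115,33\n59,117,34"

viaCsv <- read.csv(text = csvText)

# The same call spelled out with read.table() and the read.csv() defaults
viaTable <- read.table(text = csvText, header = TRUE, sep = ",",
                       quote = "\"", dec = ".", fill = TRUE,
                       comment.char = "")

identical(viaCsv, viaTable)   # both calls give the same data.frame
```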
4. Reading directly from a website. You can read a file directly from the web:
myURL <-
"http://bcb.dfci.harvard.edu/~aedin/courses/Bioconductor/Women.txt"
read.table(myURL, header = TRUE)[1:2, ]
3.3 Exercise 2
The ToothGrowth data are from a study which examined the growth of teeth in guinea pigs (n=10) in response
to three dose levels of Vitamin C (0.5, 1, and 2 mg), which was administered using two delivery
methods (orange juice or ascorbic acid). Data from the Tooth Growth study is available as an R dataset and
information about this study can be found by using R help (hint: ?ToothGrowth).
1. Download the data set "ToothGrowth.xls", which is available on the course website. Save it in your
local directory. Open this file in Excel.
2. Export the data as both comma- and tab-delimited text files. In Excel select File -> Save As and
Tab: select the format Text (Tab delimited) (*.txt).
CSV: select the format CSV (Comma delimited) (*.csv).
3.3.1 Importing text files Using scan()
NOTE: read.table() is not the right tool for reading large matrices, especially those with many columns. It
is designed to read 'data frames', which may have columns of very different classes. Use scan() instead.
scan() is an older data-reading facility. It is not as flexible or as user-friendly as read.table(),
but it is useful for Monte Carlo simulations, for instance. scan() reads data into a vector or a list from a file.
Note that by default scan() expects numeric data. If the data contains text, supply a character example such as
what="text" or what="some text" (scan() uses only the type of what, not its value).
Other useful parameters in scan() are nlines (the maximum number of lines to be read), n (the maximum
number of data values to be read) and nmax (the maximum number of data values or, when what is a list, of
records to be read).
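A minimal sketch of these scan() options. To keep it self-contained it reads from made-up strings through the text= argument rather than from a file; the values are illustrative only.

```r
# Numeric data is the default
numsDemo <- scan(text = "1 2 3 4 5", quiet = TRUE)

# Text needs a character example for 'what'
wordsDemo <- scan(text = "a b c d", what = "text", quiet = TRUE)

# n limits the number of items read
firstTwo <- scan(text = "10 20 30 40", n = 2, quiet = TRUE)

numsDemo
wordsDemo
firstTwo
```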
3.4 Reading data from Excel into R
There are several packages and functions for reading Excel data into R; however, I normally export data as a
.csv file and use read.table().
However, if you wish to load Excel data directly, here are some of the options available to you. See
the section on "Importing-from-other-statistical-systems" in the webpage
http://cran.r-project.org/doc/manuals/R-data.html for more information.
library(xlsx)
ww <- read.xlsx(file = "Women.xlsx", sheetIndex = 1)
read.xlsx accepts .xls and .xlsx format. You must include a worksheet name or number. It is optional
to specify a row or column index to indicate a section of a Worksheet
require(XLConnect)
Or you can read direct from a connection, calling the file directly.
3. RExcel. R can be run from within Excel on Windows using RExcel (http://rcom.univie.ac.at/).
This adds a menu to Excel that allows you to call R functions from within Excel. RExcel is part
of the much larger Statconn project.
4. RODBC library. We are not sure whether it works with .xlsx files. See the vignette for more information.
library(RODBC)
RShowDoc("RODBC", package = "RODBC")
The following RODBC function works under Windows, but may have issues under MacOS or Linux, as
you may need to install ODBC drivers.
channel<-odbcConnectExcel("ToothGrowth.xls")
#list the spreadsheets
sqlTables(channel)
#retrieve the contents of the Excel Sheet ToothGrowth using either of the following
ToothGrowth<-sqlFetch(channel, "ToothGrowth")
ToothGrowth<-sqlQuery(channel, "select * from [ToothGrowth$]")
ToothGrowth[1:2,]
In R
library(Hmisc)
mydata <- sasxport.get("c:/mydata.xpt")
• SAScii. Anthony Joseph Damico recently announced SAScii, a new package that parses SAS input
code for use with read.fwf. However, although the code below is stated to work, I have not had
much success with it myself.
require("SAScii")
# Load the 2010 National Health Interview Survey Persons file as an
# R data frame from CDC
NHIS10_personsx_SASInst <-
"ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Program_Code/NHIS/2010/PERSONSX.sas"
NHIS10_personsx_SASInst <-
"ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHIS/2010/personsx.zip"
• Recently Xin Wei of Roche Pharmaceuticals published a SAS macro called Proc_R that may potentially
ease integrating R and SAS by allowing you to put R code within a SAS macro (reference: Xin Wei, PROC_R:
A SAS Macro that Enables Native R Programming in the Base SAS Environment, J. Stat Software, Vol. 46,
Code Snippet 2, Jan 2012).
%include "C:\aedin\sasmacros\Proc_R.sas";
%Proc_R (SAS2R =, R2SAS =);
Cards4;
******************************
***Please Enter R Code Here***
******************************
;;;;
%Quit;
3.5.2 SPSS
From SPSS, save the SPSS dataset in transport format
get file=c:\mydata.sav .
export outfile=c:\mydata.por .
In R
library(Hmisc)
mydata <- spss.get("c:/mydata.por", use.value.labels = TRUE)
3.5.3 Stata or Systat
library(foreign)
mydata <- read.dta("c:/mydata.dta")
# --------------------------------------------------------- mySQL
# ---------------------------------------------------------
library(RMySQL)
drv <- dbDriver("MySQL")
# --------------------------------------------------------- Oracle
# ---------------------------------------------------------
library(ROracle)
# create an Oracle instance and create one connection.
drv <- dbDriver("Oracle")
con <- dbConnect(drv, "username/password")
rs <- dbSendQuery(con, "select * from USER_TABLES")
rel <- fetch(rs)
dbGetInfo(con)
3.7 Writing Data table from R
1. Function sink() diverts the output from the console to an external file
2. Writing a data matrix or data.frame using the write.table() function. write.table() has similar arguments
to read.table().
3. Important options
4. Output to a webpage
The package R2HTML will output R objects to a webpage
##
## *** Output redirected to directory: C:/Users/aedin/Dropbox/Talks/CDC/Notes
## *** Use HTMLStop() to end redirection.
## [1] TRUE
print("Capturing Output")
df1[1:2, ]
summary(df1)
HTMLStop()
## [1] "C:/Users/aedin/Dropbox/Talks/CDC/Notes/Web_Results_main.html"
can be used.
## [1] FALSE
It is better to build a path using file.path() rather than paste(), as file.path() joins the components with the
file separator for the platform in use (.Platform$file.sep; this is / on Unix-alikes and also on Windows, where R
accepts forward slashes in paths).
Use file.exists() to test if a file can be found. This is very useful. For example, use this to test if a file
exists and, if TRUE, read the file; or you could ask R to warn or stop a script if the file does not exist.
if (!file.exists(myfile)) {
print(paste(myfile, "cannot be found"))
} else {
Women <- read.table(myfile, sep = "\t", header = TRUE)
Women[1:2, ]
}
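The same pattern can be tried without any course files. The sketch below builds a path with file.path(), writes a tiny made-up table to R's temporary directory, and stops with an error if the file is missing. The file name and contents are hypothetical.

```r
# Build a path inside the session's temporary directory (hypothetical file name)
demoFile <- file.path(tempdir(), "Women_demo.txt")

# Write a two-column, tab-delimited file so there is something to read
writeLines(c("height\tweight", "58\t115"), demoFile)

# Stop the script with an informative message if the file cannot be found
if (!file.exists(demoFile)) {
  stop(demoFile, " cannot be found")
}
WomenDemo <- read.table(demoFile, sep = "\t", header = TRUE)

basename(demoFile)   # the file name without the directory
dirname(demoFile)    # the directory without the file name
```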
3.8 Exercise 3
1. Use read.table() to read the space separated text file WomenStats.txt directly from the website
"http://bcb.dfci.harvard.edu/~aedin/courses/R/WomenStats.txt". Call this data.frame
women.
2. Change the rownames to be the letters of the alphabet, e.g. "A", "B", "C", "D" etc.
4. Read this into R using read.table(). What parameters need modifying to read the data as a tab-delimited
file?
3.9 Sampling and Creating simulated data
1. seq and rep. We have already seen the functions seq and rep, which generate a sequence or repeat
elements.
rnorm(5, 0, 1)
rnorm(10, 6, 2)
## [1] 7.530 7.741 5.947 4.622 8.804 8.149 10.949 3.339 6.213
## [10] 6.554
For most of the classical distributions, these simple functions provide cumulative distribution functions
(p), density functions (d), quantile functions (q), and random number generation (r). Beyond this
basic functionality, many CRAN packages provide additional useful distributions. In particular,
multivariate distributions as well as copulas are available in contributed packages. See
http://cran.r-project.org/web/views/Distributions.html and
http://cran.r-project.org/doc/manuals/R-intro.html#Probability-distributions for more information.
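As a brief illustration of the d/p/q/r naming scheme, the sketch below uses the normal and binomial distributions from base R. The particular values are chosen only for illustration.

```r
dnorm(0)        # density of the standard normal at 0, equal to 1/sqrt(2*pi)
pnorm(1.96)     # cumulative probability P(X <= 1.96), roughly 0.975
qnorm(0.975)    # quantile: the inverse of pnorm, roughly 1.96

# p and q functions are inverses of one another
roundTrip <- pnorm(qnorm(0.3))

# The same prefixes work for other distributions, e.g. the binomial
pTwoHeads <- dbinom(2, size = 5, prob = 0.5)   # P(exactly 2 heads in 5 tosses)

set.seed(1)                # make the random draw reproducible
draws <- rnorm(3)          # three random values from N(0, 1)
```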
The function sample() will resample a given data set with or without replacement
sample(1:10)
## [1] 3 8 9 7 2 6 4 10 1 5
## [1] 1 1 8 6 4 8 1 6 10 10
You can also add weights to bias the probability of selecting certain elements. For example, draw a
bootstrap sample from the same sequence (1:10) with probabilities that favor the numbers 1-5:
## [1] 0.25 0.25 0.25 0.25 0.25 0.05 0.05 0.05 0.05 0.05
## [1] 5 2 10 5 1 3 5 4 9 2
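The calls behind output like the above are not shown, so here is a sketch of the three uses of sample(): a permutation, a bootstrap sample, and a weighted bootstrap sample. The weight vector matches the one printed above; the seed and object names are assumptions, so the numbers drawn will differ from those shown.

```r
set.seed(42)   # arbitrary seed, only for reproducibility

# Without replacement: a random permutation of 1:10
permDemo <- sample(1:10)

# With replacement: a bootstrap sample, where values may repeat
bootDemo <- sample(1:10, replace = TRUE)

# Weights favoring 1-5; sample() rescales them, so they need not sum to 1
wts <- rep(c(0.25, 0.05), each = 5)
biasedDemo <- sample(1:10, replace = TRUE, prob = wts)
```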
3.10 Exercise 4
1. Create the vector which contains the first 20 letters of the alphabet and the sequence of numbers 0:200
in increments of 10 (hint: use seq()).
3. Use the function cat() to write this vector to a file called "myVec.txt".
4. Use scan() to read the first 10 items in the file. What value do you give to the parameter what?
Compare running scan() with different data types, e.g. what="text", what=123 and what=TRUE.
Chapter 4
expression is an R expression using the arguments arg1, arg2 to calculate a value. The function returns the
value of the expression.
3. Call to a function within R
## [1] 19.72
myMean(testVec)
## [1] 19.72
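The definition of myMean is not shown above. As a hedged sketch of what defining and calling a function looks like, here is a hypothetical mean function (myMeanSketch, not the course's own code) applied to a made-up vector.

```r
# Hypothetical function: arguments in the parentheses, expression in the braces
myMeanSketch <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]      # optionally drop missing values first
  }
  sum(x) / length(x)       # the value of the last expression is returned
}

testVecSketch <- c(2, 4, 6, 8)   # made-up data
myMeanSketch(testVecSketch)      # 5
```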
5. A more complex example
Example of a function twosam: it takes as arguments two vectors y1 and y2, calculates the 2-sample
t-test statistic (assuming equal variance), and returns the t-statistic.
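The body of twosam is not reproduced here. The following is a sketch of one way to write it from the pooled-variance formula, not necessarily the course's own code; it is checked against t.test() with var.equal=TRUE on simulated data.

```r
twosam <- function(y1, y2) {
  n1 <- length(y1)
  n2 <- length(y2)
  # Pooled variance under the equal-variance assumption
  s2 <- ((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / (n1 + n2 - 2)
  # Difference in means divided by its standard error
  (mean(y1) - mean(y2)) / sqrt(s2 * (1 / n1 + 1 / n2))
}

set.seed(1)                          # simulated example data
grpA <- rnorm(10)
grpB <- rnorm(12, mean = 1)
tstat <- twosam(grpA, grpB)

# Agrees with the statistic reported by the built-in equal-variance t-test
all.equal(tstat, unname(t.test(grpA, grpB, var.equal = TRUE)$statistic))
```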
x <- 9
if (x > 0) sqrt(x) else sqrt(-x)
## [1] 3
Vectorized version of the if/else construct: ifelse(condition, expr1, expr2) function which returns a
vector with elements expr1 if condition is true, otherwise it returns expr2.
## [1] 3
## [1] 1.043
spread(samp, "IQR")
## [1] 0.9385
Why IQR(x)/1.349? In a normal distribution, 50% of the data lie between the 0.25 and 0.75 quantiles. The distance between
these two quantiles is IQR(x) = quantile(x, 0.75) − quantile(x, 0.25). For a standard normal distribution the IQR is
qnorm(0.75) − qnorm(0.25) ≈ 1.349. Therefore IQR/1.349 is an estimator of the standard deviation of a normal distribution.
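The constant 1.349 and the resulting estimator can be checked numerically. This sketch uses simulated normal data with a standard deviation of 2 (an arbitrary choice) and compares sd() with IQR()/1.349.

```r
# Width of the interquartile range of a standard normal distribution
iqrNormal <- qnorm(0.75) - qnorm(0.25)
round(iqrNormal, 3)              # 1.349

set.seed(123)                    # simulated data, true sd = 2
simData <- rnorm(10000, mean = 0, sd = 2)

sdClassic <- sd(simData)                 # usual estimator
sdRobust  <- IQR(simData) / iqrNormal    # IQR-based estimator

c(sdClassic, sdRobust)           # both should be close to 2
```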
where i is the loop variable, expr1 is usually a sequence of numbers, and expr2 is an expression.
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
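The loop that produced the squares above is not shown. A for() loop consistent with that output could look like the sketch below; the second loop shows the common pattern of filling a pre-allocated vector instead of printing.

```r
# i takes each value of the sequence 1:5 in turn
for (i in 1:5) {
  print(i^2)
}

# Storing results: allocate the vector first, then fill it by index
squares <- numeric(5)
for (i in seq_along(squares)) {
  squares[i] <- i^2
}
squares
```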
x <- 1
y <- 16
while (x^2 < y) {
cat(x, "squared is ", x^2, "\n") # print x and sq(x)
x <- x + 1
}
## 1 squared is 1
## 2 squared is 4
## 3 squared is 9
A word of caution, it is easy to write a while() loop that doesn’t terminate, in which case your
script may go into a never-ending cycle. Therefore if possible, write a for() loop in preference
to a while() loop.
4.3 Viewing Code of functions from R packages
It's often useful to view the code of R functions. To see the code, type the name of the function without
parentheses. Take a closer look at the built-in function IQR. We see it simply takes the difference
(diff()) of the 25th and 75th percentiles.
help(IQR)
IQR
args(IQR)
body(IQR)
IQR(xx)
## [1] 8
(Sometimes, functions don’t appear to be "visible". In this case, use methods or getAnywhere to view the
function. This is a bit more complex for S4 functions.)
mean
methods(mean)
mean.default
Some functions are "non-visible", which means you cannot see the code by typing the function name. These are
hidden in the package namespace. When you use the function methods, non-visible functions are marked by an asterisk.
`?`(t.test)
t.test
methods(t.test)
To view a hidden or non-visible function use "PackageName:::function"
stats:::t.test.default
To reduce the output and save paper in the manual, we will just view the first 5 and last 10 lines of the
function.
head((stats:::t.test.default), 5)
##
## 1 function (x, y = NULL, alternative = c("two.sided", "less", "greater"),
## 2 mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95,
## 3 ...)
## 4 {
## 5 alternative <- match.arg(alternative)
print("truncated...")
## [1] "truncated..."
tail((stats:::t.test.default), 10)
##
## 102 names(mu) <- if (paired || !is.null(y))
## 103 "difference in means"
## 104 else "mean"
## 105 attr(cint, "conf.level") <- conf.level
## 106 rval <- list(statistic = tstat, parameter = df, p.value = pval,
## 107 conf.int = cint, estimate = estimate, null.value = mu,
## 108 alternative = alternative, method = method, data.name = dname)
## 109 class(rval) <- "htest"
## 110 return(rval)
## 111 }
There are some functions that you will not be able to see using these commands. These are most
likely written in object-oriented R (called S4). Many of Bioconductor’s functions are written in S4.
However, a full discussion of S4 functions is beyond the scope of this course.
Iterative "for loops" in R may sometimes be memory intensive, and functions such as apply, sweep
or aggregate should be used instead.
(a) apply
apply() applies a function over the rows or columns of a matrix. The syntax is
apply(X, MARGIN, FUN, ARGs)
where X: array, matrix or data.frame; MARGIN: 1 for rows, 2 for columns, c(1,2) for both;
FUN: one or more functions; ARGs: possible arguments for function
For example, lets go back to the example dataset women which we loaded from the web.
summary(women)
colMeans(women)
apply(women, 2, mean)
## [1] TRUE
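The MARGIN and extra-argument behavior of apply() is easiest to see on a small matrix. The matrix below is made up for illustration.

```r
# 2 x 3 matrix, filled by column: row 1 is (1, 3, 5), row 2 is (2, 4, 6)
mDemo <- matrix(1:6, nrow = 2)

rowSums1 <- apply(mDemo, 1, sum)     # MARGIN = 1: over rows    -> 9 12
colMeans2 <- apply(mDemo, 2, mean)   # MARGIN = 2: over columns -> 1.5 3.5 5.5

# Additional arguments after FUN are passed on to the function
colMedians <- apply(mDemo, 2, quantile, probs = 0.5)

# The built-in colMeans gives the same answer as apply(..., 2, mean)
all.equal(colMeans2, colMeans(mDemo))
```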
(b) tapply
tapply() is a member of the very important apply() family of functions. It is applied to "ragged" arrays,
that is, arrays whose categories have variable lengths. The grouping is defined by a vector (usually a factor).
Example:
## over35 under35
## 5 10
## $over35
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 120 135 139 142 150 164
##
## $under35
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 115 124 130 134 145 159
##
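The code that produced the grouped summaries above is not shown. A sketch of the same idea with made-up weights and a hypothetical age-group factor:

```r
# Made-up data: weights and an age group for each woman
weightDemo <- c(120, 135, 150, 115, 124, 130)
ageGroup <- factor(c("over35", "over35", "over35",
                     "under35", "under35", "under35"))

table(ageGroup)                                   # group sizes

groupMeans <- tapply(weightDemo, ageGroup, mean)  # one value per group
groupMeans                                        # over35 135, under35 123

# When FUN returns more than one value per group, the result has list mode
groupSummaries <- tapply(weightDemo, ageGroup, summary)
groupSummaries$over35
```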
print(res1)
print(paste("Class of res1:", class(res1)))
4.5 Exercise 5
(b) Write a while loop printing the consecutive powers of 2, less than 1000
4.5.1 Efficient For Loop in R (Use apply)
Note that apply is often more computationally efficient than a for loop. But if you can use built-in functions
like rowMeans or colMeans, these are quicker still.
## user system elapsed
## 0 0 0
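A sketch of how such a comparison might be set up, using a simulated matrix (the size is an arbitrary choice). Timings depend on the machine and R version, so none are quoted here; the point is that all three approaches give the same row means.

```r
set.seed(1)
matDemo <- matrix(rnorm(1e5), nrow = 1000)    # simulated 1000 x 100 matrix

# 1. Explicit for loop over rows, filling a pre-allocated vector
loopMeans <- numeric(nrow(matDemo))
tLoop <- system.time(
  for (i in seq_len(nrow(matDemo))) loopMeans[i] <- mean(matDemo[i, ])
)

# 2. apply over rows
tApply <- system.time(applyMeans <- apply(matDemo, 1, mean))

# 3. Built-in rowMeans
tBuiltin <- system.time(builtinMeans <- rowMeans(matDemo))

# Same answers, different run times
all.equal(loopMeans, applyMeans)
all.equal(applyMeans, builtinMeans)
rbind(tLoop, tApply, tBuiltin)[, "elapsed"]
```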
4.6 Functions for parsing text
There are many functions within R for parsing text. We will cover a few here.
• To search for text within an R vector, use grep. It uses the same regular expression patterns as Perl if
you set perl=TRUE.
grep("A", LETTERS)
## [1] 1
## [1] "A"
## [1] "A" "A" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P"
## [17] "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
a <- date()
strsplit(a, " ")
## [[1]]
## [1] "Tue" "Jun" "19" "15:01:08" "2012"
##
strsplit(a, "J")
## [[1]]
## [1] "Tue " "un 19 15:01:08 2012"
##
## [1] "list"
b <- unlist(b)
class(b)
## [1] "character"
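Some of the calls in this section are not shown alongside their output. The sketch below collects the main text-parsing functions (grep, sub, strsplit, unlist) in one place, using a fixed date string copied from the output above so that the result is reproducible.

```r
grep("A", LETTERS)                  # index of the match: 1
grep("A", LETTERS, value = TRUE)    # the matching element itself: "A"

# sub() replaces the first match in each element
subbed <- sub("B", "A", LETTERS)
subbed[1:3]                         # "A" "A" "C"

# strsplit() returns a list; unlist() flattens it to a character vector
stamp <- "Tue Jun 19 15:01:08 2012"
partsList <- strsplit(stamp, " ")
parts <- unlist(partsList)
class(partsList)                    # "list"
class(parts)                        # "character"
parts[5]                            # "2012"
```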
• For special characters you need to precede them with a double backslash
a <- "aedin@jimmy.harvard.edu"
strsplit(a, "\\.")
## [[1]]
## [1] "aedin@jimmy" "harvard" "edu"
##
mean(a)
## [1] NA
## [1] 5.5
4.8 Exercise 6: Parsing Real Data - World Population Data from Wikipedia
We demonstrated how to get data tables from a URL. This means we can retrieve data from almost any table on
the web. The function readHTMLTable is very flexible for this. Please retrieve the table entitled "Estimated
world population at various dates (in millions)" (Table 13) from
http://en.wikipedia.org/wiki/World_population.
require(XML)
## [1] "NULL"
## [2] "toc"
## [3] "NULL"
## [4] "World population milestones (USCB estimates)"
## [5] "The 10 countries with the largest total population:"
## [6] "10 most densely populated countries (with population above 1 million)"
## [7] "Countries ranking highly in terms of both total population (more than 15 million p
## [8] "NULL"
## [9] "UN (medium variant 2010 revision) and US Census Bureau (December 2010) estimates[
## [10] "UN 2008 estimates and medium variant projections (in millions)[98]"
## [11] "World historical and predicted populations (in millions)[102][103]"
## [12] "World historical and predicted populations by percentage distribution [102][103]"
## [13] "Estimated world population at various dates (in millions)"
## [14] "Starting at 500 million"
## [15] "Starting at 375 million"
## [16] "NULL"
## [17] "NULL"
## [18] "NULL"
## [19] "NULL"
1. First tidy the data. The data are factors; it's easier to edit data that are character. Apply as.character
to each column.
2. Remove rows with dates before 1750. Remove the additional header in row 32.
6. In what year did the population of Europe, Africa and Asia exceed 500 million?
7. Bonus: Plot the population growth of the World, Africa or Europe since 1750. Given this plot, would
you guess that the population of the World, Africa or Europe would be more likely to double again
before the end of the 21st century?
4.8.1 Writing functions: More on arguments
We are equipped now with all basic tools we need for writing functions. We include a few tips on arguments
to functions.
1. Function arguments: Default values. In many cases arguments have default values. For example,
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) has default values for the mean, the
standard deviation, the tail used in the cdf calculation, and whether probabilities are on the original or log scale.
• The argument sequence may begin in the unnamed, positional form, and specify named arguments
after the positional arguments
• If arguments to functions are given in the form name=object form, they may be given in any
order
• Sometimes you may see the parameter ”...”, this is normally when functions call other functions
and arguments are passed from one function to another.
• If commands are stored in an external R script file, say L2.R they can be executed at any time
in R
source(paste(myPath, "L2.R", sep = ""))
• The built-in functions supplied with R are a valuable resource for learning about R programming
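The points above on default values, positional and named arguments, and "..." can be sketched with one small function. scaleSketch is a hypothetical function written for this illustration, not part of R or of the course code.

```r
# Hypothetical function: center and scale have defaults, ... is passed to round()
scaleSketch <- function(x, center = 0, scale = 1, ...) {
  z <- (x - center) / scale
  round(z, ...)
}

scaleSketch(10)                          # defaults used: 10
scaleSketch(10, 4, 2)                    # positional arguments: (10 - 4) / 2 = 3
scaleSketch(10, scale = 2, center = 4)   # named arguments in any order: 3
scaleSketch(10, 4, 7, digits = 2)        # digits is handed to round() via ...: 0.86

# Defaults of a built-in function work the same way
identical(qnorm(0.975), qnorm(0.975, mean = 0, sd = 1))
```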
4.9 Writing functions: more technical discussion - Scoping
1. Scoping
Symbols in the body of a function can be divided into three classes:
• Local variables (values are determined by the evaluation of expressions in the body of the
functions)
fn <- function(x) {
y <- 2 * x
print(x)
print(y)
print(z)
}
z <- 2
x <- 4
fn(x = 2)
## [1] 2
## [1] 4
## [1] 2
2. Lexical scope.
Example: function called cube.
## [1] 8
n <- 4
cube(2)
## [1] 8
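The definition of cube is not shown above. One definition consistent with the output, modelled on the example in the "An Introduction to R" manual and offered here only as a sketch, is:

```r
cube <- function(n) {
  # n is a free variable inside sq(); under lexical scoping it is looked up
  # in the environment where sq() was defined, i.e. inside cube()
  sq <- function() n * n
  n * sq()
}

n <- 4       # a global n does not affect the result
cube(2)      # 8, not 32
```

With lexical scoping the n used by sq() is the argument of cube(), so the global assignment n <- 4 is ignored.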
4.10 Options for Running memory or CPU intensive jobs in R
4.10.1 Distributed computing in R
There are two ways to split or distribute a big job. The simplest is to send jobs to different processors on the
same machine (assuming it has multiple cores, which most new machines do). The second option is to split
or parallelize a job across many machines or a cluster of machines. For both of these, see the
package parallel, which builds upon the older R packages snow and multicore.
The package parallel has been distributed with R itself since version 2.14.0, so it does not need to be
installed separately; just load it with library(parallel).
The package parallel has many functions which work like apply to distribute a computation. For example
use mclapply just like lapply to split a job over 4 cores.
library(parallel)
system.time(mclapply(1:4, function(i) sqrt(i), mc.cores = 4))  # sqrt() is a placeholder task
mclapply is a parallelized version of lapply. It relies on forking, so it cannot run in parallel on Windows
(there mc.cores must be 1), but on Windows you can use the functions parLapply, clusterApply and
clusterApplyLB, all in the parallel package.
The package has several functions for different types of apply loops, including parLapply, parSapply
and parApply, which are parallel versions of lapply, sapply and apply respectively.
library(parallel)
cl <- makeCluster(3)
parLapply(cl, 1:3, sqrt)
stopCluster(cl)
There are several other packages for distributed computing; see the reviews of R packages on the CRAN task
view http://cran.r-project.org/web/views/HighPerformanceComputing.html. I
have received recommendations on the R packages biglm, ff and bigmemory.
4.11 Efficient R coding
4.11.1 What is an R script
An R script is simply a text file containing R commands. There are two ways to run these commands: start R
and use the R function source, or use R CMD BATCH at the command line.
#####################
### Author: Mr Bob Parr
### Date: 2011-01-20
### Version: 1.0
### License: GPL (>= 3)
###
### Description: survival analysis function
#####################
You can save this script in a file named censortime.R in your working directory. If you want to
define this function in your workspace, just type source("censortime.R").
Of course, an R script may contain more than functions, it may also contain any analytical pipeline.
Here is another example:
#####################
### Author: Mr Bob Parr
### Date: 2011-01-20
### Version: 1.0
### License: GPL (>= 3)
###
### Description: Script fitting a Cox model on the colon data
### and writing the coefficients in a txt file
#####################
## load library
library(survival)
Save this script in a file named coxColon.R in your working directory. You can run it from your R
session using the command source("coxColon.R"), or you can run it in batch mode from a command line
(e.g., shell console) using the command R CMD BATCH coxColon.R.
1. Indentation
• No lines longer than 80 characters. No linking long lines of code using ";"
2. Variable Names
3. Function Names
• Use camelCaps: initial lower case, then alternate case between words.
4. Use of space
• Always use space after a comma. This: a, b, c. Not: a,b,c.
• No space around "=" when using named arguments to functions. This: somefunc(a=1, b=2),
not: somefunc(a = 1, b = 2).
5. Comments
6. Misc
8. R packages which tidy your code. There is a package called formatR
(https://github.com/yihui/formatR/wiki/) which will format all R scripts in a folder, indenting loops,
converting = to <- etc. See its wiki pages above if you are interested in testing it.
• cat() or print() are used only when displaying an object to the user, e.g., in a show method.
4.11.6 system.time
If you wish to check the efficiency of your code to see how long it is taking to run, use the function system.time,
which gives the compute time for a function.
## user system elapsed
## 0.18 0.03 0.22
system.time(rowMeans(df))
system.time()
• Include information about your operating system and version of R. The easiest way to do this is
using sessionInfo(). For example, see this recent post on the mailing list:
https://stat.ethz.ch/pipermail/r-sig-mixed-models/2010q3/004467.html
Writing R packages
Once you have written all your functions in one or several R files, you can use the function package.skeleton
to generate the necessary directories and empty help pages for your package.
package.skeleton(name = "myFirstRPackage")
Hint: all the packages on CRAN and Bioconductor are open source, so you can easily download the
source of any package to take a closer look at it. It may be extremely insightful to see how experienced R
developers implemented their own packages.
Using SVN
RStudio v0.96 contains an easy interface to version control (either Git or SVN), but here is a detailed guide to
using SVN:
http://tortoisesvn.net/docs/release/TortoiseSVN_en/tsvn-repository.html#tsvn-repository-create-tortoisesvn
Step 1. Create local SVN repository
3. Right-click on the newly created folder and select TortoiseSVN Create Repository here....
4. A repository is then created inside the new folder. Don’t edit those files yourself!!!. If you get any
errors make sure that the folder is empty and not write protected.
5. For Local Access to the Repository you now just need the path to that folder. Just remember that
Subversion expects all repository paths in the form file:///C:/SVNRepository/. Note the use of forward
slashes throughout.
6. So far this is an empty repository, even though Subversion has created several directories and files!
We need to fill it with our project files and connect it with our working project directory
1. Somewhere in your hard drive create a directory (e.g. tmp) with the following three subdirectories:
C:\tmp\new\branches
C:\tmp\new\tags
C:\tmp\new\trunk
2. Back up and tidy your existing scripts and project files (C:\Projects\MyProject), i.e. delete
unnecessary files.
4. Import the ’new’ directory into the repository (Right-click/TortoiseSVN/Import). Select URL as
file:///C:/SVNRepository/Myproject (forward slashes!)
5. To see that it works, right-click and start TortoiseSVN/Repo-browser... to see your imported
files. Happy days. Now you have an SVN repository with all your files.
1. Now that we have created the SVN repository, the trick is to use it! Start by checking out your data. Create a
new scripts directory (or go back to your old one and delete its contents), then right-click and select
"SVN Checkout".
2. To send (check in) your changes to the repository: right-click on the selected files, then
"SVN Commit".
3. To add new files to the repository. This is a two-step process: first right-click on the selected files, then
"TortoiseSVN/Add". Then right-click on the selected files, then "SVN Commit".
4. If you wish to delete files (remember the SVN will always have a history of them) use
"TortoiseSVN/Delete".
5. Happy Subversioning!
Chapter 5
Introduction to graphics in R
To start let’s look at the basic plots that can be produced in R using the demo() function
demo(graphics)
On start up, R initiates a graphics device driver which opens a special graphics window for the display
of interactive graphics. If a new graphics window needs to be opened either win.graph() or windows()
command can be issued.
Once the device driver is running, R plotting commands can be used to produce a variety of graphical
displays and to create entirely new kinds of display.
• Plot of Vector(s)
x <- 1:10
plot(x)
[Figure: output of plot(x), with x plotted against Index]
set.seed(13)
x <- -30:30
y <- 3 * x + 2 + rnorm(length(x), sd = 20)
plot(x, y)
[Figure: scatterplot of y against x]
• Plot of data.frame elements. If the first argument to plot() is a data.frame with 2 columns (variables),
the result is as simple as plot(x, y) on those two columns.
Let's look at the data in the data.frame airquality, which contains 6 variables of air quality measurements
made daily in New York between May and September 1973. In total there are 153 observations (days).
airquality[1:2, ]
[Figure: scatterplot matrix of the airquality variables Ozone, Solar.R, Wind, Temp, Month and Day]
Note most plotting commands always start a new plot, erasing the current plot if necessary. We’ll
discuss how to change the layout of plots so you can put multiple plots on the same page a bit later.
But a simple way to put multiple plots in the same window is by splitting the display using mfrow.
Note if you give plot a factor and a vector, plot(factor, vector) or plot(vector ~ factor), it will produce a
boxplot.
plot(Ozone, Temp)
[Figure: left panel, scatterplot of Temp against Ozone; right panel, boxplot of Ozone by month]
plot(airquality$Ozone~factor(airquality$Month))
detach(airquality)
5.2 Exercise 7
Using the ToothGrowth data we read earlier. Please draw the following plot
[Figure: two boxplots of Tooth Length; left panel by Treatment (OJ, VC), right panel by Treatment and Dose
(OJ 0.5, OJ 1, OJ 2, VC 0.5, VC 1, VC 2)]
5.2.1 Arguments to plot
axes=FALSE Suppresses generation of axes; useful for adding your own custom axes with the axis()
function. The default, axes=TRUE, means include axes.
type= The type= argument controls the type of plot produced, as follows:
type="h" Plot vertical lines from points to the zero axis (high-density)
type="n" No plotting at all. However axes are still drawn (by default) and the coordinate system is set up
according to the data. Ideal for creating plots with subsequent low-level graphics functions.
xlab=string
ylab=string Axis labels for the x and y axes. Use these arguments to change the default labels, usually the
names of the objects used in the call to the high-level plotting function.
main=string Figure title, placed at the top of the plot in a large font.
xp <- 1:100/100
yp <- 3 * xp^2 - 2 * xp + rnorm(100, sd = 0.2)
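The plotting calls for the different values of type= are not shown. A sketch that draws the same data with four plot types in a 2 x 2 layout is given below; the seed, loop and panel titles are assumptions made for this illustration.

```r
set.seed(1)                                   # arbitrary seed for reproducibility
xp <- 1:100 / 100
yp <- 3 * xp^2 - 2 * xp + rnorm(100, sd = 0.2)

oldPar <- par(mfrow = c(2, 2))                # split the display into 2 x 2 panels
for (tp in c("l", "b", "o", "h")) {
  # l = lines, b = both points and lines, o = overplotted, h = vertical lines
  plot(xp, yp, type = tp, main = paste("Plot type:", tp))
}
par(oldPar)                                   # restore the previous layout
```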
[Figure: yp plotted against xp in four panels titled "Plot type: l", "Plot type: b", "Plot type: o" and
"Plot type: h", followed by two further panels, one of them titled "R simple plot" (values against index)]
boxplot(airquality)
[Figure: boxplots of each column of airquality]
Note if you give plot a factor and a vector, plot(factor, vector) or plot(vector ~ factor), it will produce a
boxplot.
Equivalent plots (each draws boxplots of ozone by month):
boxplot(airquality$Ozone~airquality$Month)
plot(factor(airquality$Month), airquality$Ozone)
plot(airquality$Ozone~factor(airquality$Month))
[Figure: three equivalent boxplots of ozone by month (months 5 to 9)]
• barplot Plot a bar plot of the mean ozone quality by month. First use tapply to calculate the mean of
ozone by month
[Figure: barplot titled "Mean Ozone by month", months 5 to 9]
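The code for this barplot is not shown. A sketch using the built-in airquality data follows; na.rm=TRUE is an assumption needed because Ozone contains missing values, and the axis labels are illustrative.

```r
# Mean ozone for each month; na.rm = TRUE is passed through tapply to mean()
ozoneByMonth <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
ozoneByMonth

# One bar per month, labelled with the month number
barplot(ozoneByMonth, main = "Mean Ozone by month",
        xlab = "Month", ylab = "Mean Ozone")
```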
• pie chart
• hist(x)- histogram of a numeric vector x with a few important optional arguments: nclass= for the
number of classes, and breaks= for the breakpoints
xt <- rt(100, 3)
hist(xt)
[Figure: Histogram of xt (Frequency against xt)]
plot(density(xt))
[Figure: density.default(x = xt); N = 100, Bandwidth = 0.4973]
• 3D scatterplot
require(scatterplot3d)
data(trees)
trees[1:2, ]
[Figure: Example of scatterplot3d plot: Tree Data (axes Girth, Height and Volume)]
detach(trees)
• Rvenn - draw a venn diagram. Input is a list. It will draw a venn diagram showing the intersect
between 2-6 vectors in a list.
require(gplots)
sample(LETTERS, 10)
## [1] "Z" "R" "M" "U" "Q" "E" "L" "S" "J" "D"
## $Lucy
## [1] "Q" "P" "Z" "M" "J" "T" "F" "O" "A" "D"
##
## $Sally
## [1] "B" "Z" "V" "K" "D" "F" "Y" "Q" "M" "S"
##
## $Kate
## [1] "Y" "P" "H" "N" "Q" "J" "V" "I" "O" "G"
##
venn(tt)
[Figure: Venn diagram of the sets Lucy, Sally and Kate]
Plot 4 intersections
[Figure: Venn diagram of List 1, List 2, List 3 and List 4]
Color plots
require(venneuler)
plot(venneuler(xx))
[Figure: venneuler plot of the lists]
[Figure: Venn diagram of five sets of letters]
5.3 Editing the default plot with low-level plotting commands
Sometimes the standard plot functions don't produce exactly the kind of plot you desire. In this case,
low-level plotting commands can be used to add extra information (such as points, lines or text) to the
current plot. Some of the more useful low-level plotting functions are:
points(x, y), lines(x, y) Add points or connected lines to the current plot.
text(x, y, labels, ...) Adds text to a plot at the points given by x, y. Normally labels is an integer or character
vector, in which case labels[i] is plotted at point (x[i], y[i]). The default is 1:length(x). Note: this
function is often used in the sequence
plot(x, y, type = "n"); text(x, y, names)
The graphics parameter type = "n" suppresses the points but sets up the axes, and the text() function
supplies special characters, as specified by the character vector names, for the points.
polygon(x, y, ...) Draws a polygon defined by the ordered vertices in (x, y) and (optionally) shade it in with
hatch lines, or fill it if the graphics device allows the filling of figures.
legend(x, y, legend, ...) Adds a legend to the current plot at the specified position. Plotting characters, line
styles, colors etc., are identified with the labels in the character vector legend. At least one other
argument v (a vector the same length as legend) with the corresponding values of the plotting unit
must also be given, as follows:
legend( , fill=v) Colors for filled boxes
legend( , col=v) Colors in which points or lines will be drawn
legend( , lty=v) Line styles
legend( , lwd=v) Line widths
legend( , pch=v) Plotting characters
title(main, sub) Adds a title main to the top of the current plot in a large font and (optionally) a sub-title
sub at the bottom in a smaller font.
axis(side, ...) Adds an axis to the current plot on the side given by the first argument (1 to 4, counting
clockwise from the bottom.) Other arguments control the positioning of the axis within or beside
the plot, and tick positions and labels. Useful for adding custom axes after calling plot() with the
axes=FALSE argument.
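To see how these functions combine, here is a small sketch that starts from an empty frame and adds one element at a time. The data (a sine curve), the shaded band and all labels are invented for illustration.

```r
x <- seq(0, 2 * pi, length.out = 50)
y <- sin(x)
i_max <- which.max(y)   # index of the largest value
i_min <- which.min(y)   # index of the smallest value

# Set up an empty frame: no points (type = "n") and no axes yet
plot(x, y, type = "n", axes = FALSE, xlab = "x", ylab = "sin(x)",
     ylim = c(-1.3, 1.3))
# Shaded band of half-width 0.2 around the curve
polygon(c(x, rev(x)), c(y - 0.2, rev(y + 0.2)), col = "grey90", border = NA)
lines(x, y, col = "blue", lwd = 2)
points(x[c(i_max, i_min)], y[c(i_max, i_min)], pch = 19, col = "red")
text(x[c(i_max, i_min)], y[c(i_max, i_min)], labels = c("max", "min"), pos = 4)
# Custom axes: bottom axis with hand-picked ticks, default left axis
axis(1, at = c(0, pi, 2 * pi), labels = c("0", "pi", "2 pi"))
axis(2)
legend("topright", legend = c("sin(x)", "extremes"),
       col = c("blue", "red"), lty = c(1, NA), pch = c(NA, 19))
title(main = "Low-level additions", sub = "built up one call at a time")
```

Because each call only adds to the current plot, the order matters: the polygon is drawn first so that the line and points sit on top of it.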
To add Greek characters, either specify font type 5 (see below) or use the function expression
[Figure: plot of cos(x) whose title, "A random eqn", is followed by a summation formula]
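A short sketch of the expression route for Greek letters and formulas follows. The particular title and labels are chosen here for illustration; the notation follows the plotmath conventions (see ?plotmath).

```r
phi <- seq(-pi, pi, length.out = 100)
labs <- expression(sin(phi), cos(phi))   # legend labels with a Greek letter

plot(phi, sin(phi), type = "l",
     xlab = expression(phi), ylab = expression(f(phi)),
     main = expression(paste("Mean: ", bar(x) == sum(frac(x[i], n), i == 1, n))))
lines(phi, cos(phi), lty = 2)
legend("topleft", legend = labs, lty = 1:2)
```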
attach(cars)
plot(cars, type = "n", xlab = "Speed [mph]", ylab = "Distance [ft]")
points(speed[speed < 15], dist[speed < 15], pch = "s", col = "blue")
points(speed[speed >= 15], dist[speed >= 15], pch = "f", col = "green")
lines(lowess(cars), col = "red")
legend(5, 120, pch = c("s", "f"), col = c("blue", "green"),
legend = c("Slow", "Fast"))
title("Braking distance of old cars")
[Figure: "Braking distance of old cars", Distance [ft] against Speed [mph]; slow cars are plotted as "s", fast cars as "f", with a lowess line and a legend]
detach(2)
[Figure: "Mean and Median of a Skewed Distribution", a histogram (Frequency against x) annotated with the formulas for the mean and the median]
[Figure: sin φ and cos φ plotted as f(φ) against φ, with Greek letters in the axis labels and legend]
device. See help on par() for more details.
To see a sample of the point types available in R, type
example(pch)
library(manipulate)
manipulate(plot(1:x), x = slider(1, 100))
manipulate(plot(cars, xlim = c(0, x.max), type = type, ann = label,
col = col, pch = pch, cex = cex), x.max = slider(10, 25, step = 5,
initial = 25), type = picker(Points = "p", Line = "l", Step = "s"),
label = checkbox(TRUE, "Draw Labels"), col = picker(red = "red",
green = "green", yellow = "yellow"), pch = picker(`1` = 1, `2` = 2,
`3` = 3, `4` = 4, `5` = 5, `6` = 6, `7` = 7, `8` = 8, `9` = 9,
`10` = 10, `11` = 11, `12` = 12, `13` = 13, `14` = 14, `15` = 15,
`16` = 16, `17` = 17, `18` = 18, `19` = 19, `20` = 20, `21` = 21,
`22` = 22, `23` = 23, `24` = 24), cex = picker(`1` = 1, `2` = 2,
`3` = 3, `4` = 4, `5` = 5, `6` = 6, `7` = 7, `8` = 8, `9` = 9,
`10` = 10))
[Figure: four plots of the cars data (dist against speed) drawn with different settings, for example type = p, col = red, pch = 19, cex = 1 and type = p, col = blue, pch = 21, cex = 0.5]
5.4.2 R Colors
Thus far, we have frequently used numbers in plot to refer to a simple set of colors. The default palette has 8
colors: 1:8 are black, red, green, blue, cyan, magenta, yellow and grey, while 0 is the background color (usually
white). If you provide a number greater than 8, the colors are recycled. Therefore, for plots where other or
greater numbers of colors are required, we need to access a larger palette of colors.
[Figure: points drawn using the numbered palette colors]
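The recycling can be checked directly. The sketch below assumes the default palette has not been changed in the session; the exact shades returned by palette() depend on the R version, so only the length and the wrap-around are relied on here.

```r
pal <- palette()          # the current palette, 8 colors by default
length(pal)

# Color numbers beyond the palette length wrap around: 9 is drawn like 1, 10 like 2, ...
n <- 12
recycled <- pal[(seq_len(n) - 1) %% length(pal) + 1]

plot(seq_len(n), rep(1, n), col = seq_len(n), pch = 19, cex = 3,
     xlab = "color number", ylab = "", yaxt = "n")
```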
R knows over 650 named colors. The list is returned by the function colors(). Have a
look at this list, and maybe search for a set you are interested in.
colors()[1:10]
length(colors())
## [1] 657
## [5] "lightyellow2" "lightyellow3"
## [7] "lightyellow4" "yellow"
## [9] "yellow1" "yellow2"
## [11] "yellow3" "yellow4"
## [13] "yellowgreen"
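The listing above looks like the result of searching the color names for "yellow". The call itself is not shown, so the following is a guess at it, using grep on the vector returned by colors().

```r
# All color names that contain "yellow"
yellows <- grep("yellow", colors(), value = TRUE)
yellows

# Total number of named colors
length(colors())
```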
R also has predefined palettes of colors, which provide complementing or contrasting color sets. For example,
look at the color palette rainbow.
example(rainbow)
For a more complete listing of colors, along with the RGB numbers for each color, the following script
generates a pdf document of several pages which may be a useful reference document for you.
source("http://research.stowers-institute.org/efg/R/Color/Chart/ColorChart.R")
• Sequential palettes are suited to ordered data that progress from low to high. Lightness steps dominate
the look of these schemes, with light colors for low data values to dark colors for high data values.
• Diverging palettes put equal emphasis on mid-range critical values and extremes at both ends of the
data range. The critical class or break in the middle of the legend is emphasized with light colors and
low and high extremes are emphasized with dark colors that have contrasting hues.
• Qualitative palettes do not imply magnitude differences between legend classes, and hues are used to
create the primary visual differences between classes. Qualitative schemes are best suited to repre-
senting nominal or categorical data.
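As an illustration of the three palette types, the sketch below draws one RColorBrewer palette of each kind. The palette names "Blues", "RdBu" and "Set2" are examples picked here, and the block is skipped when RColorBrewer is not installed.

```r
if (requireNamespace("RColorBrewer", quietly = TRUE)) {
  library(RColorBrewer)
  seq_cols  <- brewer.pal(9, "Blues")   # sequential: light (low) to dark (high)
  div_cols  <- brewer.pal(11, "RdBu")   # diverging: light middle, dark contrasting ends
  qual_cols <- brewer.pal(8, "Set2")    # qualitative: distinct hues, no implied order

  op <- par(mfrow = c(3, 1), mar = c(2, 2, 2, 1))
  barplot(rep(1, 9),  col = seq_cols,  main = "Sequential (Blues)", axes = FALSE)
  barplot(rep(1, 11), col = div_cols,  main = "Diverging (RdBu)",   axes = FALSE)
  barplot(rep(1, 8),  col = qual_cols, main = "Qualitative (Set2)", axes = FALSE)
  par(op)
}
```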
library(RColorBrewer)
example(brewer.pal)
I use RColorBrewer to produce nicer colors in clustering heatmaps. For example, consider the US state facts
and figures in the state datasets (see ?state), which include a matrix called state.x77 containing
information on 50 US states (50 rows): population, income, illiteracy, life expectancy, murder, high school
graduation, number of days with frost, and area (8 columns). The default heatmap of these data uses a rather
ugly red-yellow color scheme, which I changed to a red-blue one.
library(RColorBrewer)
hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(500)
heatmap(t(state.x77), col = hmcol, scale = "row")
[Figure: clustered heatmap of t(state.x77) in a red-blue color scheme; the rows are the 8 variables (Area, Frost, Murder, Income, Life Exp, Illiteracy, HS Grad, Population) and the columns are the 50 states]
5.5 Interacting with graphics
R also provides functions which allow users to extract or add information to a plot using a mouse, via
the locator() and identify() functions.
Identify members in a hierarchical cluster analysis of distances between European cities
hca <- hclust(eurodist)
plot(hca)
[Figure: cluster dendrogram of the eurodist distances between European cities, hclust (*, "complete")]
(x <- identify(hca))
x
plot(1:20, rt(20, 1))
text(locator(1), "outlier", adj = 0)
Waits for the user to select locations on the current plot using the left mouse button.
attach(women)
plot(height, weight)
identify(height, weight, labels = row.names(women))
detach(2)
identify(x, y, labels) allows the user to highlight any of the points defined by x and y (using
the left mouse button) by plotting the corresponding component of labels nearby (or the index number of
the point if labels is absent).
Right-click to stop.
5.5.1 Exercise 8 - Plotting
Using the women dataset
3. Switch the orientation: draw weight on the X axis and height on the Y axis.
4. Drawing a new plot, set the pch (point type) to be a solid circle, and color the points red. Add the title
"study of Women" to the plot.
5. Drawing another plot, set the pch (point type) to be a solid square, change the X axis label to
"Weight of Women" and increase the point size (using the parameter cex) to 1.5.
5.6 Saving plots
5.6.1 RStudio
In RStudio, there is a simple interface to export plots. Click on the "Export" button in the plot window.
5.6.2 Devices
R can generate graphics (of varying levels of quality) on almost any type of display or printing device.
Before this can begin, however, R needs to be informed what type of device it is dealing with. This is done
by starting a device driver. The purpose of a device driver is to convert graphical instructions from R ("draw
a line", for example) into a form that the particular device can understand. Device drivers are started by
calling a device driver function. There is one such function for every device driver: type help(Devices) for a
list of them all.
The most useful formats for saving R graphics:
pdf() Produces a PDF file, which can also be included into PDF files.
jpeg() Produces a bitmap JPEG file, best used for image plots.
To save the current image to a file in R, either use the menu File -> Save as, or use the functions
dev2bitmap, dev.copy2eps or dev.copy(device, file), where device can be one of png, jpeg or pdf and file is
your filename.
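For example, the dev.copy route might look like the sketch below. The file name and the use of tempdir() are choices made here. The block at the top is only needed when the code is run as a script: in an interactive session the plot appears on the screen device and can be copied directly.

```r
out_file <- file.path(tempdir(), "myplot_copy.pdf")

if (!interactive()) {
  # Only needed when running as a script: file devices do not record the
  # plot unless the display list is switched on
  pdf(NULL)
  dev.control(displaylist = "enable")
}

plot(1:10, col = "blue", xlab = "X axis", ylab = "Y axis")  # draw on the current device
dev.copy(pdf, file = out_file)   # open a pdf device and replay the plot into it
dev.off()                        # close the pdf device so that the file is written
```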
To find out more about the image formats that can be saved in R, see the help on ?Devices.
If you wish to write an image directly to a file, without "seeing" the plot screen (called X11 or Quartz
depending on the operating system), use the functions pdf(), postscript() or jpeg() with the syntax:
pdf(file = "myplot.pdf")
plot(1:10, col = "blue", xlab = "X axis", ylab = "Y axis")
dev.off()
Remember that it is very important to call dev.off() in order to properly save the file.
To list the graphics devices that are open use dev.list(); dev.cur() returns the current device. When you
have finished with a device, be sure to terminate the device driver by issuing the command dev.off().
If you have opened a device to write to, for example pdf or png, dev.off() ensures that the device finishes
cleanly; for example, in the case of hardcopy devices this ensures that every page is completed and has been
sent to the printer or file.
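As an example of this open, draw, close cycle, the sketch below writes two plots to one PDF file and inspects the device list along the way. The file name, the page size and the plots themselves are arbitrary choices made here.

```r
out_file <- file.path(tempdir(), "two_pages.pdf")

pdf(file = out_file, width = 6, height = 4)   # size in inches
dev.cur()                                     # the pdf device is now the current device
plot(1:10, main = "Page 1")
hist(rnorm(100), main = "Page 2")             # each new plot starts a new page
dev.off()                                     # finish the file

file.exists(out_file)
```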
Chapter 6
Advanced Graphics
6.1.1 ggplot2
qplot is the basic plotting function in the ggplot2 package and is a convenient wrapper for creating a number
of different types of plots using a consistent calling scheme. See http://had.co.nz/ggplot2/book/qplot.pdf
for the chapter in the ggplot2 book which describes the usage of qplot in detail.
A nice introduction to ggplot2 is written by its author Hadley Wickham and is available from
http://www.ceb-institute.org/bbs/wp-content/uploads/2011/09/handout_ggplot2.pdf.
The following examples are taken from that tutorial.
Basic "Quick Plot" aka qplot in ggplot2
require("ggplot2")
data(mtcars)
head(mtcars)
## Hornet 4 Drive 1
## Hornet Sportabout 2
## Valiant 1
levels(mtcars$cyl)
## NULL
[Figure: mpg against wt, points colored by cyl on a continuous scale (legend 4 to 8)]
[Figure: mpg against wt, points colored by factor(cyl) (legend 4, 6, 8)]
mtcars is a dataset from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10
aspects of automobile design and performance (e.g. miles/gallon, number of cylinders, displacement, gross
horsepower, weight, seconds to complete a quarter mile, etc.) for 32 automobiles. In the above plot we see
miles per gallon plotted against weight, with the number of cylinders in the car shown by color.
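The calls that produced the two plots above are not shown. Plots of this kind could be made as sketched below, written with ggplot() rather than qplot() because qplot() is deprecated in recent ggplot2 releases; the object names p1 and p2 are chosen here, and the block is skipped when ggplot2 is not installed.

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # cyl is numeric, so it is mapped to a continuous color scale
  p1 <- ggplot(mtcars, aes(x = wt, y = mpg, colour = cyl)) + geom_point()
  # wrapping it in factor() gives one discrete color per cylinder count
  p2 <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) + geom_point()
  print(p1)
  print(p2)
}
```

The difference between the two legends comes only from the type of the variable: levels(mtcars$cyl) is NULL because cyl is stored as a number, not as a factor.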
In the above plot, we view cylinder by color, but it could also be by shape or size
qplot(wt, mpg, data = mtcars, shape = factor(cyl))
[Figure: mpg against wt, point shape set by factor(cyl)]
qplot(wt, mpg, data = mtcars, size = factor(cyl), colour = factor(cyl))
[Figure: mpg against wt, point size and color set by factor(cyl)]
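The par(mfrow) setting and the layout function don't work with ggplot2, so here is a little script to lay out
multiple plots using ggplot2 (acknowledgement to Stephen Turner). First assign each ggplot2 plot to an
object, and then use the arrange function to display two or more.
library(grid)

# Viewport at a given row and column of the layout
vp.layout <- function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)

arrange <- function(..., nrow = NULL, ncol = NULL, as.table = FALSE) {
    dots <- list(...)
    n <- length(dots)
    # Work out whichever grid dimensions were not supplied
    if (is.null(nrow) && is.null(ncol)) nrow <- floor(sqrt(n))
    if (is.null(nrow)) nrow <- ceiling(n / ncol)
    if (is.null(ncol)) ncol <- ceiling(n / nrow)
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow, ncol)))
    ii.p <- 1
    for (ii.row in seq(1, nrow)) {
        ii.table.row <- ii.row
        if (as.table) {
            ii.table.row <- nrow - ii.table.row + 1
        }
        for (ii.col in seq(1, ncol)) {
            ii.table <- ii.p
            if (ii.p > n)
                break
            # Print each plot into its own cell of the layout
            print(dots[[ii.table]], vp = vp.layout(ii.table.row,
                ii.col))
            ii.p <- ii.p + 1
        }
    }
}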
[Figure: two scatterplots of mpg against wt arranged side by side, each with a factor(cyl) legend]
qplot(wt, mpg, data = mtcars, facets = cyl ~ .)
[Figure: mpg against wt in three stacked panels, one per value of cyl (4, 6, 8)]
More complex facets can be used to view cross-tabulated categories. For example, you would expect a strong
interaction between cylinder and horsepower.
table(mtcars$cyl, mtcars$hp)
##
## 52 62 65 66 91 93 95 97 105 109 110 113 123 150 175 180 205 215
## 4 1 1 1 2 1 1 1 1 0 1 0 1 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 1 0 3 0 2 0 1 0 0 0
## 8 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 3 1 1
##
## 230 245 264 335
## 4 0 0 0 0
## 6 0 0 0 0
## 8 1 2 1 1
[Figure: Miles per Gallon against Weight, faceted by cyl (rows: 4, 6, 8) and hp (columns: 52 to 335)]
In the scatterplot examples above, we implicitly used a point geom, the default when you supply two
arguments to qplot(). qplot can produce several other plots if a different geom is specified
(modified from ggplot2: Elegant Graphics for Data Analysis, Chapter 3).
Plot           Geom      Other features
scatterplot    point
bubblechart    point     size defined by variable
barchart       bar
box-whisker    boxplot
line           line
When given a single vector, the default geom is histogram. Defining geom as density will instead draw
a density (smoothed histogram).
[Figure: qplot examples with different geoms: histograms and density plots of wt (count and density axes, overall and by factor(cyl)), plots of mpg against wt, and counts by factor(cyl)]
6.1.2 lattice
Lattice plots allow the use of the layout on the page to reflect meaningful aspects of data structure. They
offer abilities similar to those in the S-PLUS trellis library.
The lattice package sits on top of the grid package. To use lattice graphics, both these packages must be
installed. Providing it is installed, the grid package will be loaded automatically when lattice is loaded.
Resources for lattice
• To get help on lattice functions, use help just as you would for any package: help(package =
lattice)
weight a numeric vector giving the body weight of the chick (gm).
Time a numeric vector giving the number of days since birth when the measurement was made.
Chick an ordered factor with levels '18' < ... < '48' giving a unique identifier for the chick. The ordering
of the levels groups chicks on the same diet together and orders them according to their final weight
(lightest to heaviest) within diet.
Diet a factor with levels 1,...,4 indicating which experimental diet the chick received.
The figure below shows the style of graph that one can get from xyplot().
library(lattice)
xyplot(weight ~ Time | Diet, data = ChickWeight)  # Simple use of xyplot
[Figure: weight against Time in four panels, one per Diet (1 to 4)]
Here is the statement used to get a figure with the observations for the same Chick connected via lines.
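The statement itself is not shown. A call of the following form draws such a figure; the use of groups = Chick with type = "l" is the usual lattice idiom for this, offered here as a sketch and not as the original code.

```r
library(lattice)

# groups = Chick draws one series per chick; type = "l" joins each
# chick's observations with a line inside its Diet panel
p_lines <- xyplot(weight ~ Time | Diet, data = ChickWeight,
                  groups = Chick, type = "l")
print(p_lines)
```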
[Figure: weight against Time in four panels, one per Diet, with the observations for each chick joined by a line]
This function shows the defaults for the graphical display of Trellis displays
show.settings()
An incomplete list of lattice Functions
splom( ~ data.frame)                 # Scatterplot matrix
bwplot(factor ~ numeric, ...)        # Box and whisker plot
dotplot(factor ~ numeric, ...)       # 1-dim. display
stripplot(factor ~ numeric, ...)     # 1-dim. display
barchart(character ~ numeric, ...)   # Bar chart
histogram( ~ numeric, ...)           # Histogram
densityplot( ~ numeric, ...)         # Smoothed version of histogram
qqmath( ~ numeric, ...)              # QQ plot
parallelplot( ~ dataframe, ...)      # Parallel coordinate plots
In each instance, conditioning variables can be added.
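To make the conditioning idea concrete, here is a sketch with simulated data. The vectors xx, yy and gg are defined here as an assumption: the examples in these notes use vectors with those names, but their definitions are not shown, so the plots produced will not match the figures exactly.

```r
library(lattice)
set.seed(1)                                      # arbitrary seed for repeatability
xx <- rnorm(100)
yy <- rnorm(100)
gg <- factor(sample(1:2, 100, replace = TRUE))   # two-level grouping factor

p_box  <- bwplot(yy ~ gg)        # one box per level of gg
p_cond <- xyplot(yy ~ xx | gg)   # conditioning: one panel per level of gg
print(p_box)
print(p_cond)
```

Writing | 1 after a formula, as in the examples that follow, conditions on a constant and so produces a single panel.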
Examples:
x <- 1:10
y <- 1:10
g <- factor(1:10)
barchart(y ~ g | 1)
[Figure: bar chart of y for each level of g]
bwplot(yy ~ gg | 1)
[Figure: box and whisker plot of yy for the two levels of gg]
densityplot(~ yy | 1)
[Figure: density plot of yy]
histogram(~ yy | 1)
[Figure: histogram of yy, Percent of Total]
qqmath(~ yy | 1)
[Figure: normal QQ plot of yy against qnorm]
xyplot(xx ~ yy | 1)
[Figure: scatterplot of xx against yy]
parallelplot(~ data.frame(x = xx[1:10], y = yy[1:10]) | 1)
[Figure: parallel coordinate plot of x and y, Min to Max]
[Figure: two lattice plots of yyy against xxx]
cloud(zzz ~ xxx + yyy | 1, zlab = NULL, zoom = 0.9,
      par.settings = list(box.3d = list(lwd = 0.01)))
[Figure: 3D cloud plots over xxx and yyy]
Note that lattice plots are highly customizable. However, the base graphics settings, in particular par()
settings, usually have no effect on lattice plots. Use trellis.par.get() and trellis.par.set() to change default plot
parameters.
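A minimal sketch of that get and set cycle follows. The setting changed (plot.symbol) and the values used are examples picked here; names(trellis.par.get()) lists everything that can be changed.

```r
library(lattice)

old <- trellis.par.get("plot.symbol")        # remember the current setting
trellis.par.set(theme = list(plot.symbol = list(col = "red", pch = 19)))

p <- xyplot(weight ~ height, data = women)   # drawn with red solid circles
print(p)

new <- trellis.par.get("plot.symbol")
trellis.par.set(theme = list(plot.symbol = old))   # restore the previous setting
```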
6.2 GoogleVis and GoogleMaps visualization
There are multiple visualization tools available within the googleVis library. These include the Hans Rosling
type bubble plots. See http://code.google.com/apis/visualization/documentation/gallery/motionchart.html
See http://blog.revolutionanalytics.com/graphics/ for some examples of R code.
# install.packages('googleVis')
library(googleVis)
GoogleVis also provides nice support for maps and spatial visualization of trends.
Figure 6.1: This is actually an interactive animated html file
plot(states.Inc)
Figure 6.2: This is actually an interactive html file, if you hover over a state you see its information
6.3 Graph theory and Network visualization using R packages network and igraph
# install.packages('network')
library(network)
g <- network(m)
# Plot the graph
plot(g)
"Medici"))
# install.packages('igraph')
library(igraph)
## is.directed, list.edge.attributes, list.vertex.attributes,
## set.edge.attribute, set.vertex.attribute
6.4 Tag Clouds, Literature Mining
The following script will create a tag cloud given a list of PubMed abstract identifiers.
## --------------------------------------------------------
## Given a list of PMIDs get their annotation
## Aedin, Dec 2011
## To run, give pmids2tagcloud a list of pmids, e.g.
## pmids = c(10521349, 10582678, 11004666, 11108479, 11108479,
##           11114790, 11156382, 11156382, 11156382, 11165872)
## pmids2tagcloud(pmids)
## ---------------------------------------------------------
getPMIDAnnot <- function(pmidlist) {
require(annotate)
require(XML)
print("Using annotate and XML to get info on each PMID")
pubmedRes <- xmlRoot(pubmed(pmidlist))
numAbst <- length(xmlChildren(pubmedRes))
absts <- list()
for (i in 1:numAbst) {
absts[[i]] <- buildPubMedAbst(pubmedRes[[i]])
}
# print(PMIDInfo[1:2,])
return(PMIDInfo)
}
pmids2tagcloud <- function(pmids, addTitle = TRUE,
    colorPalette = c("orange", "cyan", "green4", "maroon", "slateblue")) {
    require(tm)
    require(wordcloud)
    require(RColorBrewer)
    print(paste("Using tm and wordcloud to create tag cloud from",
        length(pmids), "abstracts"))
    pubmedAbsts <- getPMIDAnnot(as.character(unique(pmids)))
    words <- tolower(unlist(strsplit(as.character(pubmedAbsts$abstText), " ")))
    # drop words containing parentheses, comma, [semi-]colon, period, quotation marks
    # (grepl is used because words[-grep(...)] drops every word when nothing matches)
    words <- words[!grepl("[\\)\\(,;:\\.\\'\\\"]", words)]
    # drop purely numeric tokens
    words <- words[!grepl("^\\d+$", words)]
    words <- words[!words %in% stopwords()]
    wt <- table(words)
[Figure: "TagCloud generated from 10 PubMed Abstracts"; words shown include gene, cells, genes, microarray, analysis, expression and classification. PMIDS: 10521349 10582678 11004666 11108479 11108479 11114790 11156382 11156382 11156382 11165872]
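The word counting at the heart of the script can be tried without PubMed access. The sketch below uses base R only; the sentence and the stop-word list are made up for illustration, and, unlike the script above, it strips punctuation from words instead of dropping the words that contain it.

```r
abstract <- "Gene expression data: the gene, the array and the cells (n = 3)."

words <- tolower(unlist(strsplit(abstract, " ")))
words <- gsub("[[:punct:]]", "", words)                 # strip punctuation
words <- words[nchar(words) > 0 & !grepl("^[0-9]+$", words)]   # drop empty and numeric tokens

stop_words <- c("the", "and", "n")                      # hypothetical stop-word list
words <- words[!words %in% stop_words]

wt <- sort(table(words), decreasing = TRUE)             # word frequencies, largest first
wt
```

A table like wt is what a word cloud function takes as input: the names are the words and the counts set the font sizes.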
using Rgraphviz http://www2.warwick.ac.uk/fac/sci/moac/students/peter_cock/r/rgraphviz/
or see the many examples on the Bioconductor website
a recent discussion online about the topic:
http://stats.stackexchange.com/questions/6155/graph-theory-analysis-and-visualization
R cytoscape http://db.systemsbiology.net:8080/cytoscape/RCytoscape/vignette/RCytoscape.html
3. Additional demos available in the graphics package: demo(image), demo(persp) and example(symbols).
The following web pages have many examples (and code) to produce different R plots. Browse through
the plots, see what you like and try some.
• The authors of rggobi and ggplot2 run workshops in R graphics. Their course website provides examples of
basic and advanced plots, animated movies, lectures and R code to reproduce the plots at
http://lookingatdata.com/
• The R Gallery wiki provides examples of R plots and code to reproduce these:
http://addictedtor.free.fr/graphiques/. Here is one random sample from this website:
– To view or change default plot settings: par(). This will change the settings for all subsequent
plot commands.
Chapter 7
5. model.matrix, contrasts
6. Models considered:
• Survival analysis: Surv(), coxph() in the survival package, and functions in the package survcomp
7. Advanced model options are covered in detail in the recommended text of Venables and Ripley.
• The arm package contains R functions for Bayesian inference using lm, glm, mer and polr
objects. The bayesm package is aimed at the marketing and microeconomics fields but includes
functions for Bayes Regression and Hierarchical Linear Models.
• The R package doBy provides facilities for groupwise computations of summary statistics and
other facilities for working with grouped data (similar to what can be achieved by proc means or
proc summary of the SAS system).
• See http://cran.r-project.org/src/contrib/Views/ for lists of more R packages.
7.2 Basic Statistics
7.2.1 Continuous Data: t test
The t.test function performs a one or two sample t test. To see the arguments of t.test, look at the help
documentation
?t.test
Arguments include alternative, which is one of "two.sided", "less" or "greater", and var.equal, which is
a logical (FALSE or TRUE) indicating unequal or equal variance (default is unequal). The input to t.test is
one vector (one sample t test), two vectors or a formula (two sample t test). A formula is given by y ~ x,
where the tilde '~' operator specifies "described by".
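The three input styles can be sketched with the ChickWeight data as follows. The object names are chosen here, and mu = 100 is taken from the one sample output shown in this section; restricting the formula version to diets 1 and 2 with subset is one way (an assumption, not necessarily the original call) to compare just two groups.

```r
# One sample: is the mean chick weight different from 100?
one <- t.test(ChickWeight$weight, mu = 100)

# Two samples given as a formula, restricted to diets 1 and 2
# (Welch test, since var.equal defaults to FALSE)
two <- t.test(weight ~ Diet, data = ChickWeight,
              subset = Diet %in% c("1", "2"))

# The same comparison assuming equal variances, one-sided
pooled <- t.test(weight ~ Diet, data = ChickWeight,
                 subset = Diet %in% c("1", "2"),
                 var.equal = TRUE, alternative = "less")

one$statistic
two$p.value
pooled$method
```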
One sample t test:
data(ChickWeight)
ChickWeight[1:2, ]
##
## One Sample t-test
##
## data: ChickWeight[, 1]
## t = 7.38, df = 577, p-value = 5.529e-13
## alternative hypothesis: true mean is not equal to 100
## 95 percent confidence interval:
## 116.0 127.6
## sample estimates:
## mean of x
## 121.8
##
t.test(ChickWeight$weight[ChickWeight$Diet == "1"],
ChickWeight$weight[ChickWeight$Diet ==
"2"])
##
## Welch Two Sample t-test
##
## data: ChickWeight$weight[ChickWeight$Diet == "1"] and ChickWeight$weight[ChickW
## t = -2.638, df = 201.4, p-value = 0.008995
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -34.900 -5.042
## sample estimates:
## mean of x mean of y
## 102.6 122.6
##
##
## Welch Two Sample t-test
##
## data: weight by Diet
## t = -2.638, df = 201.4, p-value = 0.008995
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -34.900 -5.042
## sample estimates:
## mean in group 1 mean in group 2
## 102.6 122.6
##
##
## Pairwise comparisons using t tests with pooled SD
##
## data: ChickWeight$weight and ChickWeight$Diet
##
## 1 2 3
## 2 0.06838 - -
## 3 2.5e-06 0.14077 -
## 4 0.00026 0.95977 1.00000
##
## P value adjustment method: bonferroni
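The call that produced the pairwise table above is not shown; judging by the "data:" line and the adjustment method in the output, it was presumably of the following form.

```r
# All pairwise diet comparisons, pooled SD, Bonferroni-adjusted p values
pw <- pairwise.t.test(ChickWeight$weight, ChickWeight$Diet,
                      p.adjust.method = "bonferroni")
pw

# The adjusted p values are stored as a lower-triangular matrix
pw$p.value
```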
pVal <- round(p, 3)
Bonferroni <- round(p.adjust(p, "bonferroni"), 3)
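The vector p used in the two lines above is not defined in the text shown. To see what p.adjust does, here is a sketch with a made-up vector of four p values; "BH" (Benjamini-Hochberg) is included for comparison.

```r
p <- c(0.001, 0.012, 0.03, 0.2)   # hypothetical raw p values

# Bonferroni multiplies each p value by the number of tests (capped at 1)
bonf <- p.adjust(p, "bonferroni")

# Benjamini-Hochberg controls the false discovery rate and is less strict
bh <- p.adjust(p, "BH")

data.frame(pVal = p, Bonferroni = round(bonf, 3), BH = round(bh, 3))
```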
##
## Call:
## lm(formula = weight ~ Diet, data = ChickWeight)
##
## Coefficients:
## (Intercept) Diet2 Diet3 Diet4
## 102.6 20.0 40.3 32.6
##
summary(lmDiet)
##
## Call:
## lm(formula = weight ~ Diet, data = ChickWeight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104.0 -53.6 -13.6 40.4 230.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 102.65 4.67 21.96 < 2e-16 ***
## Diet2 19.97 7.87 2.54 0.011 *
## Diet3 40.30 7.87 5.12 4.1e-07 ***
## Diet4 32.62 7.91 4.12 4.3e-05 ***
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##
## Residual standard error: 69.3 on 574 degrees of freedom
## Multiple R-squared: 0.0535,Adjusted R-squared: 0.0485
## F-statistic: 10.8 on 3 and 574 DF, p-value: 6.43e-07
##
anova(lmDiet)
In some statistical packages, the sums of squares are labeled "between groups" and "within groups".
Since lm and anova tables are used for a wide range of statistical models, the output from R is different. The
between groups sum of squares is labeled by the name of the factor grouping (Diet). The within groups sum of
squares is labeled Residuals. The aov function is a wrapper which calls lm, but expresses the results
in the traditional language of the analysis of variance rather than that of linear models.
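The correspondence can be seen by fitting the same model both ways. This sketch reuses the model of weight on Diet from above; the object names tab and aovDiet are chosen here.

```r
lmDiet <- lm(weight ~ Diet, data = ChickWeight)
tab <- anova(lmDiet)
tab                      # rows "Diet" (between groups) and "Residuals" (within groups)

aovDiet <- aov(weight ~ Diet, data = ChickWeight)
summary(aovDiet)         # the same sums of squares in analysis of variance form

# Between and within groups sums of squares
tab[, "Sum Sq"]
```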
For examples of different analyses of variance (using aov), see
http://personality-project.org/r/r.anova.html
## Call:
## aov(formula = weight ~ Diet, data = ChickWeight)
##
## Terms:
## Diet Residuals
## Sum of Squares 155863 2758693
## Deg. of Freedom 3 574
##
## Residual standard error: 69.33
## Estimated effects may be unbalanced
##
## Call:
## lm(formula = weight ~ Diet + Time, data = ChickWeight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -136.8 -17.1 -2.6 15.0 141.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.924 3.361 3.25 0.0012 **
## Diet2 16.166 4.086 3.96 8.6e-05 ***
## Diet3 36.499 4.086 8.93 < 2e-16 ***
## Diet4 30.233 4.107 7.36 6.4e-13 ***
## Time 8.750 0.222 39.45 < 2e-16 ***
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
##
## Residual standard error: 36 on 573 degrees of freedom
## Multiple R-squared: 0.745,Adjusted R-squared: 0.744
## F-statistic: 419 on 4 and 573 DF, p-value: <2e-16
##
anova(lmDiet)
## Response: weight
## Df Sum Sq Mean Sq F value Pr(>F)
## Diet 3 155863 51954 40.1 <2e-16 ***
## Time 1 2016357 2016357 1556.4 <2e-16 ***
## Residuals 573 742336 1296
## ---
## Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
where
Oi = an observed frequency;
Another measure of association is Cramer's V statistic, which is generalizable to rectangular
contingency tables:
V = \sqrt{\frac{\chi^2}{N (k - 1)}},
N being the total number of observations and k being the number of rows or the number of columns,
whichever is less.
See http://en.wikipedia.org/wiki/Contingency_table for more details about measure of
association in a contingency table.
In R, first create a contingency table and then use the function assocstats to compute the Pearson
chi-squared test, the likelihood ratio chi-squared test, the phi coefficient, the contingency coefficient, and
Cramer's V statistic.
## load library
library(vcd)
## load data
attach(Arthritis)
is.factor(Arthritis$Treatment)
## [1] TRUE
print(levels(Arthritis$Treatment))
is.factor(Arthritis$Improved)
## [1] TRUE
print(levels(Arthritis$Improved))
## build the contingency table
tab <- table(Arthritis$Treatment, Arthritis$Improved)
print(tab)
##
## None Some Marked
## Placebo 29 7 7
## Treated 13 7 21
## compute statistics
res <- assocstats(tab)
print(res)
detach(Arthritis)
The structure of the res object can be printed using the str function
str(res)
## List of 5
## $ table : table int [1:2, 1:3] 29 13 7 7 7 21
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "Placebo" "Treated"
## .. ..$ : chr [1:3] "None" "Some" "Marked"
## $ chisq_tests: num [1:2, 1:3] 13.52981 13.05502 2 2 0.00115 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "Likelihood Ratio" "Pearson"
## .. ..$ : chr [1:3] "X^2" "df" "P(> X^2)"
## $ phi : num 0.394
## $ contingency: num 0.367
## $ cramer : num 0.394
## - attr(*, "class")= chr "assocstats"
You can easily access the various statistics from the res object
## Cramer's V statistic
print(res$cramer)
## [1] 0.3942
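As a check on the formula for V, the value can be reproduced in base R from the counts in the table printed above, without the vcd package. The object names are chosen here.

```r
# Counts copied from the Treatment by Improved table above
tab <- matrix(c(29, 13, 7, 7, 7, 21), nrow = 2,
              dimnames = list(c("Placebo", "Treated"),
                              c("None", "Some", "Marked")))

chi2 <- unname(chisq.test(tab)$statistic)   # Pearson chi-squared statistic
N <- sum(tab)                               # total number of observations
k <- min(dim(tab))                          # smaller of the number of rows and columns

V <- sqrt(chi2 / (N * (k - 1)))
V
```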
If you want to compute the agreement between two classifications or raters, you can estimate the κ
coefficient, which can have the following typical values:
Kappa value    Magnitude of agreement
< 0            no
0 - 0.2        small
0.2 - 0.4      fair
0.4 - 0.6      moderate
0.6 - 0.8      substantial
0.8 - 1        almost perfect
## value ASE
## Unweighted -0.04839 0.10073
## Weighted NaN 0.08598
In a practical situation, your Kappa coefficient needs to be over 0.6 to claim that your categorization is
valid. You may also want to report both the agreement (%)
## Agreement: 0.48%
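The code that produced the kappa output above is not shown. To illustrate what the unweighted coefficient measures, here is a base R sketch using the standard definition, kappa = (po - pe) / (1 - pe), where po is the observed agreement and pe the agreement expected by chance. The table of ratings is invented for illustration.

```r
# Hypothetical cross-table of two raters classifying 50 items
ratings <- matrix(c(20, 5, 10, 15), nrow = 2,
                  dimnames = list(rater1 = c("yes", "no"),
                                  rater2 = c("yes", "no")))

n  <- sum(ratings)
po <- sum(diag(ratings)) / n                            # observed agreement
pe <- sum(rowSums(ratings) * colSums(ratings)) / n^2    # agreement expected by chance
kappa_hat <- (po - pe) / (1 - pe)

c(agreement = po, expected = pe, kappa = kappa_hat)
```

Here the raters agree on 70% of the items, but half of that agreement would be expected by chance, so kappa is only 0.4.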
• Continuous Data
– t.test
– bartlett.test: Bartlett’s test of the null that the variances in each of the groups (samples) are the
same
• Non-Parametric
– wilcox.test: one- and two-sample Wilcoxon tests on vectors of data; the latter is also known as
Mann-Whitney test.
• Correlation
– cor, cor.test: correlation and correlation tests. cor.test methods include "kendall", "spearman" or
"pearson".
– chisq.test: chi-squared contingency table tests; fisher.test: exact test for small tables.
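A few of the tests listed above, sketched on the ChickWeight data used earlier. The choice of variables, the restriction to diets 1 and 2 and the small 2 x 2 table are all made here for illustration.

```r
# Equality of variances of weight across the four diets
bt <- bartlett.test(weight ~ Diet, data = ChickWeight)

# Two-sample Wilcoxon (Mann-Whitney) test needs exactly two groups
d12 <- subset(ChickWeight, Diet %in% c("1", "2"))
d12$Diet <- droplevels(d12$Diet)
wt <- wilcox.test(weight ~ Diet, data = d12)

# Rank correlation between weight and time; exact = FALSE because of ties
ct <- cor.test(ChickWeight$weight, ChickWeight$Time,
               method = "spearman", exact = FALSE)

# Exact test for a small (hypothetical) 2 x 2 table
small_tab <- matrix(c(8, 2, 1, 5), nrow = 2)
ft <- fisher.test(small_tab)

c(bartlett = bt$p.value, wilcoxon = wt$p.value,
  spearman = ct$p.value, fisher = ft$p.value)
```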
* As a complete aside, and to continue the stories of Ireland's mathematicians and statisticians which I started with the story of
George Boole, the first professor of mathematics at University College Cork: the t statistic was introduced by William Sealy
Gosset to monitor the quality of brewing in the Guinness brewery in Dublin, Ireland. Guinness had an innovative policy of
recruiting the best graduates from Oxford and Cambridge to apply biochemistry and statistics to its industrial processes.
Gosset published the t test in Biometrika in 1908, under the pen name Student.
7.3 Model formulae and model options
• Most modeling done in a standard way
Basic output of the model fitting process is minimal. Details obtained via extractor functions.
7.3.1 Model formulae
We have already seen in several functions (boxplot, t.test, lm) that a simple model is defined by a formula y ~ x.
We will now discuss formulae in much more detail.
y_i = \sum_{j=0}^{p} \beta_j x_{ij} + \epsilon_i, \quad \epsilon_i \sim NID(0, \sigma^2), \quad i = 1, \ldots, n
y = X\beta + \epsilon
where y is the response vector, X is the model matrix or design matrix with columns x_0, x_1, ..., x_p.
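The matrix form can be checked numerically. The sketch below uses the built-in cars data (a choice made here): it builds X with model.matrix, solves the normal equations (X'X) beta = X'y for the least squares estimate, and compares the result with the coefficients from lm.

```r
X <- model.matrix(dist ~ speed, data = cars)   # columns: intercept (x0 = 1) and speed
y <- cars$dist

# Least squares estimate from the normal equations (X'X) beta = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

fit <- lm(dist ~ speed, data = cars)
cbind(normal_equations = as.vector(beta_hat), lm = coef(fit))
```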
NOTATION:
y, x, x0, x1, x2, ... - numeric variables
A, B, C, ... - factors
General form:
where
library(MASS)
help("cats")
str(cats)
7.3.3 Contrasts, model.matrix
We need to understand how model formulae specify the columns of the model matrix.
1. Continuous variables (simplest): each variable provides a column of the model matrix (and the inter-
cept will provide a column of ones if included in the model).
2. k-level factor A
The answer differs for unordered and ordered factors.
• Unordered factors
k − 1 columns are generated for the indicators of the second, third, . . . , up to kth levels of the
factor. (Implicit parameterization is to contrast the response at each level with that at the first.)
• Ordered factors
k − 1 columns are the orthogonal polynomials on 1, . . . , k, omitting the constant term.
If the intercept is omitted in a model that contains a factor term, the first such term is encoded into k
columns giving the indicators for all the levels.
R default setting is:
contr.treatment(n = 3, base = 2)
##   1 3
## 1 1 0
## 2 0 0
## 3 0 1
contr.sum(n = 3)
##   [,1] [,2]
## 1    1    0
## 2    0    1
## 3   -1   -1
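The rules above can be checked directly with model.matrix(). The following is a minimal sketch on a made-up three-level factor (the names A, Aord, X.int, X.noint and X.ord are hypothetical), assuming the default contrasts options are in effect.

```r
## a made-up unordered factor with k = 3 levels
A <- factor(c("a", "b", "c", "a", "b", "c"))

## with an intercept: a column of ones plus k - 1 indicator columns
X.int <- model.matrix(~ A)
## without an intercept: k indicator columns, one per level
X.noint <- model.matrix(~ A - 1)

## the same factor treated as ordered: k - 1 orthogonal polynomial columns
Aord <- factor(A, ordered = TRUE)
X.ord <- model.matrix(~ Aord)

print(colnames(X.int))
print(colnames(X.noint))
print(colnames(X.ord))
```

Comparing the column names of the three matrices shows which parameterization R has chosen in each case.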
7.4 Exercise 9
##
## Call:
## lm(formula = Hwt ~ Sex, data = cats)
##
## Coefficients:
## (Intercept)         SexM
##        9.20         2.12
##
##
## Call:
## lm(formula = Hwt ~ Sex - 1, data = cats)
##
## Coefficients:
## SexF  SexM
##  9.2  11.3
##
7.5 Output and extraction from fitted models
As mentioned earlier, the printed output of the model fit is minimal. However, the fitted model is stored
in an object, and information about the fit can be displayed, extracted and plotted.
Extractor functions:
Let’s go back to the cats example. Now, we can extract more information about the fits.
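attach(cats)
cats.lmBS <- lm(Hwt ~ Bwt + Sex, data = cats)
coef(cats.lmBS)
# The three objects below are not defined in this extract; these definitions
# are assumed from their names (interaction model and fitted values)
cats.lmBxS <- lm(Hwt ~ Bwt * Sex, data = cats)
fit.catsBS <- fitted(cats.lmBS)
fit.catsBxS <- fitted(cats.lmBxS)
plot(Bwt, Hwt)
lines(Bwt, fit.catsBS, col = "green", lwd = 2) # OR
# abline(cats.lmBS, col='green', lwd=2)
lines(Bwt[Sex == "F"], fit.catsBxS[Sex == "F"], col = "red")
lines(Bwt[Sex == "M"], fit.catsBxS[Sex == "M"], col = "blue")
legend(x = 2, y = 20, legend = c("LMBS", "lmBxS.female", "lmBxS.male"),
col = c("green", "red", "blue"), lwd = c(2, 1, 1))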
[Figure: Hwt plotted against Bwt with the fitted lines; legend: LMBS, lmBxS.female, lmBxS.male]
## 1 2 3 4
## 7.441 11.754 16.067 20.379
detach(cats)
Some more useful, but non-standard, ways of extracting information from a model.
names(summary(obj)) - gives names of the components in summary(obj)
How to get the residual variance of the fit? There are at least 2 ways. The first is the direct calculation
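A minimal sketch of two such ways, using the cats model from above: the direct calculation divides the residual sum of squares by the residual degrees of freedom, and the second route reads the value from the summary object. The names fit, s2.direct and s2.summary are made up for this illustration.

```r
library(MASS)  # provides the cats data set

fit <- lm(Hwt ~ Bwt + Sex, data = cats)

## 1. direct calculation: residual sum of squares / residual degrees of freedom
s2.direct <- sum(resid(fit)^2) / df.residual(fit)

## 2. via the summary object: sigma is the residual standard error
s2.summary <- summary(fit)$sigma^2

print(c(direct = s2.direct, summary = s2.summary))
```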
7.6 Exercise 10: Multivariate linear regression
• Read the data contained in the file lungs.csv from the course website into R. Fit a multivariate
regression model (function lm) of pemax using all variables. Call the result lungFit.
• Which are the most and least significant variables in this model?
7.6.1 Residual plots, diagnostics
Checking the model assumptions is an important part of modeling. Some model diagnostics that are easy
to produce and interpret:
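attach(cats)
# cats.lmB is not defined in this extract; assumed to be the simple
# regression of heart weight on body weight
cats.lmB <- lm(Hwt ~ Bwt, data = cats)
plot(Bwt, Hwt, main = "Model fit")
abline(cats.lmB, col = "green", lwd = 2)
hist(resid(cats.lmB), main = "Residual histogram")
plot(fitted(cats.lmB), resid(cats.lmB))
qqnorm(resid(cats.lmB), main = "QQ-plot of residuals")
qqline(resid(cats.lmB))
detach(cats)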
Figure 7.1: Plots of the fitted model, the residual histogram, residuals vs. fitted values, and the QQ-plot of the residuals
[Figure: diagnostic plots for the fitted model: residuals vs. fitted values, QQ-plot of the standardized residuals against theoretical quantiles, Scale-Location, and Cook's distance]
anova(obj_1, obj_2) - compares two models, where obj_1 and obj_2 are two fitted regression models.
The sums of squares shown are the decrease in the residual sums of squares resulting from an inclusion
of that term in the model at that place in the sequence. Only for orthogonal experiments will the order of
inclusion be inconsequential.
anova(cats.lmB, cats.lmBS)
anova(cats.lmB, cats.lmBxS)
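The sketch below makes the comparison self-contained and checks the statement about sums of squares: the "Sum of Sq" entry reported by anova() equals the drop in the residual sum of squares between the two nested fits. The model definitions are assumed from the object names used in this chapter, and tab and rss.drop are names made up for this example.

```r
library(MASS)  # provides the cats data set

## assumed definitions of the two nested models
cats.lmB <- lm(Hwt ~ Bwt, data = cats)
cats.lmBxS <- lm(Hwt ~ Bwt * Sex, data = cats)

## F test comparing the smaller model with the larger one
tab <- anova(cats.lmB, cats.lmBxS)
print(tab)

## decrease in the residual sum of squares from adding Sex and Bwt:Sex
rss.drop <- deviance(cats.lmB) - deviance(cats.lmBxS)
print(rss.drop)
```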
The update() function allows a model to be fitted that differs from one previously fitted usually by just a
few additional or removed terms.
Syntax:
new.model <- update(old.model, new.formula)
Special name in the new formula - a period ’.’ - can be used to stand for “corresponding part of the old
model formula”.
Example:
Data set mtcars: fuel consumption and 10 aspects of automobile design and performance for 32
automobiles.
help("mtcars")
cars.lm <- lm(mpg ~ hp + wt, data = mtcars)
cars.lm2 <- update(cars.lm, . ~ . + disp)
# cars.lms <- update(cars.lm2, sqrt(.) ~ .)
More details:
dropterm - fits all models that differ from the current model by dropping a single term, maintaining
marginality.
addterm - fits all models that differ from the current model by adding a single term from those supplied,
maintaining marginality.
stepAIC(model.small,
scope = list(upper = model.big, lower = ~1),
test = "F") # for linear models
Example
Data set mtcars
# Stepwise selection
cars.step <- stepAIC(cars.some, scope=list(lower = ~ wt))
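Since the starting model is not shown here, the following is a hedged, self-contained sketch of forward selection with stepAIC() on mtcars. The objects cars.small, cars.big and cars.fwd are assumptions made for this illustration; they are not the models used in the course example.

```r
library(MASS)  # provides stepAIC

## smallest and largest models considered (assumed for this sketch)
cars.small <- lm(mpg ~ wt, data = mtcars)
cars.big <- lm(mpg ~ ., data = mtcars)

## forward selection: start from the small model and add one term at a
## time from the big model as long as the AIC decreases
cars.fwd <- stepAIC(cars.small,
                    scope = list(upper = formula(cars.big), lower = ~ 1),
                    direction = "forward", trace = 0)

print(formula(cars.fwd))
print(c(start = AIC(cars.small), selected = AIC(cars.fwd)))
```

Setting trace = 0 suppresses the step-by-step printout; leaving it at the default shows a table for each step.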
7.7 Cross-validation
Another approach, developed by the machine learning community, is cross-validation. The idea is to
repeatedly divide the dataset into training and test sets, which are used to fit the model and to assess its
performance, respectively. This approach allows every observation to be used both for training and for
testing the prediction model.
Here is an example of a 10-fold cross-validation on the mtcars dataset, where we compare the model
with one variable (wt) to the model with all the variables for predicting mpg. Once the root mean squared
error (RMSE) is computed for each fold, a paired Wilcoxon signed rank test is used to compare the
performance of the small and big models.
nfold <- 10
## nr is the number of observations
nr <- nrow(mtcars)
## nfold is the number of folds in the cross-validation;
## nfold = 1 requests leave-one-out cross-validation
if (nfold > 1) {
    k <- floor(nr/nfold)
} else {
    k <- 1
    nfold <- nr
}
smpl <- sample(nr)
## the vectors below hold the RMSE of each fold (original names kept)
mse.big <- mse.small <- NULL
for (i in 1:nfold) {
    ## indices of the test set for fold i; the last fold takes the remainder
    if (i == nfold) {
        s.ix <- smpl[((i - 1) * k + 1):nr]
    } else {
        s.ix <- smpl[((i - 1) * k + 1):(i * k)]
    }
    ## fit the model
    mm.big <- lm(mpg ~ ., data = mtcars[-s.ix, , drop = FALSE])
    mm.small <- lm(mpg ~ wt, data = mtcars[-s.ix, , drop = FALSE])
    ## assess the performance of the model
    pp.big <- predict(object = mm.big,
        newdata = mtcars[s.ix, !is.element(colnames(mtcars), "mpg")])
    pp.small <- predict(object = mm.small,
        newdata = mtcars[s.ix, !is.element(colnames(mtcars), "mpg")])
    ## compute the root mean squared error (RMSE)
    mse.big <- c(mse.big, sqrt(mean((mtcars[s.ix, "mpg"] - pp.big)^2)))
    mse.small <- c(mse.small, sqrt(mean((mtcars[s.ix, "mpg"] - pp.small)^2)))
}
names(mse.big) <- names(mse.small) <- paste("fold", 1:nfold, sep = ".")
## compare the per-fold errors of the two models
wilcox.test(mse.big, mse.small, paired = TRUE, alternative = "less")
##
## Wilcoxon signed rank test
##
## data: mse.big and mse.small
## V = 35, p-value = 0.7842
## alternative hypothesis: true location shift is less than 0
##
As can be seen, there is not enough evidence in the dataset to claim that the big prediction model
outperforms the small one (p-value > 0.05). You can easily change the number of folds in the
cross-validation by setting the variable nfold to another value; nfold = 1 gives leave-one-out cross-validation.
7.8 Statistical models
We will discuss three classes of statistical models: linear regression, generalized linear models (e.g.
logistic and Poisson regression), and survival models.
weights - (optional) weights used to fit the model by the weighted least squares method
attach(ChickWeight)
## variance of the weights at each time point
time.wgt <- tapply(weight, Time, var)
time.wgt.rep <- as.numeric(time.wgt[match(Time, as.numeric(names(time.wgt)))])
detach(ChickWeight)
Chick.anl <- data.frame(ChickWeight, time.wgt.rep = time.wgt.rep)
## weighted least squares with inverse-variance weights
chick.lm.wgt <- lm(weight ~ Time, data = Chick.anl, weights = 1/time.wgt.rep)
## fit restricted to a subset of the observations
chick.lm.T0 <- lm(weight ~ Time, data = Chick.anl, subset = (Time == 0))
7.8.2 Generalized linear modeling
• One generalization of multiple linear regression.
• The distribution of Y depends on the x’s through a single linear function, the ’linear predictor’
ν = β1 x1 + β2 x2 + . . . + βp xp (7.1)
• The protocols are very similar to linear regression and the inferential logic is virtually identical.
The class of generalized linear models handled by facilities supplied in R includes gaussian, binomial,
poisson, inverse gaussian and gamma response distributions.
Families of distributions and links
Distribution        Link
----------------    -------------------------------
binomial            logit, probit, log, cloglog
gaussian            identity, log, inverse
Gamma               identity, inverse, log
inverse.gaussian    1/mu^2, identity, inverse, log
poisson             identity, log, sqrt
The R function to fit a generalized linear model is glm(), which uses the form
fitted.model <- glm(formula, family = family.generator, data = data.frame)
The only difference from lm() is the family.generator, which is the instrument by which the family is
described. It is the name of a function that generates a list of functions and expressions that together define
and control the model and estimation process.
We will concentrate on the binomial family with the logit link or, as you probably know it, ’logistic
regression’.
Logistic regression
To fit a binomial model using glm() there are three possibilities for the response:
1. If the response is a vector it is assumed to hold binary data, and so must be a 0/1 vector.
2. If the response is a two-column matrix it is assumed that the first column holds the number of
successes for the trial and the second holds the number of failures.
3. If the response is a factor, its first level is taken as failure (0) and all other levels as ’success’ (1).
Syntax:
glm(y ~ x, family = binomial(link = logit), data = data.frame)
The link is optional, since the default link is logit. It must be given if another link is desired, e.g. probit.
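As a small, hedged illustration of this syntax (not part of the course example), the sketch below fits a logistic and a probit model to the 0/1 variable am in mtcars. The choice of am and wt and the object names fit.logit and fit.probit are assumptions made only for this example.

```r
## binary 0/1 response: transmission type (am) as a function of weight (wt)
fit.logit <- glm(am ~ wt, family = binomial, data = mtcars)  # default logit link
fit.probit <- glm(am ~ wt, family = binomial(link = "probit"), data = mtcars)

## the link actually used is stored in the family object
print(family(fit.logit)$link)
print(family(fit.probit)$link)

## coefficients are on the scale of the linear predictor
print(coef(fit.logit))
```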
Example of logistic regression using data set esophagus
Data from a case-control study of (o)esophageal cancer in Ile-et-Vilaine, France containing records for 88
age/alcohol/tobacco combinations with the 3 covariates grouped into 6, 4 and 4 groups respectively.
summary(esoph)
## Start: AIC=298.6
## cbind(ncases, ncontrols) ~ agegp
##
## Df Deviance AIC LRT Pr(Chi)
## + alcgp 3 64.6 230 74.5 4.5e-16 ***
## + tobgp 3 120.0 286 19.1 0.00026 ***
## <none> 139.1 299
## - agegp 5 227.2 377 88.1 < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=230.1
## cbind(ncases, ncontrols) ~ agegp + alcgp
##
## Df Deviance AIC LRT Pr(Chi)
## + tobgp 3 54.0 226 10.6 0.014 *
## <none> 64.6 230
## - agegp 5 138.8 294 74.2 1.4e-14 ***
## - alcgp 3 139.1 299 74.5 4.5e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Step: AIC=225.4
## cbind(ncases, ncontrols) ~ agegp + alcgp + tobgp
##
## Df Deviance AIC LRT Pr(Chi)
## <none> 54.0 226
## - tobgp 3 64.6 230 10.6 0.014 *
## - alcgp 3 120.0 286 66.1 3.0e-14 ***
## - agegp 5 131.5 293 77.5 2.8e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Prediction and residuals
Plot the fitted values for the ’age’ effect
attach(esoph)
eso.pred.age <- predict.glm(eso.base, data.frame(agegp = agegp,
tobgp = rep("30+", 88), alcgp = rep("40-79", 88)), type = "response")
plot(eso.pred.age ~ agegp, ylab = "Predicted Age", xlab = "Age group (True age)")
[Figure: predicted values (y-axis labelled "Predicted Age") plotted by age group]
detach(esoph)
4 types of residuals can be requested for the glm() models: deviance, working, Pearson, response
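A minimal sketch of how these residuals can be requested and plotted side by side. It assumes the model with all three covariates selected in the stepwise output above; eso.fit, res.types and res are names made up for this illustration.

```r
## assumed model: the final model from the stepwise selection above
eso.fit <- glm(cbind(ncases, ncontrols) ~ agegp + alcgp + tobgp,
               family = binomial, data = esoph)

## one column of residuals per type
res.types <- c("deviance", "working", "pearson", "response")
res <- sapply(res.types, function(type) residuals(eso.fit, type = type))

## plot each type against the observation index in a 2 x 2 layout
opar <- par(mfrow = c(2, 2))
for (type in res.types) {
    plot(res[, type], ylab = type)
}
par(opar)  # restore the previous graphical settings
```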
[Figure: Different types of residuals (deviance, working, response, pearson), each plotted against the observation index]
par(opar)
7.9 Survival modeling
Survival analysis is a class of statistical methods for studying the occurrence and timing of events.
These methods are most often applied to the study of deaths but can also handle other kinds of
events, such as the onset of disease or equipment failure. The onset of a disease, for instance,
is a transition from a healthy state to a diseased state. The timing of the event is
also part of the analysis.
Survival data have a common feature, namely censoring, that is difficult to handle with conventional
statistical methods. Consider the following example, which illustrates the problem of censoring. A
sample of breast cancer patients was followed for 10 years after diagnosis. The event of interest
was the appearance of a distant metastasis (a tumor initiated from the primary breast tumor cells and
located in another organ). The aim was to determine how the occurrence and timing of distant
metastasis appearance depended on several variables.
An observation on a random variable t is right-censored if all you know about t is that it is greater
than some value c. In survival analysis, t is typically the time of occurrence for some event, and cases
are right-censored because observation is terminated before the event occurs.
Random censoring occurs when observations are terminated for reasons that are not under the control
of the investigator. This situation can be illustrated in our example. Patients who are still free of
distant metastasis after 10 years are censored by a mechanism identical to that applied to the singly
right-censored data. But some patients may move away, and it may be impossible to contact them.
Some patients may die from another cause. Still other patients may refuse to participate after, say, 5
years. These kinds of censoring are depicted in Figure 7.2, where the symbol ”+” for the patients A
and C indicates that observation is censored at that point in time.
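To make the "+" notation concrete, here is a small sketch with entirely made-up follow-up times; the vectors follow.up and event and the object s are hypothetical and serve only to show how right-censored observations are encoded with Surv() from the survival package.

```r
library(survival)

## hypothetical follow-up times (years) for four patients and
## event indicators: 1 = event observed, 0 = right-censored
follow.up <- c(2.5, 10, 5, 7.1)
event <- c(1, 0, 0, 1)

## combine times and indicators; censored times are printed with a "+"
s <- Surv(follow.up, event)
print(s)

## number of observed events
print(sum(s[, "status"]))
```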
The vast majority of the functions we need for survival analysis are in the package survival.
Check whether the package survival is already loaded into your workspace; if it isn’t, load the
library survival
search()
library(survival)
We will work with the data set ’leukemia’, which contains times of death or censoring in patients with
Acute Myelogenous Leukemia. Survival data are usually stored in a Surv object, which combines
the survival times with the event/censoring indicator.
data(leukemia)
head(leukemia)
## time status x
## 1 9 1 Maintained
## 2 13 1 Maintained
## 3 13 0 Maintained
## 4 18 1 Maintained
## 5 23 1 Maintained
## 6 28 0 Maintained
?Surv
mysurv <- Surv(leukemia$time, leukemia$status)
head(mysurv)
Several methods for survival analysis are implemented in R, mainly in the survival package:
survfit - computes an estimate of a survival curve for censored data using the Kaplan-Meier method,
e.g. survfit(Surv(time, status) ~ group)
survreg - regression for a parametric survival model with special case, the accelerated failure models
that use a log transformation of the response.
We can easily draw the survival curve of patients representing the proportion of patients who survived
over time. We provide the function survfit with the following set of arguments:
leuk.km <- survfit(Surv(time, status) ~ x, data = leukemia)
## use the same line types in the plot and in the legend
plot(leuk.km, lty = 1:2, col = c("darkblue", "darkred"))
legend(100, 1, legend = c("Maintain", "Non-main"), lty = 1:2,
col = c("darkblue", "darkred"))
[Figure: Kaplan-Meier survival curves for the two groups; legend: Maintain, Non-main]
##
## x=Nonmaintained
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 5 12 2 0.8333 0.1076 0.48171 0.956
## 8 10 2 0.6667 0.1361 0.33702 0.860
## 12 8 1 0.5833 0.1423 0.27014 0.801
## 23 6 1 0.4861 0.1481 0.19188 0.730
## 27 5 1 0.3889 0.1470 0.12627 0.650
## 30 4 1 0.2917 0.1387 0.07240 0.561
## 33 3 1 0.1944 0.1219 0.03120 0.461
## 43 2 1 0.0972 0.0919 0.00575 0.349
## 45 1 1 0.0000 NaN NA NA
##
[Figure: Kaplan-Meier survival curves for the two groups; legend: Maintain, Non-main]
## Call:
## survdiff(formula = Surv(time, status) ~ x, data = leukemia)
##
##                  N Observed Expected (O-E)^2/E (O-E)^2/V
## x=Maintained    11        7    10.69      1.27       3.4
## x=Nonmaintained 12       11     7.31      1.86       3.4
##
## Chisq= 3.4 on 1 degrees of freedom, p= 0.0653
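The calls that produce this kind of output are not shown here, so the following is a hedged sketch of the two steps: a Kaplan-Meier fit per group, then a log-rank test comparing the curves. It uses the aml data shipped with the survival package, which holds the same acute myelogenous leukemia data under another name; the object names aml.km, aml.lr and p.value are made up for this example.

```r
library(survival)

## Kaplan-Meier estimate of the survival curve in each group
aml.km <- survfit(Surv(time, status) ~ x, data = aml)
print(summary(aml.km))

## log-rank test comparing the two survival curves
aml.lr <- survdiff(Surv(time, status) ~ x, data = aml)
print(aml.lr)

## p-value from the chi-squared statistic (groups - 1 degrees of freedom)
p.value <- 1 - pchisq(aml.lr$chisq, df = length(aml.lr$n) - 1)
print(p.value)
```

The p-value computed in the last step should agree with the one printed in the survdiff output above.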
7.9.3 Cox proportional hazards model
The (semi-parametric) Cox regression model refers to the method first proposed in 1972 by the British
statistician Cox in his seminal paper “Regression Models and Life Tables”. It is difficult to exaggerate
the impact of this paper. In the 1992 Science Citation Index, it was cited over 800 times, making it the
most highly cited journal article in the entire literature of statistics. In fact, Garfield reported that its
cumulative citation count placed it among the top 100 papers in all branches of science.
This enormous popularity can be explained by the fact that, unlike the parametric methods, Cox’s
method does not require the selection of some particular probability distribution to represent survival
times. For this reason, the method is called semi-parametric. Cox made two significant innovations.
First, he proposed a model that is often referred to as the proportional hazards model. Second, he
proposed a new estimation method that was later named maximum partial likelihood. The term Cox
regression refers to the combination of the model and the estimation method.
Here is an example of Cox regression estimating the benefit of maintaining chemotherapy with
respect to the survival of the patients.
## Call:
## coxph(formula = Surv(time, status) ~ x, data = leukemia)
##
## n= 23, number of events= 18
##
## coef exp(coef) se(coef) z Pr(>|z|)
## xNonmaintained 0.916 2.498 0.512 1.79 0.074 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## xNonmaintained 2.5 0.4 0.916 6.81
##
## Concordance= 0.619 (se = 0.073 )
## Rsquare= 0.137 (max possible= 0.976 )
## Likelihood ratio test= 3.38 on 1 df, p=0.0658
## Wald test = 3.2 on 1 df, p=0.0737
## Score (logrank) test = 3.42 on 1 df, p=0.0645
##
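A minimal sketch of a call that yields a fit of this form, again assuming the aml data from the survival package as a stand-in for leukemia; leuk.cox, hr and hr.ci are hypothetical names. The hazard ratio is the exponentiated coefficient.

```r
library(survival)

## Cox proportional hazards model for the effect of maintenance
leuk.cox <- coxph(Surv(time, status) ~ x, data = aml)
print(summary(leuk.cox))

## hazard ratio (Nonmaintained vs. Maintained) and its 95% confidence interval
hr <- exp(coef(leuk.cox))
hr.ci <- exp(confint(leuk.cox))
print(hr)
print(hr.ci)
```

A hazard ratio above 1 for xNonmaintained means a higher estimated hazard in the non-maintained group, although the confidence interval here includes 1.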
It is not trivial to estimate the relevance of a variable for survival. If the variable is categorical,
you can draw the survival curves and compare them statistically. If the variable of interest is
continuous, you can discretize it arbitrarily (not advisable) or use one of the many performance criteria
published for survival analysis: hazard ratio (see coxph), D.index, concordance.index,
time-dependent ROC curve, Brier score, . . . The survcomp package contains functions to estimate these
criteria.
7.10 Exercise 11: Survival Analysis
– Draw the Kaplan-Meier survival curves for the three groups of patients encoded by ’rx’.
– Use different colors for the curves and plot the lines twice as thick as the default size (parameter
lwd).
– Which color encodes which group? Add a legend to the plot to make this clear.
– Generate a PDF output of the plot and put it in the website dropbox along with your code.
Chapter 8
Solutions to Exercises
Women Data
myURL <- "http://bcb.dfci.harvard.edu/~aedin/courses/Bioconductor/Women.txt"
women <- read.table(myURL, sep = "\t", header = TRUE)
?colnames
women
class(women)
## [1] "data.frame"
str(women)
## 'data.frame': 15 obs. of 3 variables:
## $ height: int 58 59 60 61 62 63 64 65 66 67 ...
## $ weight: int 115 117 120 123 126 129 132 135 139 142 ...
## $ age : int 33 34 37 31 31 34 31 39 35 34 ...
nrow(women)
## [1] 15
ncol(women)
## [1] 3
dim(women)
## [1] 15 3
summary(women)
colMeans(women)
colnames(women)
?colnames
sum(women$weight < 120)
## [1] 2
women[order(women$weight), ]
## height weight age
## 1 58 115 33
## 2 59 117 34
## 3 60 120 37
## 4 61 123 31
## 5 62 126 31
## 6 63 129 34
## 7 64 132 31
## 8 65 135 39
## 9 66 139 35
## 10 67 142 34
## 11 68 146 34
## 12 69 150 36
## 13 70 154 33
## 14 71 159 30
## 15 72 164 37
## [1] 65
## [1] 60
nrow(TG2)
## [1] 60
mean(TG$len)
## [1] 18.81
sd(TG$len)
## [1] 7.649
mean(TG2$len)
## [1] 18.81
sd(TG2$len)
## [1] 7.649
women <- read.table("http://bcb.dfci.harvard.edu/~aedin/courses/R/WomenStats.txt",
sep = "\t", header = TRUE)
nrow(women)
## [1] 16
ncol(women)
## [1] 1
colnames(women)
## [1] "X.html."
summary(women)
## X.html.
## </div> :2
## </body> :1
## </head> :1
## </html> :1
## </td></tr></table> :1
## <body style=background-color:#fff>:1
## (Other) :9
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
## [11] "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T"
## [21] "0" "10" "20" "30" "40" "50" "60" "70" "80" "90"
## [31] "100" "110" "120" "130" "140" "150" "160" "170" "180" "190"
## [41] "200"
## [1] "T" "70" "80" "200" "160" "130" "Q" "M" "100" "120"
## [11] "180" "30" "S" "110" "50" "N" "I" "190" "C" "10"
## [21] "A" "F" "140" "P" "0" "20" "B" "G" "150" "J"
## [31] "40" "L" "60" "90" "H" "E" "170" "R" "K" "D"
## [41] "O"
## [1] "T" "70" "80" "200" "160" "130" "Q" "M" "100" "120"
8.5 Solution to Exercise 5
## [1] 2
## [1] 4
## [1] 8
## [1] 16
## [1] 32
## [1] 64
## [1] 128
## [1] 256
## [1] 512
## [1] 1024
x <- 1
while (2^x < 1000) {
print(2^x)
x <- x + 1
}
## [1] 2
## [1] 4
## [1] 8
## [1] 16
## [1] 32
## [1] 64
## [1] 128
## [1] 256
## [1] 512
require(XML)
## Reads all the tables in the webpage into a list
worldPop <-
readHTMLTable("http://en.wikipedia.org/wiki/World_population")
## [1] "list"
length(worldPop)
## [1] 19
## [1] "NULL"
## [2] "toc"
## [3] "NULL"
## [4] "World population milestones (USCB estimates)"
## [5] "The 10 countries with the largest total population:"
## [6] "10 most densely populated countries (with population above 1"
## [7] "Countries ranking highly in terms of both total population ("
## [8] "NULL"
## [9] "UN (medium variant 2010 revision) and US Census Bureau (De"
## [10] "UN 2008 estimates and medium variant projections (in million"
## [11] "World historical and predicted populations (in millions)[102"
## [12] "World historical and predicted populations by percentage dis"
## [13] "Estimated world population at various dates (in millions)"
## [14] "Starting at 500 million"
## [15] "Starting at 375 million"
## [16] "NULL"
## [17] "NULL"
## [18] "NULL"
## [19] "NULL"
To tidy up this table, let’s look at dates after 1750 AD, so let’s remove rows 1 to 14, as these have only
world population information. We will also remove row 32, which just repeats the column names.
## Now let's check the structure of the table. The data are factors;
## let's convert them to characters as these are easier to edit
str(worldPop)
## $ Oceania : Factor w/ 17 levels "","12.8","14.3",..: 1 1 1 1 1
## $ Notes : Factor w/ 6 levels "","[104]","[105]",..: 2 1 1 3
## chr [1:32, 1:9] "70,000 BC" "10,000 BC" "9000 BC" ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:9] "Year" "World" "Africa" "Asia" ...
## Let's get rid of 'Note 1'. To get rid of the square brackets we
## have to put \\ before them
colnames(worldPop) <- sub("\\[Note 1\\]", "", colnames(worldPop))
str(worldPop)
# In what year did the population of Europe, Africa and Asia exceed
# 500 million?
for (i in c("Europe", "Asia", "Africa")) {
    print(paste(i, min(worldPop$Year[worldPop[, i] > 500]), sep = " "))
}
Now that the data are tidy, let’s create the bonus plot
for (i in 1:length(regions)) {
region <- regions[i]
print(region)
lines(worldPop$Year, worldPop[, region], col = i, type = "l")
}
## [1] "Africa"
## [1] "Asia"
## [1] "Europe"
## [1] "Latin America"
## [1] "Northern America"
## [1] "Oceania"
[Figure: population (millions) over time by region; legend: Africa, Asia, Europe, Latin America, Northern America, Oceania]
## Max. :33.9 Max. :2.00
[Figure: scatterplots of height and weight from the women data; panel title "Study of Women", axis labels "Weight of Women (lbs)" and "height of women (inches)"]
8.9 Solution to Exercise 9
## (Intercept) SexM
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
## 7 1 0
## 8 1 0
## 9 1 0
## 10 1 0
## SexF SexM
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
## 7 1 0
## 8 1 0
## 9 1 0
## 10 1 0
lungFit <- lm(pemax ~ ., data = data.lungs)
summary(lungFit)
##
## Call:
## lm(formula = pemax ~ ., data = data.lungs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.34 -11.53 1.08 13.39 33.41
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 176.058 225.891 0.78 0.45
## age -2.542 4.802 -0.53 0.60
## sex -3.737 15.460 -0.24 0.81
## height -0.446 0.903 -0.49 0.63
## weight 2.993 2.008 1.49 0.16
## bmp -1.745 1.155 -1.51 0.15
## fev1 1.081 1.081 1.00 0.33
## rv 0.197 0.196 1.00 0.33
## frc -0.308 0.492 -0.63 0.54
## tlc 0.189 0.500 0.38 0.71
##
## Residual standard error: 25.5 on 15 degrees of freedom
## Multiple R-squared: 0.637,Adjusted R-squared: 0.42
## F-statistic: 2.93 on 9 and 15 DF, p-value: 0.032
##
resid(lungFit)
## 1 2 3 4 5 6 7 8
## 10.031 -3.414 13.386 -11.532 18.691 -31.552 -11.480 20.034
## 9 10 11 12 13 14 15 16
## -20.307 -13.182 15.646 10.748 -3.664 -33.118 10.460 33.405
## 17 18 19 20 21 22 23 24
## 21.034 -3.002 12.096 1.081 -37.338 11.864 -4.332 -34.233
## 25
## 28.677
## most significant
sort(summary(lungFit)$coefficients[, 4], decreasing = FALSE)[1]
## bmp
## 0.1517
## least significant
sort(summary(lungFit)$coefficients[, 4], decreasing = TRUE)[1]
## sex
## 0.8123
library(survival)
head(colon)
[Figure: Kaplan-Meier survival curves for the colon data by treatment group; legend: Obs, Lev, Lev+5FU]
pdf(file = "Surv.pdf")
plot(colonFit, col = 2:4, lwd = 2)
legend("bottomleft", legend = levels(colon$rx), fill = 2:4)
dev.off()
## pdf
## 2
## Call:
## survdiff(formula = Surv(time, status) ~ rx, data = colon)
##
##              N Observed Expected (O-E)^2/E (O-E)^2/V
## rx=Obs 630 345 299 7.01 10.40
## rx=Lev 620 333 295 4.93 7.26
## rx=Lev+5FU 608 242 326 21.61 33.54
##
## Chisq= 33.6 on 2 degrees of freedom, p= 4.99e-08
## Call:
## coxph(formula = Surv(time, status) ~ rx, data = colon)
##
## n= 1858, number of events= 920
##
## coef exp(coef) se(coef) z Pr(>|z|)
## rxLev -0.0209 0.9793 0.0768 -0.27 0.79
## rxLev+5FU -0.4410 0.6434 0.0839 -5.26 1.5e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## rxLev 0.979 1.02 0.842 1.138
## rxLev+5FU 0.643 1.55 0.546 0.758
##
## Concordance= 0.545 (se = 0.009 )
## Rsquare= 0.019 (max possible= 0.999 )
## Likelihood ratio test= 35.2 on 2 df, p=2.23e-08
## Wald test = 33.1 on 2 df, p=6.45e-08
## Score (logrank) test = 33.6 on 2 df, p=4.99e-08
##
8.12 sessionInfo()
sessionInfo()