Sampler

O'Reilly Ebooks offers lifetime access to purchased ebooks in five DRM-free formats, allowing users to read on various devices. The document includes details about the book 'R in a Nutshell, Second Edition' by Joseph Adler, covering topics from R basics to advanced data visualization and statistical analysis. It also provides information on purchasing options and the book's publication details.

O’Reilly Ebooks—Your bookshelf on your devices!

When you buy an ebook through oreilly.com you get lifetime access to the book, and
whenever possible we provide it to you in five, DRM-free file formats—PDF, .epub,
Kindle-compatible .mobi, Android .apk, and DAISY—that you can use on the devices of
your choice. Our ebook files are fully searchable, and you can cut-and-paste and print
them. We also alert you when we’ve updated the files with corrections and additions.

Learn more at ebooks.oreilly.com


You can also purchase O’Reilly ebooks through the
iBookstore, the Android Marketplace, and Amazon.com.

Spreading the knowledge of innovators oreilly.com


R in a Nutshell, Second Edition
by Joseph Adler

Copyright © 2012 Joseph Adler. All rights reserved.


Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online
editions are also available for most titles (http://my.safaribooksonline.com). For more
information, contact our corporate/institutional sales department: 800-998-9938 or
corporate@oreilly.com.

Editors: Mike Loukides and Meghan Blanchette
Production Editor: Holly Bauer
Proofreader: Julie Van Keuren
Indexer: Fred Brown
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrators: Robert Romano and Rebecca Demarest

September 2009: First Edition.


October 2012: Second Edition.

Revision History for the Second Edition:


2012-09-25 First release
See http://oreilly.com/catalog/errata.csp?isbn=9781449312084 for release details.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered
trademarks of O’Reilly Media, Inc. R in a Nutshell, the image of a harpy eagle, and related
trade dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in this book, and O’Reilly Media,
Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and
author assume no responsibility for errors or omissions, or for damages resulting from the use
of the information contained herein.

ISBN: 978-1-449-31208-4

[LSI]

1348585490
Table of Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

Part I. R Basics

1. Getting and Installing R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3


R Versions 3
Getting and Installing Interactive R Binaries 3
Windows 4
Mac OS X 5
Linux and Unix Systems 5

2. The R User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7


The R Graphical User Interface 7
Windows 8
Mac OS X 8
Linux and Unix 9
The R Console 11
Command-Line Editing 13
Batch Mode 13
Using R Inside Microsoft Excel 14
RStudio 15
Other Ways to Run R 17

3. A Short R Tutorial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Basic Operations in R 19
Functions 21
Variables 22

Introduction to Data Structures 24
Objects and Classes 27
Models and Formulas 28
Charts and Graphics 30
Getting Help 35

4. R Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
An Overview of Packages 37
Listing Packages in Local Libraries 38
Loading Packages 40
Loading Packages on Windows and Linux 40
Loading Packages on Mac OS X 40
Exploring Package Repositories 41
Exploring R Package Repositories on the Web 42
Finding and Installing Packages Inside R 42
Installing Packages From Other Repositories 45
Custom Packages 45
Creating a Package Directory 45
Building the Package 47

Part II. The R Language

5. An Overview of the R Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51


Expressions 51
Objects 52
Symbols 52
Functions 52
Objects Are Copied in Assignment Statements 54
Everything in R Is an Object 55
Special Values 55
NA 55
Inf and -Inf 56
NaN 56
NULL 56
Coercion 56
The R Interpreter 57
Seeing How R Works 59

6. R Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Constants 63
Numeric Vectors 63
Character Vectors 64
Symbols 65
Operators 66
Order of Operations 67

Assignments 69
Expressions 69
Separating Expressions 69
Parentheses 70
Curly Braces 70
Control Structures 71
Conditional Statements 71
Loops 72
Accessing Data Structures 75
Data Structure Operators 75
Indexing by Integer Vector 76
Indexing by Logical Vector 78
Indexing by Name 79
R Code Style Standards 80

7. R Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Primitive Object Types 83
Vectors 86
Lists 87
Other Objects 88
Matrices 88
Arrays 89
Factors 89
Data Frames 91
Formulas 92
Time Series 94
Shingles 95
Dates and Times 95
Connections 96
Attributes 96
Class 99

8. Symbols and Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101


Symbols 101
Working with Environments 102
The Global Environment 103
Environments and Functions 104
Working with the Call Stack 104
Evaluating Functions in Different Environments 105
Adding Objects to an Environment 107
Exceptions 108
Signaling Errors 108
Catching Errors 109

9. Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
The Function Keyword 111

Arguments 111
Return Values 113
Functions as Arguments 113
Anonymous Functions 114
Properties of Functions 115
Argument Order and Named Arguments 117
Side Effects 118
Changes to Other Environments 118
Input/Output 119
Graphics 119

10. Object-Oriented Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121


Overview of Object-Oriented Programming in R 122
Key Ideas 122
Implementation Example 123
Object-Oriented Programming in R: S4 Classes 129
Defining Classes 129
New Objects 130
Accessing Slots 130
Working with Objects 131
Creating Coercion Methods 131
Methods 132
Managing Methods 133
Basic Classes 134
More Help 135
Old-School OOP in R: S3 135
S3 Classes 135
S3 Methods 136
Using S3 Classes in S4 Classes 137
Finding Hidden S3 Methods 137

Part III. Working with Data

11. Saving, Loading, and Editing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141


Entering Data Within R 141
Entering Data Using R Commands 141
Using the Edit GUI 142
Saving and Loading R Objects 145
Saving Objects with save 145
Importing Data from External Files 146
Text Files 146
Other Software 154
Exporting Data 155
Importing Data From Databases 156
Export Then Import 156

Database Connection Packages 156
RODBC 157
DBI 167
TSDBI 172
Getting Data from Hadoop 172

12. Preparing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173


Combining Data Sets 173
Pasting Together Data Structures 174
Merging Data by Common Fields 177
Transformations 179
Reassigning Variables 179
The Transform Function 179
Applying a Function to Each Element of an Object 180
Binning Data 185
Shingles 185
Cut 186
Combining Objects with a Grouping Variable 187
Subsets 187
Bracket Notation 188
subset Function 188
Random Sampling 189
Summarizing Functions 190
tapply, aggregate 190
Aggregating Tables with rowsum 193
Counting Values 194
Reshaping Data 196
Data Cleaning 205
Finding and Removing Duplicates 205
Sorting 206

Part IV. Data Visualization

13. Graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213


An Overview of R Graphics 213
Scatter Plots 214
Plotting Time Series 220
Bar Charts 222
Pie Charts 226
Plotting Categorical Data 227
Three-Dimensional Data 232
Plotting Distributions 239
Box Plots 242
Graphics Devices 246
Customizing Charts 247



Common Arguments to Chart Functions 247
Graphical Parameters 247
Basic Graphics Functions 257

14. Lattice Graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267


History 267
An Overview of the Lattice Package 268
How Lattice Works 268
A Simple Example 268
Using Lattice Functions 270
Custom Panel Functions 272
High-Level Lattice Plotting Functions 272
Univariate Trellis Plots 273
Bivariate Trellis Plots 297
Trivariate Plots 305
Other Plots 310
Customizing Lattice Graphics 312
Common Arguments to Lattice Functions 312
trellis.skeleton 313
Controlling How Axes Are Drawn 314
Parameters 315
plot.trellis 319
strip.default 320
simpleKey 321
Low-Level Functions 322
Low-Level Graphics Functions 322
Panel Functions 323

15. ggplot2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325


A Short Introduction 325
The Grammar of Graphics 328
A More Complex Example: Medicare Data 333
Quick Plot 342
Creating Graphics with ggplot2 343
Learning More 347

Part V. Statistics with R

16. Analyzing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351


Summary Statistics 351
Correlation and Covariance 354
Principal Components Analysis 357
Factor Analysis 360
Bootstrap Resampling 361



17. Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
Normal Distribution 363
Common Distribution-Type Arguments 366
Distribution Function Families 366

18. Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371


Continuous Data 371
Normal Distribution-Based Tests 372
Non-Parametric Tests 385
Discrete Data 388
Proportion Tests 388
Binomial Tests 389
Tabular Data Tests 390
Non-Parametric Tabular Data Tests 396

19. Power Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397


Experimental Design Example 397
t-Test Design 398
Proportion Test Design 398
ANOVA Test Design 400

20. Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401


Example: A Simple Linear Model 401
Fitting a Model 403
Helper Functions for Specifying the Model 404
Getting Information About a Model 404
Refining the Model 410
Details About the lm Function 410
Assumptions of Least Squares Regression 412
Robust and Resistant Regression 414
Subset Selection and Shrinkage Methods 416
Stepwise Variable Selection 416
Ridge Regression 417
Lasso and Least Angle Regression 418
elasticnet 419
Principal Components Regression and Partial Least Squares Regression 420
Nonlinear Models 420
Generalized Linear Models 421
glmnet 424
Nonlinear Least Squares 427
Survival Models 428
Smoothing 433
Splines 433
Fitting Polynomial Surfaces 435

Kernel Smoothing 436
Machine Learning Algorithms for Regression 437
Regression Tree Models 439
MARS 450
Neural Networks 455
Projection Pursuit Regression 459
Generalized Additive Models 462
Support Vector Machines 464

21. Classification Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467


Linear Classification Models 467
Logistic Regression 467
Linear Discriminant Analysis 472
Log-Linear Models 476
Machine Learning Algorithms for Classification 477
k Nearest Neighbors 477
Classification Tree Models 478
Neural Networks 482
SVMs 483
Random Forests 483

22. Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485


Market Basket Analysis 485
Clustering 490
Distance Measures 490
Clustering Algorithms 491

23. Time Series Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495


Autocorrelation Functions 495
Time Series Models 496

Part VI. Additional Topics

24. Optimizing R Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503


Measuring R Program Performance 503
Timing 503
Profiling 504
Monitor How Much Memory You Are Using 505
Profiling Memory Usage 506
Optimizing Your R Code 507
Using Vector Operations 507
Lookup Performance in R 509
Use a Database to Query Large Data Sets 516
Preallocate Memory 516

Cleaning Up Memory 516
Functions for Big Data Sets 517
Other Ways to Speed Up R 518
The R Byte Code Compiler 518
High-Performance R Binaries 520

25. Bioconductor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525


An Example 525
Loading Raw Expression Data 526
Loading Data from GEO 530
Matching Phenotype Data 532
Analyzing Expression Data 533
Key Bioconductor Packages 537
Data Structures 541
eSet 541
AssayData 543
AnnotatedDataFrame 543
MIAME 544
Other Classes Used by Bioconductor Packages 545
Where to Go Next 546
Resources Outside Bioconductor 546
Vignettes 546
Courses 547
Books 547

26. R and Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549


R and Hadoop 549
Overview of Hadoop 549
RHadoop 554
Hadoop Streaming 568
Learning More 571
Other Packages for Parallel Computation with R 571
Segue 571
doMC 572
Where to Learn More 572

Appendix: R Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675

Chapter 1: Getting and Installing R

This chapter explains how to get R and how to install it on your computer.

R Versions
Today, R is maintained by a team of developers around the world. Usually, there is
an official release of R twice a year, in April and in October. I’ve checked the code
in this book against 2.15.1, but if you have an earlier or later version of R installed,
don’t worry.
R hasn’t changed that much in the past few years: usually there are some bug fixes,
some optimizations, and a few new functions in each release. There have been some
changes to the language, but most of these are related to somewhat obscure features
that won’t affect most users. (For example, the type of NA values in incompletely
initialized arrays was changed in R 2.5.) Don’t worry about using the exact version
of R that I used in this book; any results you get should be very similar to the results
shown in this book. If there are any changes to R that affect the examples in this
book, I’ll try to add them to the official errata online.
Additionally, I’ve given some example filenames below for the current release. The
filenames usually have the release number in them. So don’t worry if you’re reading
this book and don’t see a link for R-2.15.1-win32.exe but see a link for
R-2.73.5-win32.exe instead; just use the latest version and you should be fine.
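
As a rough illustration of the naming pattern described above, the following shell sketch builds a Windows installer filename from a release number. The function name `installer_name` is hypothetical, and the pattern is inferred only from the example filenames in this section; the actual filenames on a mirror may differ.

```shell
#!/bin/sh
# Hypothetical helper: build an installer filename that follows the
# R-<version>-win32.exe pattern used in the examples in this section.
installer_name() {
    printf 'R-%s-win32.exe\n' "$1"
}

# Print the filename for the release this book was checked against.
installer_name 2.15.1
```

Passing a different release number produces the corresponding filename, which is all the advice above amounts to: the name changes with the release, and the installation steps stay the same.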

Getting and Installing Interactive R Binaries


R has been ported to every major desktop computing platform, and because it is open
source, developers have also ported it to many other platforms. Additionally, R is
available with no license fee.
If you’re using a Mac or a Windows machine, you’ll probably want to download the
files yourself and then run the installers. (If you’re using Linux, I recommend using
a package management system like Yum to simplify the installation and updating
process; see “Linux and Unix Systems” on page 5.) Here’s how to find the binaries.
1. Visit the official R website. On the site, you should see a link to “Download.”
2. The download link actually takes you to a list of mirror sites. The list is organized
by country. You’ll probably want to pick a site that is geographically close,
because it’s likely to also be close on the Internet, and thus fast. I usually use
the link for the University of California, Los Angeles, because I live in California.
3. Find the right binary for your platform and run the installer.
There are a few things to keep in mind, depending on what system you’re using.

Building R from Source


It’s standard practice to build R from source on Linux and Unix systems, but not
on Mac OS X or Windows platforms. It’s pretty tricky to build your own binaries
on Mac OS X or Windows, and it doesn’t yield a lot of benefits for most users.
Building R from source won’t save you space (you’ll probably have to download
a lot of other stuff, like LaTeX), and it won’t save you time (unless you already
have all the tools you need and have a really, really slow Internet connection). The
best reason to build your own binaries is to get better performance out of R, but
I’ve never found R’s performance to be a problem, even on very large
data sets. If you’re interested in how to build your own R, see “Building your
own” on page 521.

Windows
Installing R on Windows is just like installing any other piece of software on Windows,
which means that it’s easy if you have the right permissions, difficult if you
don’t. If you’re installing R on your personal computer, this shouldn’t be a problem.
However, if you’re working in a corporate environment, you might run into some
trouble.
If you’re an “Administrator” or “Power User” on Windows XP, installation is
straightforward: double-click the installer and follow the on-screen instructions.
There are some known issues with installing R on Microsoft Windows Vista. In
particular, some users have problems with file permissions. Here are two approaches
for avoiding these issues:
• Install R as a standard user in your own file space. This is the simplest approach.
• Install R as the default Administrator account (if it is enabled and you have
access to it). Note that you will also need to install packages as the Administrator
user.
For a full explanation, see http://cran.r-project.org/bin/windows/base/rw-FAQ.html
#Does-R-run-under-Windows-Vista_003f.
Currently, CRAN releases only 32-bit builds of R for Microsoft Windows. These are
tested on 64-bit versions of Windows and should run correctly.



Mac OS X
The current version of R runs on both PowerPC- and Intel-based Mac systems running
Mac OS X 10.5 (Leopard) and higher. If you’re using an older operating system,
or an older computer, you can find older versions on the website that may work
better with your system.
You’ll find three different R installers for Mac OS X: a three-way universal binary
for Mac OS X 10.5 (Leopard) and higher, a legacy universal binary for Mac OS X
10.4 and higher with supplemental tools, and a legacy universal binary for Mac
OS X 10.4 and higher without supplemental tools. See the CRAN download site for
more details on the differences among these versions.
As with most applications, you’ll need to have the appropriate permissions on your
computer to install R. If you’re using your personal computer, you’re probably OK:
you just need to remember your password. If you’re using a computer managed by
someone else, you may need that person’s help to install R.
The universal binary of R is made available as an installer package; simply download
the file and double-click the package to install the application. The legacy R installers
are packaged on a disk image file (like most Mac OS X applications). After you
download the disk image, double-click it to open it in the Finder (if it does not
open automatically). Open the volume and double-click the R.mpkg icon to launch
the installer. Follow the directions in the installer, and you should have a working
copy of R on your computer.

Linux and Unix Systems


Before you start, make sure that you know the system’s root password or have sudo
privileges on the system you’re using. If you don’t, you’ll need to get help from the
system administrator to install R.
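
One way to check these privileges before starting is sketched below. The function name `privilege_status` is hypothetical. The sketch assumes that root has user ID 0 and that `sudo -n true` succeeds only when sudo can run without prompting, so a result of "unknown" does not rule out sudo access that requires a password.

```shell
#!/bin/sh
# Report whether the current user can probably install system software.
# Assumption: root has user ID 0; "sudo -n true" succeeds only when sudo
# can run without prompting (for example, with cached credentials).
privilege_status() {
    if [ "$(id -u)" -eq 0 ]; then
        echo "root"
    elif command -v sudo >/dev/null 2>&1 && sudo -n true 2>/dev/null; then
        echo "sudo"
    else
        # sudo may still work after a password prompt; this check cannot tell.
        echo "unknown"
    fi
}

privilege_status
```

If the result is "unknown", running `sudo -v` interactively is a reasonable next step before asking the system administrator for help.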

Installation using package management systems


On a Linux system, the easiest way to install R is to use a package management
system. These systems automate the installation process: they fetch the R binaries
(or sources), get any other software that’s needed to run R, and even make upgrading
to the latest version easy.
For example, on Red Hat (or Fedora), you can use Yum (which stands for
“Yellowdog Updater, Modified”) to automate the installation. On a 64-bit x86
Linux platform, open a terminal window and type:
$ sudo yum install R.x86_64

You’ll be prompted for your password, and if you have sudo privileges, R should be
installed on your system. Later, you can update R by typing:
$ sudo yum update R.x86_64

And, if there is a new version available, your R installation will be upgraded to the
latest version.
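
The package name in the commands above carries an architecture suffix. The following sketch derives that suffix from `uname -m` and prints the resulting commands without running them. The helper name `r_package_name` is hypothetical, and the sketch assumes the Yum architecture suffix matches the machine type reported by `uname -m`, which holds for x86_64 but may not hold on every platform.

```shell
#!/bin/sh
# Hypothetical helper: build the architecture-qualified package name
# (for example R.x86_64). Uses the first argument if given, otherwise
# the machine type reported by uname.
# Assumption: the Yum architecture suffix equals the `uname -m` output.
r_package_name() {
    printf 'R.%s\n' "${1:-$(uname -m)}"
}

# Print the commands rather than running them, so nothing is installed.
echo "sudo yum install $(r_package_name)"
echo "sudo yum update $(r_package_name)"
```

Printing the commands first makes it easy to confirm the package name before anything is installed.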



If you’re using another Unix system, you may also be able to install R. (For example,
R is available through the FreeBSD Ports system at http://www.freebsd.org/cgi/cvsweb
.cgi/ports/math/R/.) I haven’t tried these versions but have no reason to think they
don’t work correctly. See the documentation for your system for more information
about how to install software.

Installing R from downloaded files


If you’d like, you can manually download R and install it later. Currently, there are
precompiled R packages for several flavors of Linux, including Red Hat, Debian,
Ubuntu, and SUSE. Precompiled binaries are also available for Solaris.
On Red Hat–style systems, you can install these packages through the Red Hat
Package Manager (RPM). For example, suppose that you downloaded the file
R-2.15.1.fc10.i386.rpm to the directory ~/Downloads. Then you could install it with
a command like:
$ sudo rpm -i ~/Downloads/R-2.15.1.fc10.i386.rpm

For more information on using RPM, or other package management systems, see
your user documentation.
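
After any of the installation methods above, a quick check that R is reachable from the shell can be sketched as follows. The function name `check_r` is hypothetical. The sketch only tests whether an executable named `R` is on the PATH and, if so, prints the first line of `R --version`.

```shell
#!/bin/sh
# Check whether an R executable is on the PATH and, if so, print the
# first line of its version banner.
check_r() {
    if command -v R >/dev/null 2>&1; then
        R --version | head -n 1
        return 0
    fi
    echo "R not found on PATH"
    return 1
}

# "|| true" keeps the script going when R is not installed yet.
check_r || true
```

If the function reports that R was not found, the installation may have failed, or R may have been installed somewhere that is not on the PATH.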



