Hdat9200ch1 Rcorner
R CORNER
Introduction to R
The origins of R
The origins of R lie in a language called S, which was developed by the AT&T
telecommunications company (formerly the Bell Telephone Company) in the US in the late 1970s
and early 1980s, for the purposes of facilitating statistical analysis of telephone call and
customer data. The S language became a commercial software product called S/Plus, which
rapidly became the language of choice for many statisticians, particularly those developing new
statistical methods.
In the early 1990s, two statisticians at the University of Auckland in New Zealand, Ross Ihaka and
Robert Gentleman, decided to create an alternative to the S language that was closely
modelled on it but could be freely used on computers in their university without having to pay
license fees. They named this language R (after their shared first initial, and as a pun on the
name "S"). Over the course of the 1990s and 2000s, R rapidly became very popular with
statisticians, particularly those who were already using S or S/Plus, and eventually came to be
arguably the dominant programming language for statistics and statistical graphics, and one of
the most popular programming languages for data science, currently only challenged by Python.
The free, open-source licensing for R means that anyone can install it on as many computers as
they wish without having to pay any licensing fees. This is a tremendous advantage: it means
that skills are highly portable between jobs or projects, because the software on which those
skills have been acquired can be installed anywhere. It also facilitates the use of R with 'big data',
which may require the use of many computers simultaneously, all running the same program
code on different parts of the data in parallel.
What is R?
R is a high-level programming language, and as such it can be used for a very wide variety of
tasks.
'High-level' means that the language abstracts away (hides) many of the messy details of writing
program code to run on a computer. This allows the coder to concentrate on the task rather than
on the characteristics of the computer on which it will run. Thus, R code tends to be highly portable:
code written on one type of computer, say, under macOS, will run unchanged (or almost
unchanged) on, say, a Windows- or Linux-based computer.
R requires you to write code to get most things done. There are no point-and-click interfaces
that will do everything that you need to do: writing code is inescapable. But that's a Good
Thing™, because actions performed through code are repeatable and reproducible, and
the code coupled with a good source control system automatically provides an audit trail of how
data has been manipulated and the process of arriving at the final analysis.
R is an interpreted language
This means that when you run R code, a special program called an interpreter reads the R
code and carries it out on-the-fly, rather than the code first being compiled into a standalone
"machine language" program (one level below even assembly language). Internally, the
interpreter may first convert the R code into a lower-level intermediate form (which isn't
designed to be written by humans) before executing it on the computer hardware.
R is a functional language
The sense of functional here is not that "it works well" (it does!), but rather that every operation
in the R language is performed by calling a function, with zero or more arguments passed
to the function.
The function operates on those arguments and returns a value, or some data, or another
function.
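A minimal sketch of what this means in practice. The names squares, make_adder and add_two are hypothetical and chosen only for illustration:

```r
# Even arithmetic operators are functions: these two lines are equivalent.
1 + 2
`+`(1, 2)

# A function can take another function as an argument...
squares <- sapply(c(1, 2, 3), function(x) x^2)
squares

# ...and a function can return another function as its value.
make_adder <- function(n) {
  function(x) x + n
}
add_two <- make_adder(2)
add_two(5)
```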
R is vector-oriented
R is vector-oriented, by default. This means that all data in R are stored in vectors (or two-
dimensional matrices, or multi-dimensional arrays, both of which are actually vectors with two
or more dimensions imposed on them, whereas a plain vector has just one dimension).
A vector is a container for data which can store zero or more values of a particular data type, and
these values can be accessed by an index, where the index is an integer or a name, e.g. the third
value in a vector can be accessed, or the value named "Robert" in a vector can be accessed.
length(a)
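A minimal sketch of indexing by position and by name. The vector a and its names and values are assumptions made for illustration, not taken from the original notes. Note that R counts positions from 1:

```r
# Illustrative named vector; the names and values are assumptions.
a <- c(Ross = 10, Robert = 20, John = 30)

a[3]          # the third value, accessed by integer index
a["Robert"]   # the value named "Robert", accessed by name
length(a)     # the number of values stored in the vector
```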
In fact, "variables" in R are actually just names which refer to vectors or matrices or arrays, or
lists, or other objects, in R.
R stores data as vectors (or matrices or arrays) because it enables almost all arithmetic
operations and functions in R to be vectorised, meaning that a single operation will operate on
every value in the vector, without the need for an explicit loop in your code to process each
value.
An example makes this clear. Notice that comments are preceded with # and that referring to an
object by itself will print that object's value (or a summary of its contents if it is a more complex
object):
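A minimal sketch of such an example, assuming b holds the values 1, 2 and 3 (only the name b, its three elements and the addition of 2 come from the text):

```r
# Create a vector named b with three elements (values assumed)
b <- c(1, 2, 3)

# Referring to the object by itself prints its value
b

# Add 2 to every element in a single vectorised operation
b + 2
```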
Notice that 2 has been added to each of the three elements in the vector named b, without the
need for an explicit loop.
Object names in R
A syntactically valid object name in R consists of letters, numbers and the dot or underline
characters and starts with a letter or the dot not followed by a number. Thus, names such as
".2way" are not valid. Neither are the following reserved words: if, else, repeat,
while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN,
NA, NA_integer_, NA_real_, NA_complex_, NA_character_.
Note that object names can include dots (periods), and many function and argument names in R
take this form, such as data.frame() or na.rm. Current best practice in R is to avoid the use
of a dot in object names, and to use an underscore instead. Object names may not have spaces in
them.
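One way to check these rules is with the base R function make.names(), which converts any string into a syntactically valid name. A candidate is therefore already valid exactly when make.names() leaves it unchanged. The candidate names below are illustrative assumptions:

```r
# Candidate object names (all illustrative)
candidates <- c("toss_outcome", "my.data", ".hidden",
                ".2way", "2way", "my var", "if")

# A name is syntactically valid if make.names() does not need to alter it
is_valid <- candidates == make.names(candidates)

data.frame(candidates, is_valid)
```

The first three candidates come out as valid. The rest fail because they start with a dot followed by a number, start with a number, contain a space, or are a reserved word.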
White space
R largely ignores "whitespace" (spaces, tabs and indentation) between the elements of a statement.
Program code is usually written one complete statement to a line. If a line of code contains an
incomplete statement, the R interpreter will look for the completion of the statement on the
next line.
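A short sketch of both points. The object names x, y and z are hypothetical:

```r
# Extra spaces (or none at all) do not change the result
x<-1+2
y <-    1   +   2

# The first line ends with an operator, so the statement is incomplete
# and R continues reading it on the next line
z <- 1 +
  2
```

All three objects end up holding the value 3. Ending the first line with the operator is what tells R that more is to come.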
Comments
To write a comment, start it with a hash symbol # (also known as a pound symbol); R ignores
everything from the hash to the end of that line. There is no specific provision for multiline
comments in R - just prepend each line of the comment with a hash.
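For instance, in this small sketch (the objects radius and area are hypothetical):

```r
# This whole line is a comment and is ignored by R.
# A longer comment is simply several lines,
# each starting with its own hash.
radius <- 2             # a comment can also follow code on the same line
area <- pi * radius^2
area
```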
Individual help pages can also be accessed offline while using R by typing a question mark
followed by a function or command name in a code cell in Jupyter Notebook e.g. ?order will
display the help page for the order() function.
One caveat about the official R help pages: they are written with experienced, highly technical
readers in mind, and can often seem inscrutable or almost deliberately difficult to understand.
However, most help pages include example code to help demonstrate what they are
documenting.
• the Safari book (freely available through UNSW library in e-book online format) Learning
R by Richard Cotton.
Note the assignment operator <- (read as 'gets'; a left angle bracket followed by a hyphen) in R.
This assigns whatever is on the right to the object on the left. We do not favour the use of '='
for assignment in R. Many R users consider this to be bad practice!
White space is not important in R; notice how I leave a space after a comma to aid readability of
the code. Equally, you can split long lines of code over multiple lines.
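A cell of the kind described here might look like the following sketch. The values 0 and 1 follow the tails/heads coding used in these notes, but the exact form of the original cell is an assumption:

```r
# Possible outcomes of a coin toss: 0 = tails, 1 = heads (assumed form)
toss_outcome <- c(0, 1)
```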
Running the code cell creates toss_outcome. If we wish to see the object, we can just call it thus:
toss_outcome
Alternatively, we can print it explicitly:
print(toss_outcome)
This is preferable, as it makes it explicit that we wish to view the object 'toss_outcome'.
Note that we are using 0/1 to denote Tails/Heads. This is standard practice for binary (No/Yes)
outcomes. However, we could have used strings (enclosed in quotation marks), if we so desired:
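A sketch of the string version. The name toss_outcome_chr is hypothetical, used here so that the 0/1 vector is not overwritten:

```r
# The same two outcomes stored as character strings instead of 0/1
toss_outcome_chr <- c("Tails", "Heads")
toss_outcome_chr
```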
Now that we have our vector of possible outcomes from tossing a coin, let's simulate a single coin
toss in R:
sample(toss_outcome, size = 1)
How did I know to use the sample() function? The easiest way to discover the required function is
a simple Google search with 'R' included in the search terms. The R help can be accessed by '?'
(or '??' for a general search) e.g.:
? sample
We now have the R help for the sample() function. Let's take a closer look.
The function is sample(). The things inside the brackets are known as the function arguments, or
just the arguments. Think of these as the tuning parameters that can be varied to provide the
required flexibility. For example, suppose we wish to simulate 10 coin tosses. We simply specify
the 'size' argument to be 10, thus:
sample(toss_outcome, size = 10)
Mmmm, an error - notice the third argument 'replace=FALSE'. This means that if we don't
specify the 'replace' argument in our function call, then it will default to FALSE. This means that
after the 1st coin toss, our vector 'toss_outcome' only has 1 value left to sample (the opposite of
what was tossed in the first toss). After the 2nd coin toss, there are no more options left. Thus,
we need to specify 'replace=TRUE'. This means we put the result of the first toss back into the list
of possible outcomes for the 2nd toss, and so on...
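The following sketch shows both the failing and the working call. It assumes toss_outcome is the 0/1 vector described earlier. The use of tryCatch() to capture the error message, and the names result and tosses, are illustrative additions:

```r
toss_outcome <- c(0, 1)  # assumed definition: 0 = tails, 1 = heads

# Without replacement only two draws are possible, so asking for 10 fails;
# tryCatch() captures the error message instead of stopping the script
result <- tryCatch(
  sample(toss_outcome, size = 10),
  error = function(e) conditionMessage(e)
)
result

# With replacement, every toss is drawn from the full set of outcomes
tosses <- sample(toss_outcome, size = 10, replace = TRUE)
tosses
```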
Note that if we specify the arguments in the default order, we do not need to name the
argument. Thus:
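A sketch of the positional form, assuming the same 0/1 vector (the name positional is hypothetical). The arguments of sample() are matched in the order x, size, replace:

```r
toss_outcome <- c(0, 1)  # assumed definition: 0 = tails, 1 = heads

# Unnamed arguments are matched by position: x, then size, then replace
positional <- sample(toss_outcome, 10, TRUE)
positional
```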
However, naming the arguments is useful if we wish to pass the arguments 'out of order' e.g.:
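A sketch of the named form, again assuming the 0/1 vector (the name out_of_order is hypothetical). Because every argument is named, the order in which they are written does not matter:

```r
toss_outcome <- c(0, 1)  # assumed definition: 0 = tails, 1 = heads

# Named arguments can be supplied in any order
out_of_order <- sample(replace = TRUE, size = 10, x = toss_outcome)
out_of_order
```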
Notice how the last 3 simulations of 10 coin tosses give different results. set.seed() allows us to
set the starting point, or seed, for the pseudo-random number generator, or PRNG, e.g.:
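A sketch of seeding the generator, assuming the 0/1 vector and the seed value 1010 that is used later in these notes. The names first_run and second_run are hypothetical. The exact values drawn can differ between R versions, because the default sampling method changed in R 3.6.0, but within one installation the same seed always gives the same draws:

```r
toss_outcome <- c(0, 1)  # assumed definition: 0 = tails, 1 = heads

# Setting the same seed before each call makes the "random" draws repeatable
set.seed(1010)
first_run <- sample(toss_outcome, 10, TRUE)

set.seed(1010)
second_run <- sample(toss_outcome, 10, TRUE)

first_run
identical(first_run, second_run)  # TRUE: both runs give the same tosses
```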
This is good news for reproducible coding! I can now tell you that you just got the result:
1 1 1 1 1 1 0 1 1 0
Finally, we can use table() to count the number of tails and heads:
set.seed(1010)
tosses <- sample(toss_outcome, 10, TRUE)
table(tosses)
# Could instead wrap one function call inside another, i.e.
table(sample(toss_outcome, 10, TRUE))