Hdat9200ch1 Rcorner

Uploaded by

Mohiuddin Ahmed

© Copyright 2025 UNSW Sydney. All rights reserved except where otherwise stated.

R CORNER
Introduction to R
The origins of R
The origins of R lie in a language called S, which was developed by the AT&T
telecommunications company (formerly the Bell Telephone Company) in the US in the late 1970s
and early 1980s, for the purposes of facilitating statistical analysis of telephone call and
customer data. The S language became a commercial software product called S/Plus, which
rapidly became the language of choice for many statisticians, particularly those developing new
statistical methods.

In the early 1990s, two statisticians at the University of Auckland in New Zealand, Ross Ihaka and
Robert Gentleman, decided to create an alternative to the S language that was closely modelled
on it but could be used freely on computers in their university without having to pay
license fees. They named this language R (after their shared first initial, and as a pun on the
name "S"). Over the course of the 1990s and 2000s, R rapidly became very popular with
statisticians, particularly those who were already using S or S/Plus, and eventually came to be
arguably the dominant programming language for statistics and statistical graphics, and one of
the most popular programming languages for data science, currently only challenged by Python.

R is a fully-featured, 'industrial-strength' programming language which is mature and highly
trusted, and is very suitable for most data science and data analysis tasks.

Free, open-source software


Open-source software is software that is made available under a licence that allows others to
use and/or modify the software code. Although this sounds like a recipe for chaos and anarchy,
in practice it works very well and open-source software is now very widely used almost
everywhere, including in mission-critical areas. Indeed, much of the internet runs on computers
running the Linux operating system: Linux is open-source software.

The free, open-source licensing for R means that anyone can install it on as many computers as
they wish without having to pay any licensing fees. This is a tremendous advantage: it means
that skills are highly portable between jobs or projects, because the software on which those
skills have been acquired can be installed anywhere. It also facilitates the use of R with 'big data',
which may require the use of many computers simultaneously, all running the same program
code on different parts of the data in parallel.

What is R?
R is a high-level programming language, and as such it can be used for a very wide variety of
tasks.

'High-level' means that the language abstracts away (hides) many of the messy details of writing
program code to run on a computer. This allows the coder to concentrate on the task rather than
on the characteristics of the computer on which it will run. Thus, R code tends to be highly
portable: code written on one type of computer, say, under macOS, will run unchanged (or almost
unchanged) on, say, a Windows- or Linux-based computer.

R requires you to write code to get most things done. There are no point-and-click interfaces
that will do everything that you need to do: writing code is inescapable. But that's a Good
Thing™, because actions performed through code are repeatable and reproducible, and
the code, coupled with a good source control system, automatically provides an audit trail of how
the data have been manipulated and of the process of arriving at the final analysis.

R is an interpreted language
This means that when you run R code, a special program called an interpreter converts the R
code on-the-fly into a lower-level intermediate language (which isn't designed to be written by
humans), and that is then converted into "machine language" (one level below even assembly
language) as the program is actually executing on the computer hardware.

R is a functional language
The sense of functional here is not that "it works well" (it does!), but rather that every operation
in the R language is performed by calling a function, with zero or more arguments passed as
parameters to the function.

The function operates on those arguments and returns a value, or some data, or another
function.
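
To make the idea concrete, here is a minimal sketch (not part of the original notes) showing that even an arithmetic operator is a function call, and that functions can be passed to, and returned from, other functions. The name make_adder is a hypothetical helper invented for illustration.

```r
# Operators are ordinary functions: backticks let us call `+` by name.
sum_infix  <- 2 + 3
sum_prefix <- `+`(2, 3)

# Functions are values, so they can be passed as arguments ...
squares <- sapply(c(1, 2, 3), function(x) x^2)

# ... and returned from other functions.
# make_adder is a hypothetical helper, used only for this illustration.
make_adder <- function(n) {
  function(x) x + n
}
add_ten <- make_adder(10)

sum_infix
sum_prefix
squares
add_ten(5)
```

Both forms of the addition give the same value, because the infix form is simply a more convenient way of writing the function call.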

R is vector-oriented
R is vector-oriented by default. This means that all data in R are stored in vectors (or
two-dimensional matrices, or multi-dimensional arrays, both of which are actually vectors with two
or more dimensions imposed on them, whereas a vector has just one dimension).

A vector is a container for data which can store zero or more values of a particular data type, and
these values can be accessed by an index, where the index is an integer or a name. For example,
the third value in a vector can be accessed, or the value named "Robert" in a vector can be accessed.
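
As a small illustrative sketch, the two kinds of index look like this in practice. The vector scores and its values are hypothetical, made up purely for the example. Note that positions in R are counted from 1.

```r
# A named numeric vector (names and values are hypothetical)
scores <- c(Ross = 72, Robert = 85, Ada = 91)

# access by integer position (R counts from 1)
third <- scores[3]

# access by name
robert <- scores["Robert"]

# a vector can also hold zero values
empty <- numeric(0)

third
robert
length(empty)
```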

Even when a single value (sometimes referred to as a scalar) is assigned to a "variable" in R, in
fact it is being assigned to a vector of length one. For example:

a <- 3
length(a)

In fact, "variables" in R are actually just names which refer to vectors or matrices or arrays, or
lists, or other objects, in R.

R stores data as vectors (or matrices or arrays) because it enables almost all arithmetic
operations and functions in R to be vectorised, meaning that a single operation will operate on
every value in the vector, without the need for an explicit loop in your code to process each
value.

An example makes this clear. Notice that comments are preceded with # and that referring to an
object by itself will print that object's value (or a summary of its contents if it is a more complex
object):

# assign a vector of three numbers to b
# (the c() function concatenates or combines its arguments into a vector)
b <- c(3, 5, 6)

# show the contents of b
b

# now add 2 to b and show the result
b <- b + 2
b

Notice that 2 has been added to each of the three elements in the vector named b, without the
need for an explicit loop.
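
For comparison, the following sketch writes the same addition as an explicit loop and then in vectorised form. The names looped and vectorised are introduced here only for illustration.

```r
b <- c(3, 5, 6)

# explicit loop: process one element at a time
looped <- numeric(length(b))
for (i in seq_along(b)) {
  looped[i] <- b[i] + 2
}

# vectorised: one operation applied to every element
vectorised <- b + 2

# both approaches give the same result
identical(looped, vectorised)
```

The vectorised form is shorter and is the idiomatic choice in R.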

R is strongly typed and uses dynamic variable declaration


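This means that atomic vectors (including matrices and arrays) can only hold one type of data, e.g.
integer numbers, or floating point numbers, or character strings, and that the type of an object
is determined when a value is assigned to it rather than being declared in advance.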
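
The sketch below, added as an illustration rather than taken from the original notes, shows both points. When values of different types are combined, R converts them to a single common type so that the vector still holds only one type, and a name can be re-used for a value of a different type without any declaration.

```r
# one type per atomic vector
ints <- c(1L, 2L, 3L)      # the L suffix requests integers
typeof(ints)

# mixing types converts every element to one common type
mixed_num <- c(1L, 2.5)    # integer + floating point
typeof(mixed_num)
mixed_chr <- c(1, "two")   # number + character string
mixed_chr

# no declaration is needed: the type follows the assigned value
x <- 3
class(x)
x <- "three"
class(x)
```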

Object names in R
A syntactically valid object name in R consists of letters, numbers and the dot or underline
characters and starts with a letter or the dot not followed by a number. Thus, names such as
".2way" are not valid. Neither are the following reserved words: if, else, repeat,
while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN,
NA, NA_integer_, NA_real_, NA_complex_, NA_character_.
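
One way to check a candidate name, sketched here under the assumption that base R's make.names() leaves an already-valid name unchanged and alters an invalid one, is to compare the name with what make.names() returns. The helper is_valid_name is hypothetical and not part of R.

```r
# is_valid_name is a hypothetical helper, not a base R function.
# make.names() returns its input unchanged only if it is already
# a syntactically valid, non-reserved name.
is_valid_name <- function(x) {
  make.names(x) == x
}

is_valid_name("toss_outcome")  # letters and underscore
is_valid_name(".hidden")       # dot followed by a letter
is_valid_name(".2way")         # dot followed by a number
is_valid_name("if")            # reserved word
is_valid_name("my var")        # contains a space
```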

Note that object names can include dots (periods), and many function and argument names in R
take this form, such as data.frame() or na.rm. Current best practice in R is to avoid the use
of a dot in object names, and to use an underscore instead. Object names may not have spaces in
them.
White space
R doesn't care about "whitespace" (spaces and tabs, and indenting).

Program code is normally written one complete statement to a line. If a line of code contains an
incomplete statement, the R interpreter will look for the completion of the statement on the
next line.
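
A brief sketch of both points follows. The object names are made up for the example. Ending the first line with an operator is what tells R that the statement is not yet complete.

```r
# spacing inside a statement makes no difference
tight<-c(1,2,3)
spaced <- c( 1,   2,   3 )
identical(tight, spaced)

# the first line ends with +, so the statement is incomplete
# and R continues reading on the next line
total <- 1 + 2 +
  3
total
```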

Comments
A comment is written by prepending the text with a hash symbol # (also known as a pound
symbol). There is no specific provision for multiline comments in R - just prepend each line of
the comment with a hash.

# this is a comment, and is ignored by R
# so is this

c <- 3 + 4 # this is a line of code containing a complete expression

Accessing the R help system


R has an extensive help system that documents almost every aspect of the R language and
analysis ecosystem. All the documentation is available online via the R web site, as well as from
third-party providers. Be warned, the documentation is very, very extensive - it runs to about
3500 pages just for the core R system and standard packages alone, and thus is best treated as a
reference resource rather than a manual that can be read from cover-to-cover.

Individual help pages can also be accessed offline while using R by typing a question mark
followed by a function or command name in a code cell in Jupyter Notebook e.g. ?order will
display the help page for the order() function.

One caveat about the official R help pages: they are written with experienced, highly technical
readers in mind, and can often seem inscrutable or almost deliberately difficult to understand.
However, most help pages include example code to help demonstrate what they are
documenting.

Additional resource for learning R programming


These R Corners will build weekly into a complete set, providing all the R knowledge required for
this course. However, if you wish to learn about R programming in greater depth,
then the following resource is recommended:

• the Safari book (freely available through UNSW library in e-book online format) Learning
R by Richard Cotton.

Hints for completion of Phase I R exercises


Let's assume we wish to use R to simulate the toss of a fair coin. For each activity, the first step is
to devise an appropriate vector of possible outcomes from which to randomly allocate. The
second step is to then use the sample() function to generate the random allocation.
toss_outcome <- c(0, 1)

The above code creates an R object 'toss_outcome', which is a vector [0,1].

Note the assignment operator <- (read as 'gets'; a left arrow made from a less-than sign and a
dash) in R. This assigns whatever is on the right to the object on the left. We do not favour the
use of '=' for assignment in R. Many R users consider this to be bad practice!

We use 'c' (concatenate) to join elements of the vector.

White space is not important in R; notice how I leave a space after a comma to aid readability of
the code. Equally, you can split long lines of code over multiple lines.

Running the code cell creates toss_outcome. If we wish to see the object, we can just call it thus:

toss_outcome

Alternatively, we can wrap the object in the print() function:

print(toss_outcome)

This is preferable, as it makes it explicit that we wish to view the object 'toss_outcome'.

Note that we are using 0/1 to denote Tails/Heads. This is standard practice for binary (No/Yes)
outcomes. However, we could have used strings (enclosed in quotation marks), if we so desired:

toss_outcome_string <- c("Tails", "Heads")


print(toss_outcome_string)

Now that we have our vector of possible outcomes from tossing a coin, let's simulate a single coin
toss in R:

sample(toss_outcome, size = 1)

How did I know to use the sample() function? The easiest way to discover the required function is
a simple Google search with 'R' included in the search terms. The R help can be accessed by '?'
(or '??' for a general search) e.g.:

? sample

We now have the R help for the sample() function. Let's take a closer look.

sample(x, size, replace = FALSE, prob = NULL)
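
The same information can also be inspected from code. As a small sketch, base R's formals() returns the argument list of a function together with any default values. The name sample_args is introduced here only for illustration.

```r
# list the arguments of sample() and their defaults
sample_args <- formals(sample)

names(sample_args)     # the argument names, in their default order
sample_args$replace    # the default value of 'replace'
```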

The function is sample(). The things inside the brackets are known as the function arguments, or
just the arguments. Think of these as the tuning parameters that can be varied to provide the
required flexibility. For example, suppose we wish to simulate 10 coin tosses. We simply specify
the 'size' argument to be 10, thus:
sample(toss_outcome, size = 10)

Mmmm, an error - notice the third argument 'replace = FALSE'. This means that if we don't
specify the 'replace' argument in our function call, then it will default to FALSE, so sampling is
done without replacement. After the 1st coin toss, our vector 'toss_outcome' has only 1 value left
to sample (the opposite of what was tossed in the first toss). After the 2nd coin toss, there are
no more options left. Thus, we need to specify 'replace = TRUE'. This means we put the first toss
back into the list of possible outcomes for the 2nd toss, and so on...

sample(toss_outcome, size = 10, replace = TRUE)
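
The failing call and the working call can be compared side by side without stopping the session. The sketch below uses tryCatch() to capture the error message instead of halting. The names attempt and tosses are chosen only for this illustration.

```r
toss_outcome <- c(0, 1)

# without replacement: asking for 10 values from 2 fails,
# so we capture the error message rather than stopping
attempt <- tryCatch(
  sample(toss_outcome, size = 10),
  error = function(e) conditionMessage(e)
)
attempt

# with replacement: each toss draws from both outcomes again
tosses <- sample(toss_outcome, size = 10, replace = TRUE)
length(tosses)
```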

Note that if we specify the arguments in the default order, we do not need to name the
argument. Thus:

sample(toss_outcome, 10, TRUE)

However, naming the arguments is useful if we wish to pass the arguments 'out of order' e.g.:

sample(toss_outcome, replace = TRUE, size = 10)

Notice how the last 3 simulations of 10 coin tosses give different results. set.seed() allows us to
set the starting point, or seed, for the pseudo-random number generator, or PRNG, e.g.:

set.seed(1010) # For reproducibility


sample(toss_outcome, 10, TRUE)

This is good news for reproducible coding! I can now tell you that you just got the result:

1 1 1 1 1 1 0 1 1 0
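
A simple way to confirm the effect of the seed on your own machine is to run the same simulation twice after setting the same seed, and compare the results. This is a sketch only, and the names first and second are introduced for illustration.

```r
toss_outcome <- c(0, 1)

set.seed(1010)
first <- sample(toss_outcome, 10, TRUE)

set.seed(1010)   # reset to the same starting point
second <- sample(toss_outcome, 10, TRUE)

# the two runs are the same because the seed was the same
identical(first, second)
```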

Finally, we can use table() to count the number of tails and heads:

set.seed(1010)
tosses <- sample(toss_outcome, 10, TRUE)
table(tosses)
# Could wrap function in function, i.e. table(sample(toss_outcome, 10, TRUE))

This gives 2 tails and 8 heads.
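
If labelled counts are preferred to 0 and 1, one possible approach (a sketch, not part of the original exercise) is to convert the tosses to a factor before tabulating. The names labelled and counts are hypothetical.

```r
toss_outcome <- c(0, 1)

set.seed(1010)
tosses <- sample(toss_outcome, 10, TRUE)

# attach the labels Tails/Heads to the codes 0/1
labelled <- factor(tosses, levels = c(0, 1), labels = c("Tails", "Heads"))

counts <- table(labelled)
counts               # counts of each outcome
prop.table(counts)   # the same counts expressed as proportions
```

Declaring both levels explicitly means that an outcome which never occurs is still reported, with a count of zero.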
