(Ca) Unit-Iii
Unit – 3: Introduction to R and getting started with R
What is R? Why R? advantages of R over other programming languages, Data types in R-logical,
numeric, integer, character, double, complex, raw, coercion, ls() command, expressions, variables
and functions, control structures, Array, Matrix, Vectors, R packages.
…………………………………………………………………………………………………………………..
Statistical computing and high-scale data analysis tasks needed a new category of computer language
besides the existing procedural and object-oriented programming languages, which would support
these tasks instead of developing new software. There is plenty of data available today which can be
analysed in different ways to provide a wide range of useful insights for multiple operations in various
industries. Problems such as the lack of support, tools and techniques for varied data analysis have
been solved with the introduction of one such language called R.
1. What is R?
R is a scripting or programming language which provides an environment for statistical computing,
data science and graphics.
It was inspired by, and is mostly compatible with, the statistical language S developed at Bell
Laboratories (formerly AT&T, now Lucent Technologies). Although there are some very important
differences between R and S, much of the code written for S runs unaltered on R.
R has become one of the most widely used tools for computational statistics, visualisation and
data science.
2. Why Use R?
• It is a great resource for data analysis, data visualization, data science and machine learning
• It provides many statistical techniques (such as statistical tests, classification, clustering and data
reduction)
• It is easy to draw graphs in R, like pie charts, histograms, box plots, scatter plots, etc.
• It works on different platforms (Windows, Mac, Linux)
• It is open-source and free
• It has large community support
• It has many packages (libraries of functions) that can be used to solve different problems
• R is free. It is available under the terms of the Free Software Foundation’s GNU General Public
License in source code form.
www.anuupdates.org
• It is available for Windows, Mac and a wide variety of Unix platforms (including FreeBSD,
Linux, etc.).
• In addition to enabling statistical operations, it is a general programming language, so that you
can automate your analyses and create new functions.
• R has excellent tools for creating graphics such as bar charts, scatter plots, multipanel lattice
charts, etc.
• It has an object oriented and functional programming structure along with support from a
robust and vibrant community.
• R has a flexible analysis tool kit, which makes it easy to access data in various formats,
manipulate it (transform, merge, aggregate, etc.), and subject it to traditional and modern
statistical models (such as regression, ANOVA, tree models, etc.)
• R can be extended easily via packages. It relates easily to other programming languages.
Existing software as well as emerging software can be integrated with R packages to make
them more productive.
• R can easily import data from MS Excel, MS Access, MySQL, SQLite, Oracle etc. It can easily
connect to databases using ODBC (Open Database Connectivity Protocol) and ROracle
package.
3. Advantages of R over other programming languages
Languages such as Python also support statistical computing and data visualisation alongside
general-purpose programming. For statistical work, however, R is often preferred over Python and
similar languages for the following two reasons.
1. Python relies on third-party libraries for statistical computing and data visualisation, whereas
R provides most of this functionality in its standard distribution.
For example, R provides the lm function for linear regression. Data is passed to the function, which
returns an object with detailed information about the regression, from which the coefficients,
standard errors, residual values and so on can be obtained.
Python has no built-in equivalent of lm; comparable functionality is obtained from third-party
libraries such as SciPy, NumPy and so on. Hence, R can fit the model with a single line of code
instead of taking support from third-party libraries.
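The single-line fit described above can be sketched as follows. This is only an illustration: the mtcars data set ships with R, and the choice of mpg and wt as variables is an assumption made for the example, not something taken from these notes.

```r
# Fit a simple linear regression using only base R (no extra packages).
# mtcars is a sample data set bundled with R; mpg ~ wt is an illustrative choice.
fit <- lm(mpg ~ wt, data = mtcars)

# The fitted object carries the coefficients and residuals.
coefs <- coef(fit)
res <- residuals(fit)

# summary() adds standard errors, t-values and p-values for each coefficient.
fit_summary <- summary(fit)
print(fit_summary$coefficients)
```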
2. R's fundamental data type is the vector, which can be organised and aggregated in different
ways while the core stays the same. Because every element of a vector must have the same type, the
vector is a rigid type and imposes some limitations on the language. However, it gives R a strong
logical base. Building on vectors, R provides data frames, which resemble a matrix with attributes
and have an internal structure similar to a spreadsheet or a relational database table. Hence, R
follows a column-wise data structure based on the aggregation of vectors.
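As a minimal sketch of this column-wise design, the data frame below is assembled from three vectors. The column names and values are hypothetical and chosen only for illustration.

```r
# Three vectors of equal length, each with a single data type (hypothetical values).
name   <- c("Asha", "Ravi", "Meena")
marks  <- c(82.5, 74, 91)
passed <- marks >= 75          # a logical vector derived from marks

# A data frame aggregates the vectors as columns.
students <- data.frame(name, marks, passed)
str(students)

# Each column can be pulled back out as an ordinary vector.
avg_marks <- mean(students$marks)
print(avg_marks)
```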
4. Data Types in R
R is a programming language. Like other programming languages, R also makes use of variables to
store varied information. This means that when variables are created, locations are reserved in the
computer’s memory to hold the related values. The number of locations or size of memory reserved
is determined by the data type of the variables. Data type essentially means the kind of value which
can be stored, such as boolean, numbers, characters, etc. In R, however, variables are not declared
with a data type.
Variables in R are used to store some R objects and the data type of the R object becomes the data
type of the variable. The most popular (based on usage) R objects are:
• Vector
• List
• Matrix
• Array
• Factor
• Data Frames
A vector is the simplest of all R objects, and its elements can be of various data types. All other
R objects are built on these atomic vectors. The data types supported by R are:
• Logical
• Numeric (including Integer)
• Character
• Double
• Complex
• Raw
The class() function can be used to reveal the data type.
1. Logical Data type: takes one of two values, TRUE or FALSE (abbreviated T or F)
> TRUE
[1] TRUE
> class(TRUE)
[1] "logical"
> T
[1] TRUE
> class(T)
[1] "logical"
> FALSE
[1] FALSE
> class(FALSE)
[1] "logical"
> F
[1] FALSE
> class(F)
[1] "logical"
2. Numeric Data types
> 2
[1] 2
> class(2)
[1] "numeric"
> 76.25
[1] 76.25
> class(76.25)
[1] "numeric"
Integer
The integer data type is a subclass of the numeric data type. Note the use of "L" as a suffix to a
numeric value, which marks it as an "integer".
> 2L
[1] 2
> class(2L)
[1] "integer"
Functions such as is.numeric() and is.integer() can be used to test the data type.
> is.numeric(2)
[1] TRUE
> is.numeric(2L)
[1] TRUE
> is.integer(2)
[1] FALSE
> is.integer(2L)
[1] TRUE
Note: Integers are numeric but NOT all numbers are integers.
3. Character Data types
> "Data Science"
[1] "Data Science"
> class("Data Science")
[1] "character"
The is.character() function can be used to ascertain if a value is a character.
> is.character("Data Science")
[1] TRUE
4. Double (for double precision floating point numbers)
By default, numbers are of type "double" unless an L suffix is added to mark the number as an
integer.
> typeof(76.25)
[1] "double"
5. Complex Data types
> 5 + 5i
[1] 5+5i
> class(5 + 5i)
[1] "complex"
6. Raw Data types
> charToRaw("Hi")
[1] 48 69
> class(charToRaw("Hi"))
[1] "raw"
The typeof() function can also be used to check the data type (as shown).
> typeof(5 + 5i)
[1] "complex"
> typeof(charToRaw("Hi"))
[1] "raw"
> typeof("DataScience")
[1] "character"
> typeof(2L)
[1] "integer"
> typeof(76.25)
[1] "double"
5. Coercion
Coercion converts one data type to another, e.g. the logical value TRUE, when converted to numeric,
yields 1. Likewise, the logical value FALSE yields 0.
> as.numeric(TRUE)
[1] 1
> as.numeric(FALSE)
[1] 0
Numeric 5 can be converted to character 5 using as.character().
> as.character(5)
[1] "5"
> as.integer(5.5)
[1] 5
When the character value "hi" is converted to the numeric data type, as.numeric() returns NA with a warning.
> as.numeric("hi")
[1] NA
Warning message:
NAs introduced by coercion
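Besides the explicit as.*() conversions shown above, R also coerces values implicitly when different types are combined in one vector. The sketch below illustrates this behaviour; the variable names are hypothetical.

```r
# Combining logical and numeric values: logicals are coerced to numbers.
mixed_num <- c(1, TRUE, FALSE)
print(class(mixed_num))

# Adding a character value coerces every element to character.
mixed_chr <- c(1, "a", TRUE)
print(mixed_chr)

# Logical values are coerced to 1 and 0 in arithmetic, so sum() counts the TRUEs.
true_count <- sum(c(TRUE, FALSE, TRUE))
print(true_count)
```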
However, R does have a print() function available if you want to use it. This might be useful if you
are familiar with other programming languages, such as Python, which often use a print() function
to output variables.
> name <- "John Doe"
2.4 Number of Arguments
By default, a function must be called with the correct number of arguments. If your function
expects 2 arguments, you have to call it with exactly 2 arguments, no more and no fewer.
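A minimal sketch of this rule follows. The function full_name and its parameters are hypothetical; tryCatch() is used only so that the failing calls do not stop the script. Note that in this sketch the error for a missing argument is raised when the function body tries to use it.

```r
# A hypothetical function that expects exactly two arguments.
full_name <- function(first, last) {
  paste(first, last)
}

# Correct call: two arguments.
ok <- full_name("John", "Doe")
print(ok)

# Too few arguments: the missing 'last' triggers an error when it is used.
too_few <- tryCatch(full_name("John"), error = function(e) "error caught")

# Too many arguments: R reports an unused argument.
too_many <- tryCatch(full_name("John", "Doe", "Jr"), error = function(e) "error caught")

print(too_few)
print(too_many)
```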
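2.5 Default Parameter Value
A parameter can be given a default value, which is used when the function is called without that
argument.
# Definition reconstructed for illustration; the parameter name and message text are assumed
my_function <- function(country = "Norway") {
paste("I am from", country)
}
my_function("Sweden")
my_function("India")
my_function() # will get the default value, which is Norway
my_function("USA")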
2.6 Return Values
To let a function return a result, use the return() function:
my_function <- function(x) {
return (5 * x)
}
print(my_function(3))
print(my_function(5))
print(my_function(9))
Note: R also provides built-in functions such as sum(), max(), min() and seq().
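The built-in functions named above can be tried out as follows. This is a small sketch; the input values are chosen arbitrarily.

```r
# Arbitrary sample values for illustration.
values <- c(3, 9, 4)

total    <- sum(1:10)             # adds the numbers 1 to 10
largest  <- max(values)           # largest element
smallest <- min(values)           # smallest element
evens    <- seq(2, 10, by = 2)    # sequence from 2 to 10 in steps of 2

print(total)
print(largest)
print(smallest)
print(evens)
```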
Use the ls() function to list all the objects in the working environment.
> ls()
[1] "RectangleArea" "RectangleHeight" "RectangleWidth"
ls() is also useful for cleaning up the environment before running code. Pass its result to the
rm() function, as shown, to remove all objects.
> rm(list = ls())
> ls()
character(0)
Expressions
Look at a few arithmetic operations such as addition, subtraction, multiplication, division,
exponentiation, finding the remainder (modulus), integer division and computing the square root as
given in Table 3.1.
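The operations listed above correspond to the following R operators. This is a brief sketch with arbitrarily chosen operands a and b.

```r
# Arbitrary operands for illustration.
a <- 7
b <- 2

add_res  <- a + b      # addition
sub_res  <- a - b      # subtraction
mul_res  <- a * b      # multiplication
div_res  <- a / b      # division
pow_res  <- a ^ b      # exponentiation
mod_res  <- a %% b     # remainder (modulus)
idiv_res <- a %/% b    # integer division
root_res <- sqrt(16)   # square root

print(c(add_res, sub_res, mul_res, div_res, pow_res, mod_res, idiv_res, root_res))
```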
8. Control structures
Control statements are expressions used to control the execution and flow of the program based on
the conditions provided in the statements. These structures are used to make a decision after assessing
the variable.
In R programming, there are 8 types of control statements as follows:
1. if condition
2. if-else condition
3. for loop
4. nested loops
5. while loop
6. repeat and break statement
7. return statement
8. next statement
1. if condition
This control structure checks whether the expression provided in parentheses is true. If it is, the
statements in braces {} are executed.
Syntax:
if(expression){
statements
....
....
}
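The if syntax above, together with the if-else form from the list of control statements, can be sketched as follows. The variable score and the thresholds are hypothetical.

```r
# Hypothetical input value.
score <- 72

# if: the body runs only when the condition is TRUE.
flagged <- FALSE
if (score > 90) {
  flagged <- TRUE
}

# if-else: exactly one of the two branches runs.
if (score >= 50) {
  result <- "pass"
} else {
  result <- "fail"
}

print(flagged)
print(result)
```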
x <- 100
for(i in x){
print(i)
}
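The loop above iterates over a vector with a single element. As a sketch of the more typical case, the for loop below visits every element of a longer vector; the vector fruits and the helper name_lengths are hypothetical.

```r
# A hypothetical character vector to iterate over.
fruits <- c("apple", "banana", "cherry")

# Collect the length of each name while looping.
name_lengths <- integer(0)
for (fruit in fruits) {
  print(fruit)
  name_lengths <- c(name_lengths, nchar(fruit))
}

print(name_lengths)
```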
4. Nested loops
Nested loops are loops placed inside another loop. They are commonly used to traverse or
manipulate a matrix.
Example:
# Defining matrix
m <- matrix(2:15, 2)
for (r in seq(nrow(m))) {
for (c in seq(ncol(m))) {
print(m[r, c])
}
}
5. while loop
A while loop repeats its body as long as a condition remains true. The test expression is
checked before each execution of the loop body.
Syntax:
while(expression){
statement
....
....
}
Example:
x=1
# Print 1 to 5
while(x <= 5){
print(x)
x=x+1
}
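6. repeat and break statement
repeat runs a block of statements again and again until a break statement is reached.
Syntax:
repeat{
statements
....
if(expression){
break
}
}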
Example:
x=1
# Print 1 to 5
repeat{
print(x)
x=x+1
if(x > 5){
break
}
}
7. return statement
The return statement returns the result of a function and passes control back to the calling
function.
Syntax:
return(expression)
Example:
# Checks value is either positive, negative or zero
func <- function(x){
if(x > 0){
return("Positive")
}else if(x < 0){
return("Negative")
}else{
return("Zero")
}
}
func(1)
func(0)
func(-1)
8. next statement
The next statement skips the rest of the current iteration and moves on to the next iteration
without terminating the loop.
Example:
# Defining vector
x <- 1:10
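A minimal sketch of next, assuming we want to skip the even numbers in a vector of 1 to 10. The skipping rule and the helper vector printed are hypothetical choices for illustration.

```r
# Vector to loop over.
x <- 1:10

# Keep track of the values that were not skipped.
printed <- integer(0)
for (i in x) {
  if (i %% 2 == 0) {
    next              # skip even numbers and continue with the next iteration
  }
  print(i)
  printed <- c(printed, i)
}
```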
9. R Data Structures
9.1 Vector
2. The default increment with seq is 1. However, it also allows the use of increments other than 1.
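The default and custom increments of seq can be sketched as follows; the start, end and step values are arbitrary.

```r
# Default increment of 1.
s_default <- seq(1, 5)

# Custom increments via the 'by' argument.
s_by3     <- seq(1, 10, by = 3)
s_frac    <- seq(0, 1, by = 0.25)
s_reverse <- seq(10, 1, by = -3)

print(s_default)
print(s_by3)
print(s_frac)
print(s_reverse)
```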
B) rep function
The rep function repeats a value to fill a longer vector. The syntax is rep(z, k), which
creates a vector of k*length(z) elements consisting of k copies of z.
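A short sketch of rep with a single value and with a vector follows; the values are arbitrary.

```r
# Repeat a single constant three times.
r_single <- rep(5, 3)

# Repeat a vector of length 2 three times: the result has 3 * 2 = 6 elements.
z <- c(1, 2)
r_vector <- rep(z, 3)

print(r_single)
print(r_vector)
print(length(r_vector))
```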
C) Vector Access
Let us create a variable, V1, and assign to it a vector consisting of string values.
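A sketch of vector access using such a variable is given here. The string values stored in V1 are hypothetical, since the notes do not list them.

```r
# Hypothetical string values for V1.
V1 <- c("red", "green", "blue", "yellow")

first_item  <- V1[1]          # single element (indexing starts at 1)
middle      <- V1[2:3]        # a range of positions
picked      <- V1[c(1, 4)]    # specific positions
without_1st <- V1[-1]         # negative index drops that position

print(first_item)
print(middle)
print(picked)
print(without_1st)
```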
9.2 Matrices
9.3 Arrays
Compared to matrices, arrays can have more than two dimensions.
We can use the array() function to create an array, and the dim parameter to specify the dimensions:
Example:
# A vector with values ranging from 1 to 24, to be given dimensions with array()
thisarray <- c(1:24)
thisarray
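Example Explained
In the example above we create a vector with the values 1 to 24. Passing it to array() together with
dim=c(4,3,2) arranges those values into a three-dimensional array.
How does dim=c(4,3,2) work?
The first and second numbers in the bracket specify the number of rows and columns.
The last number specifies the size of the third dimension, that is, how many 4 x 3 matrices the
array holds.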
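The effect of the dim argument can be sketched as follows, assuming the values 1 to 24 are arranged with dim = c(4, 3, 2). The name multiarray is hypothetical.

```r
# Arrange the values 1 to 24 into 2 matrices of 4 rows and 3 columns each.
multiarray <- array(1:24, dim = c(4, 3, 2))

# Printing shows the two 4 x 3 matrices one after the other.
print(multiarray)

# dim() reports rows, columns and the number of matrices.
print(dim(multiarray))
```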
Access Array Items
You can access array elements by referring to their index positions inside [] brackets. The syntax
is as follows:
array[row position, column position, matrix level]
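The indexing syntax above can be tried on a small array. This sketch assumes an array built from the values 1 to 24 with dim = c(4, 3, 2); the name multiarray is hypothetical.

```r
# A 4 x 3 x 2 array filled column by column with 1 to 24.
multiarray <- array(1:24, dim = c(4, 3, 2))

# Element in row 2, column 3 of the second matrix.
one_value <- multiarray[2, 3, 2]

# Leaving a position empty selects everything along that dimension.
first_row  <- multiarray[1, , 1]   # row 1 of the first matrix
second_col <- multiarray[, 2, 2]   # column 2 of the second matrix

print(one_value)
print(first_row)
print(second_col)
```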
10. R packages
R packages are collections of R functions, compiled code and sample data. They are stored under a
directory called "library" in the R environment. By default, R installs a set of packages during
installation. More packages are added later, when they are needed for some specific purpose. When
we start the R console, only the default packages are available by default. Other packages which are
already installed have to be loaded explicitly to be used by the R program that is going to use them.
Below is a list of commands to be used to check, verify and use the R packages.
1. Get library locations containing R packages
.libPaths()
2. Get the list of all packages installed
library()
3. Get all packages currently loaded in the R environment
search()
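Loading an installed package explicitly, as described above, can be sketched as follows. The tools package is used because it ships with R but is not attached at start-up; the package name in the commented install.packages() line is only an example and needs network access.

```r
# Installing an add-on package (example name only; requires network access):
# install.packages("ggplot2")

# Names of all installed packages.
installed <- rownames(installed.packages())
has_tools <- "tools" %in% installed

# Load an installed package explicitly so its functions become available.
library(tools)

# The package now appears in the search path.
is_loaded <- "package:tools" %in% search()

print(has_tools)
print(is_loaded)
```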