R1 Guideline Session1 Part2
To open a dataset, there are two options: (1) the dataset is saved in an R package, or
(2) the dataset is a .csv file already saved on your computer. Regarding the second option,
RStudio can read other file formats such as .xls (Excel), but in our case we will always be
using .csv files.
Code 1.6
# Install the yarrr package
install.packages('yarrr')
library('yarrr')
Next, we’ll look at the help menu for the pirates dataset using help(pirates). When you
run this, you should see a small help window open up in RStudio with some
information about the dataset, such as the number of observations, the variables, and the
source of the data.
Code 1.7
# Information about the dataset
help(pirates)
First, let’s take a look at the dataset using the head() function. This will show you the
first few rows of the data.
DAE: R1
Code 1.8
# Look at the first few rows of the data
head(pirates)
You can look at the names of the columns in the dataset with the names() function.
Code 1.9
# What are the names of the columns?
names(pirates)
Finally, you can also view the entire dataset in a separate window using
the View() function:
Code 1.10
# View the entire dataset in a new window
View(pirates)
If you want to save this dataset on your computer as a .csv file, you need to tell RStudio
the working directory where you want to save it.
To save the data we write write.csv(pirates, file="pirates.csv"). Once the file is
saved, we can load it by typing read.csv("pirates.csv"). The package that allows us to
manipulate .csv files is readr. We do not need to install it because it is included in tidyverse,
which we installed in previous sections. However, we need to install a new package if we
want to import data in .xls or .xlsx formats (Excel files). The package is readxl. Once installed,
we can read Excel files by typing read_excel("file.xlsx").
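For instance, a minimal sketch of importing an Excel file with readxl (the file name and path here are just examples, not files provided with the course):

```r
# Install and load readxl (installation is only needed once)
install.packages("readxl")
library(readxl)
# Hypothetical file name: replace with your own .xlsx file
mydata <- read_excel("C:/Econometrics/R1/mydata.xlsx")
head(mydata)
```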
Code 1.11
# Load the tidyverse and yarrr packages so I can use them!
# If already installed from CRAN
library("tidyverse")
library("yarrr")
setwd("C:/Econometrics/R1")
write.csv(pirates, file="pirates.csv")
Now, you can go to this directory on your laptop and check that the pirates.csv file has been
created and saved in the specified working directory.
Code 1.12
# Load the tidyverse package so I can use it!
# If already installed from CRAN
library("tidyverse")
setwd("C:/Econometrics/R1")
pirates<-read.csv("C:/Econometrics/R1/pirates.csv")
In fact, in the Environment/History window of RStudio a new object is created with the data.
If you double-click on that object, where the number of observations and variables is
indicated, the data opens for you to have a look at it. Notice that the name of the
dataset is pirates.
Now, you can run code similar to what we ran before to perform a first exploration
of the data. Note that the dataset loaded from the .csv file is again called pirates, so the
commands are unchanged:
Code 1.13
##Preliminary inspection of the data
head(pirates)
names(pirates)
View(pirates)
# Type of data
class(pirates)
1.1.5 Debugging:
When you are programming, you will always, and I do mean always, make errors (also
called bugs) in your code. You might misspell a function, include an extra comma, or some
days…R just won’t want to work with you.
Debugging will always be a challenge. However, over time you’ll learn which bugs are the
most common and get faster and faster at finding and correcting them.
Here are the most common bugs you’ll run into as you start your R journey.
• R is waiting for you to finish a command
If you run an incomplete command, for example one with a missing closing parenthesis,
R will think you are not done and will show a + sign in the console while it waits for the rest.
Thankfully, there is an easy solution to this problem: just hit the Escape key on your
keyboard. This will cancel R’s waiting state and make it ready again!
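As a sketch, the mistake and its resolution look like this:

```r
# Typing an unfinished command such as
#   mean(c(1, 4, 2)
# (missing closing parenthesis) makes the console show "+" and wait.
# Press Escape to cancel the waiting state, then run the complete command:
mean(c(1, 4, 2))
```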
• Misspelled objects and functions
In the code below, I’ll try to take the mean of a vector called data, but I will misspell the
function mean().
R is case-sensitive, so if you don’t use the correct capitalization you’ll receive an error.
In the code below, I’ll use Mean() instead of the correct version, mean(). Here is the
correct version, where both the object data and the function mean() are correctly
spelled:
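A sketch of the mistake and the correction:

```r
data <- c(1, 4, 2)
# Mean(data)   # Error: could not find function "Mean"
mean(data)     # correct capitalization works fine
```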
• Punctuation problems
Another common error is having bad coding “punctuation”. By that, I mean having an
extra space, missing a comma, or using a comma (,) instead of a period (.). In the code
below, I’ll try to create a vector using periods instead of commas:
Because I used periods instead of commas, I get an error. Here is the correct
version:
# Incorrect (error): mean(c(1. 4. 2))
# Correct
mean(c(1, 4, 2))
## [1] 2.333333
If you include an extra space in the middle of the name of an object or function, you’ll
receive an error. In the code below, I’ll accidentally write Chick Weight instead
of ChickWeight
Because I had an extra space in the object name, I get an error. Here is the
correction:
# Incorrect (error): head(Chick Weight)
# Correct:
head(ChickWeight)
Once we know a few things about how the software works, we can start thinking of
describing the data. The objective of a descriptive analysis is to summarize the data,
extracting features and the most relevant information for the question under study.
First, we will synthesize the information of the variables of interest and, second, we will get
a first idea of the relationships between variables. For this purpose, we will use figures
(histogram or scatter plot) and descriptive statistics (mean, variance or standard deviation,
and correlations).
Recall that one of the steps in an econometric analysis is to perform a descriptive analysis,
that is, computing summary statistics, performing a graphical analysis and correlation
analysis.
Let’s use our pirates dataset containing the results of a survey of 1,000 pirates, including
18 variables, both quantitative and qualitative. Therefore, this is a cross-sectional
analysis in which the variation in the data comes from variation across different
individuals (our pirates) at a single point in time.
The main purpose of the analysis is to discuss the main determinants that help to explain
the number of treasure chests each pirate found, which is our variable of interest. We will
use the rest of the variables contained in the dataset as possible determinants.
First, you need to go to BLACKBOARD (R1 SESSION folder) and save both the pirates
data (as a .csv file) and the pirates script (as an .R file) on your laptop. I strongly recommend
you create a folder in the C: directory of your laptop called Econometrics and, within it,
another folder called R1. In the R1 folder you can save the data and script taken from Blackboard.
By doing this, you will be able to use the same working directory that is written in the
scripts without the need to change it. Additionally, you can save the remaining datasets
and scripts (property and hdi, also available on Blackboard) in the same R1 folder, since all
the scripts are going to use the same working directory.
Figure 1.7. Creating your working directory and saving the information on it
Next, open RStudio and, to open the pirates script, go to FILE and Open File in the
top-left part of the RStudio interface. Search for the pirates_script on your laptop, click on it,
and the script will appear on your RStudio screen.
Once you have the script in your RStudio interface, we can start working on the R1 session. In
the very first part of the code you have the packages to be installed and loaded for this
first session. Once you run this part of the code, the next step is setting the working directory and
opening the data (pirates.csv). Finally, once you have the data in RStudio, it is time to start
a preliminary inspection of the data we are going to use.
Code 1.14
# Load the tidyverse package so I can use it!
# If already installed from CRAN
library("tidyverse")
setwd("C:/Econometrics/R1")
pirates<-read.csv("C:/Econometrics/R1/pirates.csv")
Code 1.15
##Preliminary inspection of the data
head(pirates)
glimpse(pirates)
names(pirates)
View(pirates)
# Type of data
class(pirates)
is.na(pirates)
sum(is.na(pirates))
To perform this first inspection of the data we use functions such as head(), glimpse(),
names(), class() and is.na(), which allow us to get an idea of the information that we have and
to check that we do not have missing values. In the console window you will obtain the main
results of the above code, which are presented in the next figure (Figure 1.7). As we can see,
some variables are quantitative, such as age, height, weight, tattoos, and our variable of
interest tchests, among others, while other variables are qualitative, such as sex, headband
and college.
It is important to notice here that the descriptive analysis of our dataset (summary statistics,
graphical and correlation analysis) is produced only for quantitative variables. Therefore, we
first need to restrict our dataset to quantitative variables, creating a new data frame.
In order to perform the descriptive analysis, we need to restrict our data frame to numeric
vectors, creating a new dataset (with the name pirates_restricted). For the sake of
simplicity, we will select only the following quantitative variables: tchests (number of treasure
chests found by the pirate, our variable of interest), age (the age of the pirate), height (height
in cm), weight (weight in kg) and tattoos (number of tattoos the pirate has). Let's do it by
typing the following code using the select() function:
Code 1.16
# Restrict or build a new dataset containing only quantitative variables
# or a set of variables of the original dataset
pirates_restricted <- select(pirates, tchests, age, height, weight, tattoos)
# Save the new dataset as a .csv file in the working directory
write.csv(pirates_restricted, file="pirates_restricted.csv", row.names=FALSE)
In this way the object pirates_restricted is a new data frame with the variables tchests, age,
height, weight and tattoos. All these variables are numeric.
Note that, from now on, we will work with the pirates_restricted dataset.
After a first exploration of the data using the previous functions such as head(), View() or class(),
and the creation of our new dataset containing only quantitative variables, we now turn to
how to compute summary statistics, such as the mean, variance, maximum and minimum values,
for all the variables of our dataset, so that we can summarize the information in a few numbers.
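As a sketch with a toy vector (the numbers are made up for illustration), each statistic can be computed one at a time:

```r
# Toy data: ages of four hypothetical pirates
x <- c(11, 25, 27, 46)
mean(x)          # average
var(x)           # variance
sd(x)            # standard deviation
min(x); max(x)   # minimum and maximum
```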
You can compute different summary statistics separately for each of the variables in your
dataset. But this is very time-consuming if you have a dataset with many variables. If you
want to create a table of basic summary statistics for all the variables in your dataset, a useful
function is summary():
Code 1.17
# Basic summary statistics
summary(pirates_restricted)
Now you know how to interpret those summary statistics. For example, the average pirate
in our sample is 170cm tall and weighs 69kg, has around 9 tattoos, and found almost 23 treasure
chests. The youngest pirate is 11 years old. Additionally, the tallest pirate in our sample is 209cm,
and the pirate most successful in finding treasure chests found 147 of them.
However, in this table there are no dispersion measures, such as the variance, the standard
deviation or the coefficient of variation, that we also discussed in class.
To get a more detailed summary statistics table, we first need to install a new package (pastecs)
that contains a function (stat.desc()) that computes more summary statistics for you, so that
you can infer more information from the data.
Code 1.18
# Install the pastecs package
install.packages('pastecs')
library('pastecs')
# Summary statistics
stat.desc(pirates_restricted)
As you can see, the results for the min., max. and mean values are the same as the ones
obtained before and, therefore, you can draw the same conclusions. In addition, you get
information about dispersion measures (variance, standard deviation and coefficient of
variation), which lets you infer more information about your data.
For example, it seems that the most dispersed variable is tchests (the one with the
largest coefficient of variation), while the least dispersed is age (the one with the lowest
coefficient of variation). Note that I am using the CV instead of the variance or the standard
deviation because the variables are measured in different units.
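A minimal sketch of the coefficient of variation with toy numbers (the helper function cv() is ours, not part of pastecs; pastecs reports it directly):

```r
# Coefficient of variation: sd divided by the mean (a unit-free measure)
cv <- function(x) sd(x) / mean(x)
heights <- c(160, 170, 180)   # toy heights in cm
weights <- c(60, 70, 80)      # toy weights in kg
cv(heights)
cv(weights)   # relatively more dispersed than heights, despite equal sd
```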
Graphical methods of describing the data make use of our ability to process and interpret
even very large amounts of visual information, at least as long as it is well presented. Such
graphs are at their most powerful when they summarize comparisons and associations
between variables. RStudio provides a wide variety of graphs, such as bar charts (a one-way
table of frequencies for a variable) and histograms (similar to a bar chart but for continuous
variables).
In Econometrics, we can use the above graphs as well. However, as the major objective in
Econometrics is the study of the association between variables, the above graphs are not
very informative. The graph most commonly used in Econometrics is the scatter plot.
A scatter plot is a graph using Cartesian coordinates to display the values of, typically, two
variables for a set of data. It shows how much one variable is affected by another by plotting
the two variables on the two axes.
Now let’s make a scatter plot with RStudio. We’ll plot the relationship between treasure
chests (our variable of interest) and the pirate’s age using the plot() function. Remember to
place the variable of interest on the y-axis and the explanatory one on the x-axis:
Code 1.19
# Create scatterplot: tchests versus age
plot(pirates_restricted$age, pirates_restricted$tchests)
The scatter plot allows us to identify (1) the shape of the relationship between the variables,
linear or not. Additionally, we can identify (2) the type of association. There is a
positive linear relationship between two variables when increasing 𝑥 increases the average
value of 𝑦. We say that there is a negative linear relationship between two variables when we
observe that increasing 𝑥 decreases the average value of 𝑦. We can also infer (3) the strength
of the relationship, depending on how dispersed (weak) or concentrated (strong) the data
points are in the scatter plot. Therefore, we can conclude that there seems to be a
positive association between the variables (as the age of the pirate increases, the number of
treasure chests also seems to increase; experience may play a role). However, the association
does not seem to be very strong (very dispersed observations). Additionally, the relationship
might not be linear, as there is a point at which the positive relationship turns negative
(very old pirates will not be very successful).
You can save the scatter plot as an image if you want by clicking on the EXPORT box
(see blue circle in Figure 1.12).
Now let’s make a fancier version of the same plot by adding some customization:
Code 1.20
# Create fancier scatterplot: tchests versus age
# (title, axis labels and point style are illustrative choices)
plot(pirates_restricted$age, pirates_restricted$tchests,
     main="Treasure chests and age", xlab="Age", ylab="Treasure chests",
     pch=16, col="gray50")
By writing the above code you have added a title for the scatter plot, labels for the
variables being plotted, and a fancier visualization of the data points.
Now let’s make it even better by adding gridlines and a blue regression line to measure
the strength of the relationship.
Code 1.21
# Create scatterplot with regression line: tchests versus age
plot(pirates_restricted$age, pirates_restricted$tchests,
     main="Treasure chests and age", xlab="Age", ylab="Treasure chests")
grid()                                                           # add gridlines
abline(lm(tchests ~ age, data=pirates_restricted), col="blue")   # regression line
As we can see through the regression line, the association between these two variables is
not very strong which is consistent with what we were discussing before in the figure
without the blue regression line.
You can plot as many scatter plots as you want. For example, let’s now plot the weight and
height variables:
Code 1.22
# Create scatterplot: weight versus height
plot(pirates_restricted$height, pirates_restricted$weight)
In this case, and according to the above plot, we instead obtain a positive, very strong
and clear linear relationship between these two variables. This makes sense in real life, since
you should expect height and weight to be strongly and positively associated.
Additionally, if the dataset contains few variables, it can be useful to plot all the possible
relationships that may exist between your variables at the same time, getting first insights
about them in a scatter plot matrix.
Code 1.23
# Create scatterplot matrix
pairs(~pirates_restricted$tchests+ pirates_restricted$age+
pirates_restricted$weight+ pirates_restricted$height+
pirates_restricted$tattoos, main="Simple Scatterplot Matrix")
Finally, a more powerful graphical function that you can use is the ggplot() function. This
function allows you to produce scatter plots in two different ways: (1) using a linear
approximation or (2) smoothing the data. The latter is very useful to identify much better
whether a relationship can be considered non-linear and thus to spot a potential non-linearity
problem. Additionally, the smooth method allows you to see in which part of the
diagram the fit is better or worse. The following code shows how to use ggplot() with both
approximations, and you will obtain Figure 1.17:
Code 1.24
# Create scatter plots with the ggplot() function using linear and smooth fits
ggplot(pirates_restricted, aes(x=age, y=tchests)) +
  geom_point() +
  geom_smooth(method="lm")      # (1) linear approximation
ggplot(pirates_restricted, aes(x=age, y=tchests)) +
  geom_point() +
  geom_smooth(method="loess")   # (2) smoothing the data
When comparing both graphs, it seems that the relationship is linear, as in the second one
(smooth) the approximation is close to a line. However, according to the second
graph (smooth), we can see that the fit is poorer at the extremes.
A unit-less measure used to examine the relationship between variables is the linear
correlation coefficient. Correlation can reveal not only the direction (as the covariance) but
also the strength of the linear relationship between random variables.
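As a quick sketch with toy numbers (made up for illustration), the correlation coefficient is just the covariance scaled by the two standard deviations:

```r
# Toy data, just for illustration
x <- c(1, 2, 3, 4)
y <- c(2, 4, 5, 8)
cov(x, y) / (sd(x) * sd(y))   # correlation computed by hand
cor(x, y)                     # same value
```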
You can ask RStudio to compute the correlation matrix for you using the cor() function:
Code 1.25
# Create correlation matrix
cor(pirates_restricted)
Recall that the correlation matrix is symmetric, so you just need to look at the values above
the main diagonal (whose entries are all equal to 1).
According to the above we can conclude that, for example, the variable with the strongest
correlation with our tchests variable is age, with a correlation coefficient of 0.19. This means a
positive but weak correlation, which is consistent with our previous graphical analysis (see
Figure 1.11). The rest of the correlations with the tchests variable are close to zero, meaning
those variables are essentially uncorrelated with it.
Additionally, we want the correlation between explanatory variables to be as weak as possible
to avoid multicollinearity issues, which we will discuss in the future. According to the above
correlation matrix, the only potential multicollinearity problem is between the weight and
height variables, since their correlation is positive and very strong (0.931).
Now let’s make a fancier version of the same correlation matrix by adding some
customization. The function corrplot(), in the package of the same name, creates
a graphical display of a correlation matrix, highlighting the most correlated variables in a
data table.
In this plot, correlation coefficients are coloured according to their value. The correlation
matrix can also be reordered according to the degree of association between variables.
Code 1.26
# Create fancy correlation matrix
install.packages("corrplot")
library(corrplot)
M <- cor(pirates_restricted)
corrplot(M)
Positive correlations are displayed in blue and negative correlations in red. Colour
intensity and the size of the circle are proportional to the correlation coefficients. On the
right side of the correlogram, the legend shows the correlation coefficients and the
corresponding colours.
By comparing the Scatter Plot Matrix in Figure 1.12 and the correlogram in Figure 1.14, you
will see that both analyses (graphical and correlation) are consistent.
FINAL NOTE
When running codes in RStudio, you may want to save your results. For example, you may
want to save the summary statistics table, the correlation matrix and the graphs you produce.
Code 1.27
# Create tables and save them in txt in your working directory
#Summary Statistics
sumstat_table_pirates<-summary(pirates_restricted)
write.table(sumstat_table_pirates, file = "sumstats_table_pirates.txt",
sep = ",", quote = FALSE, row.names = F)
#Correlation matrix
corr_table_pirates<-cor(pirates_restricted)
write.table(corr_table_pirates, file = "corr_table_pirates.txt", sep =
",", quote = FALSE, row.names = F)
And the sumstats_table_pirates.txt and corr_table_pirates.txt files will end up in your working directory.
For the graphs, please use the Export menu in the plots window and you can save your
plots as images or as pdf files in your working directory.
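Alternatively, plots can also be saved from code with png() and dev.off() (a sketch; the file name and the toy data are just examples):

```r
# Open a .png graphics device, draw the plot, then close the device,
# which writes the file to the working directory
png("scatter_example.png")           # hypothetical file name
plot(c(20, 30, 40), c(10, 20, 15))   # toy data for illustration
dev.off()
```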