
DevRes wk1-2

This document provides an introduction to performing basic statistical analysis in R. It demonstrates how to import data, examine variables, calculate summary statistics, and perform t-tests and chi-square tests. The document contains code examples for reading in data, recoding variables, plotting distributions and comparing means between groups.

Uploaded by

Faustina Prima
Copyright: © All Rights Reserved

# Welcome to Development Research!

# You will now learn how to write some code in R and perform some basic statistics!

###################################################### Quantitative Practical 1 [START] ##################################################

############################################## Part 1 Importing and looking at your data #########################################

# Note that using a # allows you to add comments to annotate your code. It's easy to write lots of lines of
# code and to lose track of why you did certain things. Keeping notes is generally a good idea.

# R works using commands. This means that you can write a line of code that tells R what to do.
# for example
1+2
# this tells R to calculate 1 + 2 (Yes! R will also do calculations like a calculator and this is a very useful feature)

# First, you will want to read in your data. R allows you to read and write many different data types.
# These include files from Excel, data files commonly used in ArcGIS (e.g. .dbf), and files from other statistical
# software, including SPSS and Stata. Some of these features are available in base R; others require separate packages.
# We will use .csv files, which are easily transferable and can be saved and opened in Excel.

data <- read.csv(file.choose()) # this line of code uses two functions: read.csv(), which reads in the data,
# and file.choose(), which lets you select your file using a file browser.

# This command allows you to see the first few lines of data in your dataset
head(data)

# Look at the variable "weight..kg.". R doesn't like brackets () in column names: read.csv() replaces characters that are not valid in names with dots.
# Let's fix this and rename the weight and height variables
colnames(data)[colnames(data) == "weight..kg."] <- "weight"
colnames(data)[colnames(data) == "height..m."] <- "height"
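
# A minimal sketch of where a name like "weight..kg." can come from, assuming the original header was "weight (kg)".
# The toy file and the names toy_path, toy and toy_raw are made up for illustration and are not part of the Pokemon dataset.

```r
# write a tiny CSV with bracketed headers to a temporary file
toy_path <- tempfile(fileext = ".csv")
writeLines(c("pokemon,weight (kg),height (m)",
             "Pidgey,1.8,0.3",
             "Rattata,3.5,0.3"), toy_path)

# by default read.csv() converts the headers into valid R names
toy <- read.csv(toy_path)
names(toy)

# check.names = FALSE keeps the headers exactly as they are written in the file
toy_raw <- read.csv(toy_path, check.names = FALSE)
names(toy_raw)
```

# Keeping the raw headers is possible, but names with spaces and brackets then need backticks, so renaming is usually easier.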

# Let's try and get some summary statistics


summary(data)

# note that the categorical variables have been imported as "characters" and we need to change these to be factors so that R can identify them as categorical variables
# let's re-assign them
data$pokemon <- as.factor(data$pokemon)
data$type <- as.factor(data$type)
data$sex <- as.factor(data$sex)
data$surface <- as.factor(data$surface)
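
# The four conversions above can also be written in one step. This is a minimal sketch on a made-up data frame;
# toy_df and toy_cols are hypothetical names, and the same pattern should work on the real columns.

```r
# a small data frame with two text columns and one numeric column
toy_df <- data.frame(pokemon = c("Pidgey", "Weedle", "Pidgey"),
                     type = c("normal", "bug", "normal"),
                     HP = c(30, 25, 41),
                     stringsAsFactors = FALSE)

# list the categorical columns once, then convert them all with lapply()
toy_cols <- c("pokemon", "type")
toy_df[toy_cols] <- lapply(toy_df[toy_cols], as.factor)

# str() shows the type of every column; levels() lists the categories of a factor
str(toy_df)
levels(toy_df$type)
```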

# Let's run the summary function again to see what has happened
summary(data)
# let's also check the column names to confirm that renaming weight and height worked
names(data)

# now let's calculate the mean and the standard deviation for pokemon height and weight
# this command lets you calculate the mean for Pokemon weight in our sample (note that we specify the dataset [data] and the variable [weight],
# and we use $ to do this).
mean(data$weight)

# this command lets you calculate the standard deviation


sd(data$weight)

# let's calculate the standard error, which you'll remember is the standard deviation divided by the square root of the sample size
# Note that nrow() returns the number of rows of the data frame given in brackets. This is your sample size, or n
sd(data$weight)/sqrt(nrow(data))

# we can also create an object and assign it a value or the result of a calculation (it will also appear in the Environment panel)
weight_sd <- sd(data$weight)

# if you run that command you'll get the value in the console
weight_sd

# we can also create an object for the square root of the sample size
sqrt_n <- sqrt(nrow(data))

# and we can now divide the standard deviation by the square root of n
weight_sd/sqrt_n
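
# The same calculation can be wrapped in a small function. This is a minimal sketch; toy_se and toy_weights are
# hypothetical names, and the function assumes that missing values (NA) should simply be dropped.

```r
# standard error of the mean, ignoring missing values
toy_se <- function(x) {
  x <- x[!is.na(x)]          # drop missing values first
  sd(x) / sqrt(length(x))    # n is the number of non-missing values
}

toy_weights <- c(2, 4, 4, 4, 5, 5, 7, 9, NA)
toy_se(toy_weights)

# dividing by the total number of values instead counts the NA as an observation
sd(toy_weights, na.rm = TRUE) / sqrt(length(toy_weights))
```

# With no missing values both versions agree, which is why nrow(data) works as n for a complete column.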

# We can also use a package with an in-built function


# first we need to install the package
install.packages("plotrix")

# Now we need to load (attach) the package


library(plotrix)

# now we can use the std.error() function


std.error(data$weight)

# now let's calculate the mean weight for Pokemons caught in the park (natural surfaces)
# note that the first part of the command is the same as above - the part in square brackets selects the surface
mean(data$weight[which(data$surface == "natural")])

# now let's calculate the mean weight for Pokemons caught in the park that are also bug types
mean(data$weight[which(data$surface == "natural" & data$type == "bug")])
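
# A minimal sketch of what which() adds when subsetting, using made-up vectors (toy_w and toy_s are hypothetical names).
# It assumes one value of the grouping variable is missing.

```r
toy_w <- c(2, 4, 6, 8)
toy_s <- c("natural", NA, "built", "natural")

# a logical index containing NA selects an NA element, so the mean is NA
toy_mean_logical <- mean(toy_w[toy_s == "natural"])
toy_mean_logical

# which() keeps only the positions that are TRUE, so the NA is skipped
toy_mean_which <- mean(toy_w[which(toy_s == "natural")])
toy_mean_which
```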

# What is the mean for bug types on built surfaces?

# Generating a table to calculate all these sub classifications in individual steps is very time consuming
# Again, we can speed up the process by using a package
# Let's install another package
install.packages("doBy")

# and load the package


library(doBy)

# let's use the summaryBy() function to calculate the mean height and weight of Pokemons by surface type
# note that FUN = mean tells summaryBy() to calculate the mean; you could equally ask it
# to calculate the standard deviation
summaryBy(weight + height ~ surface, FUN = mean, data = data)
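
# For comparison, a group mean table can also be built without an extra package. This is a minimal sketch using
# aggregate() from base R on a made-up data frame; toy_pokemon and toy_means are hypothetical names.

```r
toy_pokemon <- data.frame(
  surface = c("built", "built", "natural", "natural"),
  weight  = c(2, 4, 6, 10),
  height  = c(0.2, 0.4, 0.5, 0.7)
)

# cbind() on the left lists the variables to summarise; the right side gives the grouping variable
toy_means <- aggregate(cbind(weight, height) ~ surface, data = toy_pokemon, FUN = mean)
toy_means
```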

# we can also get summaryBy() to run several calculations at once by passing it a function


# note that the function returns a named vector, so each statistic gets its own column
summaryBy(weight + height ~ surface,
          data = data,
          FUN = function(x) { c(mean = mean(x), sd = sd(x), sum = sum(x)) }
)

# we can also get summaryBy() to calculate statistics for both surface and pokemon type
# this time we are also going to assign the result to an object
# note that here we are also using std.error() from the package plotrix
# you could read what summaryBy() does as: summarise weight and height as a function of surface and pokemon type
summary.table <- summaryBy(weight + height ~ surface + type,
                           data = data,
                           FUN = function(x) { c(mean = mean(x), sd = sd(x), se = std.error(x)) }
)
summary.table

# Why are you getting NA in some rows?


# You can check your data by double clicking on the data icon (the one with the little blue arrow) in the "Environment" tab in the upper
# right hand corner. How many poison types did we find in built areas?
# R cannot calculate a standard deviation (or a standard error) from a single row of data, so those groups show NA.
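
# A minimal sketch of this behaviour with made-up numbers (toy_single and toy_pair are hypothetical names):

```r
# one observation: there is no spread to measure, so sd() returns NA
toy_single <- sd(c(42))
toy_single

# two observations are enough to get a value
toy_pair <- sd(c(42, 44))
toy_pair
```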

# now let's save this table so that we can use it later


# note that we first used read.csv(); now we will use write.csv()
# change the path below to a folder that exists on your own computer
write.csv(summary.table, "/Users/user/Desktop/PokemonTable.csv")
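
# A minimal sketch of writing a table and reading it back, assuming you want a path that works on any computer.
# toy_table, toy_file and toy_back are hypothetical names, and the file goes to R's temporary folder rather than the Desktop.

```r
toy_table <- data.frame(surface = c("built", "natural"), weight.mean = c(3, 8))

# file.path() builds the path with the right separators for your operating system
toy_file <- file.path(tempdir(), "ToyPokemonTable.csv")

# row.names = FALSE stops write.csv() from adding an extra column of row numbers
write.csv(toy_table, toy_file, row.names = FALSE)

# read the file back in to check what was saved
toy_back <- read.csv(toy_file)
toy_back
```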

# One of the things that we might also want to do is get some statistics for ranges of values
# First we will look at the range of values for weight in our data
# Notice that this function gives us the min. and the max. as well as the mean and the median, and the quartiles
summary(data$weight)

# Now let's plot a frequency distribution for HP


# Note that hist() generates a histogram, which shows how many Pokemons fall within each range of HP values
hist(data$HP)

# Now let's try and compare the frequency distributions of HP between the two surfaces
# We are going to use a new package called "ggplot2", which is one of the most powerful packages for data visualisation
# Let's install and load the package first
install.packages("ggplot2")
library(ggplot2)

# now let's graph the distribution for HP


ggplot(data, aes(HP)) + # this part tells ggplot the data and variables you are selecting to plot; note that here you are only graphing one variable, on the x axis
  geom_histogram(bins = 30) # this part tells ggplot what kind of plot you want to make and sets the number of bars you want to draw

# now let's graph the distribution of HP for each of the surfaces that we have
# note that in this instance, we have divided the dataset and are drawing two separate histograms
# we can colour them and also change the transparency. In this case, we make the green bars more transparent so that you can see the grey bars behind them
# we can also add the axis labels
ggplot() +
  geom_histogram(data = data[which(data$surface == "built"),], aes(HP), bins = 6, fill = "grey") +
  geom_histogram(data = data[which(data$surface == "natural"),], aes(HP), bins = 7, fill = "green", alpha = 0.3) +
  xlab("Pokemon HP") +
  ylab("Number of Pokemons") +
  theme_bw()
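
# A possible alternative to drawing two separate layers: map the fill colour to the surface variable and let ggplot split the data.
# This is a minimal sketch on made-up values; toy_hp and toy_plot are hypothetical names, and it assumes ggplot2 is installed.

```r
library(ggplot2)

toy_hp <- data.frame(
  HP = c(20, 25, 30, 35, 40, 45, 50, 55),
  surface = rep(c("built", "natural"), each = 4)
)

# position = "identity" overlays the two groups instead of stacking them
toy_plot <- ggplot(toy_hp, aes(HP, fill = surface)) +
  geom_histogram(bins = 5, alpha = 0.5, position = "identity") +
  scale_fill_manual(values = c(built = "grey", natural = "green")) +
  xlab("Pokemon HP") +
  ylab("Number of Pokemons") +
  theme_bw()
toy_plot
```

# With this approach both groups share the same bins and a legend is added automatically.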

# another way of graphing distributions is to use density plots


# let's compare the distribution of CP between male and female Pokemons
ggplot() +
  geom_density(data = data[which(data$sex == "male"),], aes(CP), colour = "grey") +
  geom_density(data = data[which(data$sex == "female"),], aes(CP), colour = "green") +
  xlab("Pokemon CP") +
  ylab("Density") +
  theme_classic() # we can change the "theme" of graphs too

############################################## Part 2 Comparing means (t test) #########################################

# let's calculate the t-statistic by hand


# first we calculate the mean HP for each of the two surfaces
m_built <- mean(data$HP[which(data$surface == "built")])
m_natural <- mean(data$HP[which(data$surface == "natural")])

# and now the variance


s_built <- var(data$HP[which(data$surface == "built")])
s_natural <- var(data$HP[which(data$surface == "natural")])

# we can use the summary function to provide us with some summary statistics, including the number of datapoints (length) in each group
# how many data points do we have in each group?
summary(data)

# so according to the formula for the t-statistic (here 18 and 40 are the numbers of data points in the built and natural groups)


(m_built - m_natural) / sqrt((s_built/18) + (s_natural/40))

# now let's compare that with the built-in t.test() function in {stats}
t.test(HP ~ surface, data = data)
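
# A minimal sketch of the same comparison with the group sizes taken from the data instead of typed in by hand.
# The values in toy_t are made up, and hp_built, hp_natural, t_by_hand and t_builtin are hypothetical names.

```r
toy_t <- data.frame(
  HP = c(30, 35, 41, 28, 50, 44, 39, 47, 52),
  surface = c(rep("built", 4), rep("natural", 5))
)

hp_built <- toy_t$HP[toy_t$surface == "built"]
hp_natural <- toy_t$HP[toy_t$surface == "natural"]

# difference in means divided by the square root of (variance / n) summed over both groups
t_by_hand <- (mean(hp_built) - mean(hp_natural)) /
  sqrt(var(hp_built) / length(hp_built) + var(hp_natural) / length(hp_natural))

# by default t.test() does not assume equal variances (Welch's t-test), which matches the formula above
t_builtin <- t.test(HP ~ surface, data = toy_t)$statistic

t_by_hand
t_builtin
```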

# is there a statistically significant difference in Pokemon HP between surfaces?


# can you run another t-test to compare other groups?

############################################## Part 3 Comparing frequencies (Chi Square test) #########################################

### Chi Square Analysis


# let's try and see whether there's a difference in the frequency of male and female Pokemon between surfaces
# let's generate a summary table for our analysis; FUN = length counts the number of rows in each surface and sex group
summaryBy(type ~ surface + sex, FUN = length, data = data)

# Now let's generate the contingency table


# the table is what we call a matrix; byrow = TRUE fills it row by row so that it matches the layout below
#             f    m
C <- matrix(c(85,  90,   # built
              141, 209), # natural
            nrow = 2, byrow = TRUE)

# what does our table look like


C
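
# Instead of typing the counts into a matrix, a contingency table can be built directly from two categorical columns.
# This is a minimal sketch on made-up rows; toy_counts and toy_tab are hypothetical names.

```r
toy_counts <- data.frame(
  surface = c("built", "built", "natural", "natural", "natural", "built"),
  sex = c("f", "m", "f", "m", "m", "m")
)

# table() counts how many rows fall into each combination of surface and sex
toy_tab <- table(toy_counts$surface, toy_counts$sex)
toy_tab
```

# A table like this can be passed to chisq.test() in the same way as a matrix.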

# Now with a pen and paper let's calculate the chi-square statistic; you can use R as a calculator if you wish

# And now, let's run the Chi Square test using R's built-in function
chisq.test(C, correct = FALSE)
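
# A minimal sketch of the pen and paper calculation done in R, using the same counts as the practical.
# toy_C, toy_expected, toy_chi, toy_dof, toy_p and toy_builtin are hypothetical names chosen so that C is left untouched.

```r
toy_C <- matrix(c(85, 90, 141, 209), nrow = 2, byrow = TRUE,
                dimnames = list(c("built", "natural"), c("f", "m")))

# expected count for each cell = row total * column total / grand total
toy_expected <- outer(rowSums(toy_C), colSums(toy_C)) / sum(toy_C)

# chi-square statistic = sum over cells of (observed - expected)^2 / expected
toy_chi <- sum((toy_C - toy_expected)^2 / toy_expected)

# degrees of freedom = (rows - 1) * (columns - 1), then the p-value from the chi-square distribution
toy_dof <- (nrow(toy_C) - 1) * (ncol(toy_C) - 1)
toy_p <- pchisq(toy_chi, df = toy_dof, lower.tail = FALSE)

# the built-in test without continuity correction should give the same statistic and p-value
toy_builtin <- chisq.test(toy_C, correct = FALSE)
toy_chi
toy_p
toy_builtin
```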

###################################################### Quantitative Practical 1 [END] ##################################################
