# DevRes wk1-2
# You will now learn how to write some code in R and to perform some basic statistics!
# Note that using a # allows you to add comments to annotate your code. It's easy to write lots of lines of
# code and to lose track of why you did certain things. Keeping notes is generally a good idea.
# R works using commands. This means that you can write a line of code that tells R what to do.
# for example
1+2
# this tells R to calculate 1 + 2 (Yes! R will also do calculations like a calculator and this is a very useful feature)
# First, you will want to read in your data. R allows you to read and write many different data types.
# This includes files from EXCEL, data files commonly used in ArcGIS (e.g. .dbf), and files from other statistical
# packages, including SPSS and STATA. Some of these features are available in base R; others require separate packages.
# We will use .csv files, which are easily transferable and can be saved in and opened with EXCEL.
data <- read.csv(file.choose()) # this line of code includes two commands: read.csv(), which reads in the data,
# and file.choose(), which lets you select your file using a file browser.
# This command allows you to see the first few lines of data in your dataset
head(data)
# Look at the variable "weight..kg.". R does not allow brackets () or spaces in column names, so read.csv() has replaced them with dots.
# Let's fix this and rename the weight and height variables
colnames(data)[colnames(data) == "weight..kg."] <- c("weight")
colnames(data)[colnames(data) == "height..m."] <- c("height")
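# A quick illustration of where names like "weight..kg." come from. This is a minimal sketch, assuming the
# original spreadsheet header was "weight (kg)" (an assumption; the raw file is not shown here).
# By default read.csv() passes column names through make.names(), which swaps characters that are
# not allowed in R names for dots.
raw_header <- "weight (kg)" # hypothetical header text
clean_header <- make.names(raw_header)
clean_header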
# Note that the categorical variables have been imported as "characters". We need to change these to factors so that R can identify them as categorical variables.
# Let's re-assign them
data$pokemon <- as.factor(data$pokemon)
data$type <- as.factor(data$type)
data$sex <- as.factor(data$sex)
data$surface <- as.factor(data$surface)
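# When there are many categorical columns, the same conversion can be done in one step with lapply().
# A minimal sketch on a small made-up data frame (toy_chars, cat_cols and the values are hypothetical, not the Pokemon data).
toy_chars <- data.frame(
  type = c("bug", "fire", "bug"),
  sex = c("f", "m", "f"),
  weight = c(2.9, 8.5, 3.2),
  stringsAsFactors = FALSE
)
cat_cols <- c("type", "sex") # the columns to treat as categorical
toy_chars[cat_cols] <- lapply(toy_chars[cat_cols], as.factor)
str(toy_chars)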
# Let's run the summary function to see what has happened
summary(data)
# Let's check that the renaming above has fixed the column names
names(data)
# Now let's calculate the mean and the standard deviation for Pokemon height and weight.
# This command lets you calculate the mean Pokemon weight in our sample. Note that we specify the dataset [data] and the variable [weight],
# and we use $ to do this.
mean(data$weight)
# Let's calculate the standard error, which you'll remember is the standard deviation divided by the square root of the sample size.
# Note that nrow() returns the number of rows of the data frame given in brackets. This is your sample size, or n.
sd(data$weight)/sqrt(nrow(data))
# We can also create an object and assign it a value or a calculation (this will also appear in the Environment panel)
weight_sd <- sd(data$weight)
# if you run that command you'll get the value in the console
weight_sd
# we can also create an object for the square root of the sample size
sqrt_n <- sqrt(nrow(data))
# and we can now divide the standard deviation by the square root of n
weight_sd/sqrt_n
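# The two steps above can be wrapped in a small helper so that the standard error is a single call.
# A minimal sketch: standard_error() and toy_weights are made-up names and values for illustration, not part of the dataset.
standard_error <- function(x) {
  x <- x[!is.na(x)]        # drop missing values so that n counts only observed values
  sd(x) / sqrt(length(x))  # standard deviation divided by the square root of n
}
toy_weights <- c(2.9, 8.5, 3.2, 6.0, 4.4)
standard_error(toy_weights)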
# Now let's calculate the mean weight for Pokemons caught in the park (the "natural" surface)
# Note that the first part of the command is the same as above; the second part, in square brackets, selects the surface
mean(data$weight[which(data$surface == "natural")])
# Now let's calculate the mean weight for Pokemons caught in the park that are also bug types
mean(data$weight[which(data$surface == "natural" & data$type == "bug")])
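# Base R can also compute the mean for every group in one call with tapply(), instead of subsetting once per group.
# A minimal sketch on a small made-up data frame (toy_groups and its values are hypothetical, not the Pokemon data).
toy_groups <- data.frame(
  weight = c(2, 4, 6, 8, 10, 12),
  surface = c("natural", "natural", "built", "built", "natural", "built"),
  type = c("bug", "bug", "bug", "fire", "fire", "fire")
)
# Mean weight per surface
mean_by_surface <- tapply(toy_groups$weight, toy_groups$surface, mean)
mean_by_surface
# Mean weight per surface and type (rows are surfaces, columns are types)
mean_by_both <- tapply(toy_groups$weight, list(toy_groups$surface, toy_groups$type), mean)
mean_by_both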
# Calculating all these sub-classifications in individual steps is very time consuming
# Again, we can speed up the process by using a package
# Let's install and load another package
install.packages("doBy")
library(doBy)
# Let's use the summaryBy() function to calculate the mean height and weight of Pokemons by surface type
# Note that FUN = mean tells summaryBy() to calculate the mean; you could equally ask it
# to calculate the standard deviation
summaryBy(weight + height ~ surface, FUN = mean, data = data)
# We can also get summaryBy() to calculate statistics for both surface and Pokemon type
# This time we are also going to assign the result to an object
# You could read what summaryBy() does as: summarise weight and height as a function of surface and Pokemon type
# The standard error is computed as sd / sqrt(n), as above
summary.table <- summaryBy(weight + height ~ surface + type,
  data = data,
  FUN = function(x) { c(mean = mean(x), sd = sd(x), se = sd(x) / sqrt(length(x))) }
)
summary.table
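# If doBy is not available, base R's aggregate() accepts a similar formula. A minimal sketch on made-up data
# (toy_measures and its values are hypothetical, not the Pokemon data).
toy_measures <- data.frame(
  weight = c(2, 4, 6, 8),
  height = c(0.2, 0.4, 0.6, 0.8),
  surface = c("natural", "natural", "built", "built"),
  type = c("bug", "bug", "fire", "fire")
)
# cbind() on the left-hand side summarises several variables at once
group_means <- aggregate(cbind(weight, height) ~ surface + type, data = toy_measures, FUN = mean)
group_means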
# Now let's compare the frequency distributions of HP between the two surfaces
# We are going to use a new package called "ggplot2", a widely used and flexible package for data visualisation
# Let's install and load the package first
install.packages("ggplot2")
library(ggplot2)
# Now let's graph the distribution of HP for the surfaces that we have
# Note that in this instance we have divided the dataset and are drawing two separate histograms
# We can colour them and also change the transparency. Here we make the green bars more transparent so that you can see the grey bars behind them
# We can also add the axis labels
ggplot() +
  geom_histogram(data = data[which(data$surface == "built"), ], aes(HP), bins = 6, fill = "grey") +
  geom_histogram(data = data[which(data$surface == "natural"), ], aes(HP), bins = 7, fill = "green", alpha = 0.3) +
  xlab("Pokemon HP") +
  ylab("Number of Pokemons") +
  theme_bw()
# We can use the summary function to provide some summary statistics, including the number of data points in each group (the counts listed for each factor level)
# How many data points do we have in each group?
summary(data)
# Now let's compare that with the built-in t.test() function in {stats}
t.test(HP ~ surface, data = data)
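# To see what t.test() is doing, the t statistic can be reproduced by hand. A minimal sketch, assuming Welch's
# unequal-variance test (the t.test() default) and made-up HP values (toy_hp is hypothetical, not the Pokemon data).
toy_hp <- data.frame(
  HP = c(35, 40, 45, 50, 55, 30, 42, 48, 60, 65),
  surface = factor(rep(c("built", "natural"), each = 5))
)
built <- toy_hp$HP[toy_hp$surface == "built"]
natural <- toy_hp$HP[toy_hp$surface == "natural"]
# Standard error of the difference: square root of the summed squared standard errors of each group
se_diff <- sqrt(var(built) / length(built) + var(natural) / length(natural))
t_by_hand <- (mean(built) - mean(natural)) / se_diff
t_by_hand
# The same statistic from R's function
t_from_r <- unname(t.test(HP ~ surface, data = toy_hp)$statistic)
t_from_r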
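# Now, with pen and paper, let's calculate the chi-square statistic; you can use R as a calculator if you wish
# And now let's run the chi-square test using R's built-in function
# Note: C must be a contingency table of observed counts (for example, one built with table()); it is not created in this script
chisq.test(C, correct = FALSE)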
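# The pen-and-paper calculation can be checked in R. A minimal sketch with made-up counts
# (toy_counts and its labels are hypothetical, not the class data).
# Expected count = row total * column total / grand total, and chi-square = sum((observed - expected)^2 / expected)
toy_counts <- matrix(c(12, 8, 5, 15), nrow = 2,
  dimnames = list(surface = c("built", "natural"), type = c("bug", "other")))
expected <- outer(rowSums(toy_counts), colSums(toy_counts)) / sum(toy_counts)
chi_by_hand <- sum((toy_counts - expected)^2 / expected)
chi_by_hand
# chisq.test() without the continuity correction gives the same statistic
chi_from_r <- unname(chisq.test(toy_counts, correct = FALSE)$statistic)
chi_from_r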