26 Survey Analysis - The Epidemiologist R Handbook
26 Survey analysis
26.1 Overview
This page demonstrates the use of several packages for survey analysis.
Most R packages for survey analysis rely on the survey package for weighted analysis. We will use survey as well as srvyr (a wrapper for survey that allows tidyverse-style coding) and gtsummary (a wrapper for survey that produces publication-ready tables). While the original survey package does not allow tidyverse-style coding, it has the added benefit of supporting survey-weighted generalised linear models (which will be added to this page at a later date). We will also demonstrate using a function from the sitrep package to create sampling weights (n.b. this package is not yet on CRAN, but can be installed from GitHub).
Most of this page is based on work done for the “R4Epis” project; for detailed code and R Markdown templates see the “R4Epis” GitHub page. Some of the survey package code is based on early versions of EPIET case studies.
At present this page does not address sample size calculations or sampling. For a simple-to-use sample size calculator see OpenEpi. The GIS basics page of the handbook will eventually have a section on spatial random sampling, and this page will eventually have a section on sampling frames as well as sample size calculations.
1. Survey data
2. Observation time
3. Weighting
4. Survey design objects
5. Descriptive analysis
6. Weighted proportions
7. Weighted rates
26.2 Preparation
Packages
This code chunk shows the loading of packages required for the analyses. In this handbook we emphasize p_load() from pacman, which installs the
package if necessary and loads it for use. You can also load packages with library() from base R. See the page on R basics for more information on R
packages.
Here we also demonstrate using the p_load_gh() function from pacman to install and load a package from GitHub which has not yet been published on CRAN.
https://epirhandbook.com/en/survey-analysis.html 1/17
30/04/2024, 15:06 26 Survey analysis | The Epidemiologist R Handbook
Load data
The example dataset used in this section is based on the MSF OCA ethical review board pre-approved survey. The fictional dataset was produced as part of the “R4Epis” project. It mimics data collected using KoboToolbox, a data collection software based on Open Data Kit.
Kobo allows you to export both the collected data, as well as the data dictionary for that dataset. We strongly recommend doing this as it simplifies data
cleaning and is useful for looking up variables/questions.
TIP: The Kobo data dictionary has variable names in the “name” column of the survey sheet. Possible values for each variable are specified in the choices sheet. In the choices sheet, “name” has the shortened value and the “label::english” and “label::french” columns have the appropriate long versions. The msf_dict_survey() function from the epidict package imports a Kobo dictionary Excel file and re-formats it so that it can easily be used to recode.
CAUTION: The example dataset is not the same as an export (as in Kobo you export different questionnaire levels individually) - see the survey data section
below to merge the different levels.
The dataset is imported using the import() function from the rio package. See the page on Import and export for various ways to import data.
Preview of the first rows (columns: start, end, today, deviceid, date, team_number, village_name, village_other, cluster_number, ...; most values not shown):
2018-04-15T00:00:00Z village_1 1
2018-03-04T00:00:00Z village_1 1
2018-04-16T00:00:00Z village_10 10
2018-01-23T00:00:00Z village_5 5
2018-01-09T00:00:00Z village_3 3
We also want to import the data on the sampling population so that we can produce appropriate weights. This data can be in different formats; we suggest a simple table of population counts (this can just be typed into an Excel sheet).
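As a sketch of what such a table could look like, the snippet below builds a small population data frame by hand. The grouping columns (health_district, age_group, sex) mirror the variables used for weighting later on this page; the name of the count column (population) is an assumption, and the counts themselves are invented purely for illustration.

```r
## hypothetical population counts per stratum (all numbers are made up)
population <- data.frame(
  health_district = rep(c("district_a", "district_b"), each = 4),
  age_group       = rep(c("0-2", "3-14"), times = 4),
  sex             = rep(rep(c("male", "female"), each = 2), times = 2),
  population      = c(150, 800, 160, 790, 120, 650, 130, 640)
)

## one row per combination of district, age group and sex
population
```

Each combination of the stratifying variables appears exactly once, which is what a later join against the survey data would rely on.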
For cluster surveys you may want to add survey weights at the cluster level. You could read this data in as above. Alternatively, if there are only a few counts, these could be entered directly into a tibble. In any case you will need one column with a cluster identifier which matches your survey data, and another column with the number of households in each cluster.
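A minimal sketch of typing such counts into a tibble is shown here. The object and column names (cluster_counts, cluster, households) and the household numbers are assumptions made for illustration; only the village identifiers follow the pattern seen in the example data.

```r
library(tibble)

## hypothetical household counts per cluster (numbers are made up);
## `cluster` must use the same identifiers as the cluster variable
## in the survey data
cluster_counts <- tribble(
  ~cluster,    ~households,
  "village_1", 700,
  "village_2", 400,
  "village_3", 600
)

cluster_counts
```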
Clean data
The code below makes sure that the date columns are in the appropriate format. There are several other ways of doing this (see the Working with dates page for details), however using the dictionary to define dates is quick and easy.
We also create an age group variable using the age_categories() function from epikit (see the cleaning data handbook section for details). In addition, we create a character variable defining which district the various clusters are in.
Finally, we recode all of the yes/no variables to TRUE/FALSE variables, otherwise these cannot be used by the survey proportion functions.
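## change to dates
## (DATEVARS is assumed to be a character vector of date column names,
## e.g. taken from the data dictionary)
survey_data <- survey_data %>%
  mutate(across(all_of(DATEVARS), as.Date))
## add those with only age in months to the year variable (divide by twelve)
survey_data <- survey_data %>%
  mutate(age_years = if_else(is.na(age_years),
                             age_months / 12,
                             age_years))
## recode yes/no variables to TRUE/FALSE
## (YNVARS is assumed to be a character vector of yes/no column names)
survey_data <- survey_data %>%
  mutate(across(all_of(YNVARS),
                ~ str_detect(.x, pattern = "yes")))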
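The age group and district variables described above are not created in the code shown here. A minimal base R sketch of both steps follows. Here cut() stands in for age_categories(), the break points are taken from the age groups that appear later on this page (0-2, 3-14, 15-29, 30-44, 45+), and the assignment of clusters 1 to 5 to one district is an assumption made for illustration.

```r
## toy data: ages in years and cluster numbers (values are made up)
toy <- data.frame(
  age_years      = c(0.5, 7, 22, 38, 61),
  cluster_number = c(1, 4, 5, 8, 10)
)

## age groups with base R cut(); right = FALSE makes intervals [lower, upper)
toy$age_group <- cut(
  toy$age_years,
  breaks = c(0, 3, 15, 30, 45, Inf),
  labels = c("0-2", "3-14", "15-29", "30-44", "45+"),
  right  = FALSE
)

## character variable for the district (hypothetical split of clusters)
toy$health_district <- ifelse(toy$cluster_number %in% 1:5,
                              "district_a", "district_b")

toy
```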
26.3 Survey data
As described above (depending on how you design your questionnaire) the data for each level would be exported as a separate dataset from Kobo. In our example there is one level for households and one level for individuals within those households.
These two levels are linked by a unique identifier. For a Kobo dataset this variable is “_index” at the household level, which matches the “_parent_index” at the individual level. Joining on it repeats each household row once for every matching individual; see the handbook section on joining for details.
## join the individual and household data to form a complete data set
survey_data <- left_join(survey_data_hh,
                         survey_data_indiv,
                         by = c("_index" = "_parent_index"))
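To see what this join does to the number of rows, here is a small sketch on made-up data. The two toy tables and their values are hypothetical; only the “_index” and “_parent_index” column names come from the text above.

```r
library(dplyr)
library(tibble)

## toy household-level data: one row per household
hh <- tibble(`_index` = c(1, 2),
             village_name = c("village_1", "village_2"))

## toy individual-level data: one row per person, linked via `_parent_index`
indiv <- tibble(`_parent_index` = c(1, 1, 2),
                age_years = c(34, 5, 61))

## household columns are repeated for every matching individual
joined <- left_join(hh, indiv, by = c("_index" = "_parent_index"))

joined
```

Household 1 has two members, so its village name appears on two rows of the result.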
26.4 Observation time
To calculate each individual's observation time (person-time), we first define our time period of interest, also known as a recall period (i.e. the time that participants are asked to report on when answering questions). We can then use this period to set inappropriate dates to missing, e.g. if deaths are reported from outside the period of interest.
We can then use our date variables to define start and end dates for each individual. We can use the find_start_date() function from sitrep to find the causes for the dates and then use that to calculate the difference in days (person-time).
start date: the earliest appropriate arrival event within your recall period. This is either the beginning of your recall period (which you define in advance), or a date after the start of recall if applicable (e.g. arrivals or births).
end date: the earliest appropriate departure event within your recall period. This is either the end of your recall period, or a date before the end of recall if applicable (e.g. departures, deaths).
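The logic of these definitions can be sketched in base R as follows. This is a simplified stand-in rather than the sitrep functions: it handles a single arrival date and a single death date per person, and the recall period, column names and dates are all hypothetical.

```r
## hypothetical recall period (dates chosen for illustration only)
recall_start <- as.Date("2018-01-01")
recall_end   <- as.Date("2018-05-01")

## toy data: one arrival date and one death date per person (NA = no event)
people <- data.frame(
  id           = 1:3,
  arrival_date = as.Date(c(NA, "2018-02-10", "2017-11-20")),
  death_date   = as.Date(c(NA, NA, "2018-03-15"))
)

## set dates that fall outside the recall period to missing
outside_recall <- function(d) !is.na(d) & (d < recall_start | d > recall_end)
people$arrival_date[outside_recall(people$arrival_date)] <- NA
people$death_date[outside_recall(people$death_date)]     <- NA

## start date: arrival within recall if there is one, otherwise start of recall
people$startdate <- people$arrival_date
people$startdate[is.na(people$startdate)] <- recall_start

## end date: death within recall if there is one, otherwise end of recall
people$enddate <- people$death_date
people$enddate[is.na(people$enddate)] <- recall_end

## observation time in days (person-time)
people$obstime <- as.numeric(people$enddate - people$startdate)

people
```

With several possible arrival or departure events per person, the same idea applies after first picking the relevant event date for each person.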
26.5 Weighting
It is important that you drop erroneous observations before adding survey weights. For example, if you have observations with negative observation time, you will need to check those (you can do this with the assert_positive_timespan() function from sitrep). You may also want to drop empty rows (e.g. with drop_na(uid)) or remove duplicates (see the handbook section on De-duplication for details). Those without consent need to be dropped too.
In this example we filter for the cases we want to drop and store them in a separate data frame; this way we can describe those that were excluded from the survey. We then use the anti_join() function from dplyr to remove these dropped cases from our survey data.
DANGER: You cannot have missing values in your weight variable, or in any of the variables relevant to your survey design (e.g. age, sex, strata or cluster variables).
## store the cases that you drop so you can describe them (e.g. non-consenting
## or wrong village/cluster)
dropped <- survey_data %>%
  filter(!consent | is.na(startdate) | is.na(enddate) | village_name == "other")
## use the dropped cases to remove the unused rows from the survey data set
survey_data <- anti_join(survey_data, dropped, by = names(dropped))
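One way to check for the missing values mentioned in the warning above is to count them per design variable before building any survey object. The sketch below uses base R on made-up data; the choice of design variables is hypothetical and should be replaced with those relevant to your own design.

```r
## toy data with some missing values (values are made up)
toy <- data.frame(
  village_name = c("village_1", "village_1", "village_2"),
  sex          = c("male", "female", NA),
  surv_weight  = c(120, NA, 95)
)

## variables the survey design will rely on (hypothetical selection)
design_vars <- c("village_name", "sex", "surv_weight")

## number of missing values per design variable
n_missing <- colSums(is.na(toy[, design_vars]))
n_missing

## rows that would need fixing or dropping before building the design
incomplete_rows <- toy[!complete.cases(toy[, design_vars]), ]
incomplete_rows
```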
As mentioned above we demonstrate how to add weights for three different study designs (stratified, cluster and stratified cluster). These require information
on the source population and/or the clusters surveyed. We will use the stratified cluster code for this example, but use whichever is most appropriate for your
study design.
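To make the idea of a stratified weight concrete, here is a sketch of one standard construction: each respondent's weight is the population of their stratum divided by the number of respondents sampled from it. It is written with dplyr on made-up data and is not the sitrep implementation; whether add_weights_strata() computes exactly this is an assumption.

```r
library(dplyr)
library(tibble)

## toy survey data: 3 respondents in district_a, 2 in district_b
toy_survey <- tibble(
  health_district = c("district_a", "district_a", "district_a",
                      "district_b", "district_b")
)

## toy source population per stratum (numbers are made up)
toy_population <- tibble(
  health_district = c("district_a", "district_b"),
  population      = c(3000, 1000)
)

## weight = stratum population / number sampled in that stratum
toy_weighted <- toy_survey %>%
  add_count(health_district, name = "n_sampled") %>%
  left_join(toy_population, by = "health_district") %>%
  mutate(surv_weight_strata = population / n_sampled)

toy_weighted
```

With weights built this way, the weights within each stratum add up to that stratum's population, so the weighted sample represents the whole source population.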
# stratified ------------------------------------------------------------------
# create a variable called "surv_weight_strata"
# contains weights for each individual - by age group, sex and health district
survey_data <- add_weights_strata(x = survey_data,
                                  p = population,
                                  surv_weight = "surv_weight_strata",
                                  surv_weight_ID = "surv_weight_ID_strata",
                                  age_group, sex, health_district)
## cluster ---------------------------------------------------------------------
26.6 Survey design objects
There are four options; comment out those you do not use:
- Simple random
- Stratified
- Cluster
- Stratified cluster
For this template we will pretend that we ran cluster surveys in two separate strata (health districts A and B). So to get overall estimates we need to have combined cluster and strata weights.
As mentioned previously, there are two packages available for doing this. The classic one is survey and then there is a wrapper package called srvyr that makes tidyverse-friendly objects and functions. We will demonstrate both, but note that most of the code in this chapter will use srvyr based objects. The one exception is that the gtsummary package only accepts survey objects.
NOTE: we need to use the tilde ( ~ ) in front of variables because the survey package uses base R formula syntax to specify variables.
## stratified ------------------------------------------------------------------
base_survey_design_strata <- svydesign(
  ids     = ~1,                   # 1 for no cluster ids
  weights = ~surv_weight_strata,  # weight variable created above
  strata  = ~health_district,     # sampling was stratified by district
  data    = survey_data           # have to specify the dataset
)
## cluster ---------------------------------------------------------------------
base_survey_design_cluster <- svydesign(
  ids     = ~village_name,        # cluster ids
  weights = ~surv_weight_cluster, # weight variable created above
  strata  = NULL,                 # sampling was simple (no strata)
  data    = survey_data           # have to specify the dataset
)
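The srvyr counterpart and the stratified cluster design are not shown above. The sketch below builds both kinds of object for a stratified cluster design on a tiny made-up dataset, so the values and the weight column are illustrative only. It uses as_survey_design() from srvyr, which takes bare column names rather than formulae.

```r
library(survey)
library(srvyr)

## toy data: 2 districts (strata), 2 villages (clusters) each,
## 2 people per village (all values are made up)
toy <- data.frame(
  health_district = rep(c("district_a", "district_b"), each = 4),
  village_name    = rep(paste0("village_", 1:4), each = 2),
  surv_weight     = rep(c(100, 50), each = 4),
  died            = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)
)

## survey package: formula interface (note the tildes)
design_survey <- svydesign(ids     = ~village_name,
                           strata  = ~health_district,
                           weights = ~surv_weight,
                           data    = toy)

## srvyr package: same design, tidyverse-style (no tildes)
design_srvyr <- as_survey_design(toy,
                                 ids     = village_name,
                                 strata  = health_district,
                                 weights = surv_weight)

class(design_survey)
class(design_srvyr)
```

The srvyr object carries the survey design classes as well, which is why it can also be passed to functions from survey.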
26.7 Descriptive analysis
In this section we will focus on how to investigate bias in your sample and visualise this. We will also look at visualising population flow in a survey setting using alluvial/sankey diagrams.
Median (range) number of households per cluster and individuals per household
Note that these p-values are just indicative, and a descriptive discussion (or visualisation with age pyramids below) of the distributions in your study sample compared to the source population is more important than the binomial test itself. This is because increasing sample size will more often than not lead to statistically significant differences that may be irrelevant after weighting your data.
## bind together the columns of two tables, group by age, and perform a
## binomial test to see if n/total is significantly different from population
## proportion.
## suffix here adds text to the end of columns in each of the two datasets
left_join(ag, propcount, by = "age_group", suffix = c("", "_pop")) %>%
  group_by(age_group) %>%
  ## broom::tidy(binom.test()) makes a data frame out of the binomial test and
  ## will add the variables p.value, parameter, conf.low, conf.high, method, and
  ## alternative. We will only use p.value here. You can include other
  ## columns if you want to report confidence intervals
  mutate(binom = list(broom::tidy(binom.test(n, n_total, proportion_pop)))) %>%
  unnest(cols = c(binom)) %>% # important for expanding the binom.test data frame
  mutate(proportion_pop = proportion_pop * 100) %>%
  ## Adjusting the p-values to correct for false positives
  ## (because testing multiple age groups). This will only make
  ## a difference if you have many age categories.
  ## ungroup() first, otherwise each p-value is adjusted on its own
  ## (one row per group) and the correction has no effect
  ungroup() %>%
  mutate(p.value = p.adjust(p.value, method = "holm"))
## (the formatting and renaming steps that produce the table below are not shown)
## # A tibble: 5 × 6
## # Groups: Age group [5]
## `Age group` `Study population (n)` `Study population (%)` `Source population (n)` `Source population (%)` `P-value`
## <chr> <int> <dbl> <dbl> <dbl> <chr>
## 1 0-2 12 0.0256 1360 6.8 <0.001
## 2 3-14 42 0.0896 7244 36.2 <0.001
## 3 15-29 64 0.136 5520 27.6 <0.001
## 4 30-44 52 0.111 3232 16.2 0.002
## 5 45+ 299 0.638 2644 13.2 <0.001
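The same kind of comparison can be sketched in base R from the counts in the table above. It assumes that each age group's sample count is tested against the total sample size, using that group's share of the source population as the expected proportion; the variable names are made up for this example.

```r
## counts taken from the table above
age_group <- c("0-2", "3-14", "15-29", "30-44", "45+")
n_sample  <- c(12, 42, 64, 52, 299)           # study population (n)
n_pop     <- c(1360, 7244, 5520, 3232, 2644)  # source population (n)

## expected proportion of each age group in the source population
prop_pop <- n_pop / sum(n_pop)

## one binomial test per age group: observed n out of the total sample
p_raw <- mapply(
  function(n, p) binom.test(n, sum(n_sample), p)$p.value,
  n_sample, prop_pop
)

## Holm adjustment across all age groups together
p_adj <- p.adjust(p_raw, method = "holm")

data.frame(age_group,
           sample_pct = round(100 * n_sample / sum(n_sample), 1),
           pop_pct    = round(100 * prop_pop, 1),
           p_adj      = signif(p_adj, 2))
```

Because p.adjust() is applied to the whole vector of p-values at once, the correction takes all five age groups into account.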
As with the formal binomial test of difference, seen above in the sampling bias section, we are interested here in visualising whether our sampled population is substantially different from the source population and whether weighting corrects this difference. To do this we will use the patchwork package to show our ggplot visualisations side-by-side; for details see the section on combining plots in the ggplot tips chapter of the handbook. We will visualise our source population, our un-weighted survey population and our weighted survey population. You may also consider visualising by each stratum of your survey; in our example here that would be by using the argument stack_by = "health_district" (see ?plot_age_pyramid for details).
## max_prop (largest percentage on the axis) and step (spacing between
## axis ticks, in percent) are assumed to be defined beforehand
## axis breaks, as proportions: negative side, zero, positive side
breaks <- c(
  seq(max_prop / 100 * -1, 0 - step / 100, step / 100),
  0,
  seq(0 + step / 100, max_prop / 100, step / 100)
)
## axis limits, as proportions
limits <- c(max_prop / 100 * -1, max_prop / 100)
## axis labels, as percentages (shown as positive on both sides)
labels <- c(
  seq(max_prop, step, -step),
  0,
  seq(step, max_prop, step)
)
## summarize data
flow_table <- survey_data %>%
  count(startcause, endcause, sex) %>%               # get counts
  gather_set_data(x = c("startcause", "endcause"))   # change format for plotting
26.8 Weighted proportions
NOTE: Functions from survey also accept srvyr design objects, but here we have used the survey design object just for consistency.
## died
## FALSE TRUE
## 1406244.43 76213.01
## 2.5% 97.5%
## died 0.0514 0.0208 0.12
## diedFALSE diedTRUE
## 3.755508 3.755508
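The calls that produced the output above are not shown. As a hedged sketch, weighted counts, a weighted proportion with its confidence interval, and the design effect can be obtained with svytable(), svyciprop() and svymean() from survey. The design below is built on made-up data, so its numbers will not match the output above.

```r
library(survey)

## toy data: 2 strata, 3 clusters each, 2 people per cluster (values made up)
toy <- data.frame(
  health_district = rep(c("district_a", "district_b"), each = 6),
  village_name    = rep(paste0("village_", 1:6), each = 2),
  surv_weight     = rep(c(100, 50), each = 6),
  died            = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,
                      FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

toy_design <- svydesign(ids     = ~village_name,
                        strata  = ~health_district,
                        weights = ~surv_weight,
                        data    = toy)

## weighted counts
weighted_counts <- svytable(~died, toy_design)
weighted_counts

## weighted proportion with 95% confidence interval
weighted_prop <- svyciprop(~died, toy_design, na.rm = TRUE)
weighted_prop

## weighted means with design effect, then extract the design effect
weighted_mean <- svymean(~died, toy_design, na.rm = TRUE, deff = TRUE)
design_effect <- deff(weighted_mean)
design_effect
```

For a TRUE/FALSE variable, svymean() returns one estimate per level, which is why the design effect is reported for both diedFALSE and diedTRUE.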
We can combine the functions from survey shown above into a function which we define ourselves below, called svy_prop; we can then use that function together with map() from the purrr package to iterate over several variables and create a table. See the handbook iteration chapter for details on purrr.
## return dataframe
full_table
}
NOTE: It does not seem to be possible to get proportions from categorical variables using srvyr either; if you need this, check out the section below using sitrep.
## # A tibble: 1 × 5
## counts props props_low props_upp deff_deff
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 76213. 5.14 2.08 12.1 3.76
Here too we could write a function to then iterate over multiple variables using the purrr package. See the handbook iteration chapter for details on purrr.
summarise(
  ## using the survey design object
  design,
  ## produce the weighted counts
  counts = survey_total(.data[[x]]),
  ## produce weighted proportions and confidence intervals
  ## multiply by 100 to get a percentage
  props = survey_mean(.data[[x]],
                      proportion = TRUE,
                      vartype = "ci") * 100,
  ## produce the design effect
  deff = survey_mean(.data[[x]], deff = TRUE)) %>%
  ## add in the variable name
  mutate(variable = x) %>%
  ## only keep the columns of interest
  ## (drop standard errors and repeat proportion calculation)
  select(variable, counts, props, props_low, props_upp, deff_deff)
## # A tibble: 3 × 6
## variable counts props props_low props_upp deff_deff
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 left 701199. 47.3 39.2 55.5 2.38
## 2 died 76213. 5.14 2.08 12.1 3.76
## 3 arrived 761799. 51.4 40.9 61.7 3.93
## # A tibble: 9 × 5
## variable value n deff ci
## <chr> <chr> <dbl> <dbl> <chr>
## 1 arrived TRUE 761799. 3.93 51.4% (40.9-61.7)
## 2 arrived FALSE 720658. 3.93 48.6% (38.3-59.1)
## 3 left TRUE 701199. 2.38 47.3% (39.2-55.5)
## 4 left FALSE 781258. 2.38 52.7% (44.5-60.8)
## 5 died TRUE 76213. 3.76 5.1% (2.1-12.1)
## 6 died FALSE 1406244. 3.76 94.9% (87.9-97.9)
## 7 education_level higher 171644. 4.70 42.4% (26.9-59.7)
## 8 education_level primary 102609. 2.37 25.4% (16.2-37.3)
## 9 education_level secondary 130201. 6.68 32.2% (16.5-53.3)
(gtsummary table; column headers: Characteristic, Weighted total (N), Weighted Count, 95% CI)
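26.9 Weighted rates
## confidence interval for the ratio object
ci <- confint(ratio)
## rescale the estimate and its interval to a rate per 10,000
cbind(
  ratio$ratio * 10000,
  ci * 10000
)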
## # A tibble: 1 × 3
## mortality mortality_low mortality_upp
## <dbl> <dbl> <dbl>
## 1 5.98 0.349 11.6
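The call that creates the ratio object is not shown above. The following is a sketch of how a weighted mortality rate per 10,000 person-days could be computed with svyratio() from survey, using deaths as the numerator and observation time in days as the denominator. The data and the column names (died, obstime) are made up for illustration, and died is coded 0/1 here for simplicity.

```r
library(survey)

## toy data: 2 strata, 3 clusters each, 2 people per cluster (values made up)
toy <- data.frame(
  health_district = rep(c("district_a", "district_b"), each = 6),
  village_name    = rep(paste0("village_", 1:6), each = 2),
  surv_weight     = rep(c(100, 50), each = 6),
  died            = c(0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0),  # 1 = died
  obstime         = c(120, 40, 120, 90, 120, 120,
                      120, 110, 60, 120, 120, 100)           # days observed
)

toy_design <- svydesign(ids     = ~village_name,
                        strata  = ~health_district,
                        weights = ~surv_weight,
                        data    = toy)

## weighted ratio: total deaths / total person-days
toy_ratio <- svyratio(~died, denominator = ~obstime, design = toy_design)

## confidence interval for the ratio
toy_ci <- confint(toy_ratio)

## rescale to deaths per 10,000 person-days
cbind(rate = as.numeric(toy_ratio$ratio) * 10000,
      toy_ci * 10000)
```

The point estimate is the weighted number of deaths divided by the weighted person-time, so the weights affect both the numerator and the denominator.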
26.10 Resources
UCLA stats page
srvyr package
gtsummary package
"The Epidemiologist R Handbook" was written by the handbook team. It was last built on 2023-07-18.