Managing-Data PW Final 09252013
PARTICIPANT WORKBOOK
Managing Data
Created: 2013
Managing Data. Atlanta, GA: Centers for Disease Control and Prevention (CDC), 2013.
MANAGING DATA
Managing Data
INTRODUCTION
LEARNING OBJECTIVES
ESTIMATED COMPLETION TIME
TARGET AUDIENCE
PREWORK AND PREREQUISITES
ABOUT THIS WORKBOOK AND THE ACTIVITY WORKBOOK
ICON GLOSSARY
ACKNOWLEDGEMENTS
SECTION 1: OVERVIEW OF DATA MANAGEMENT
DATA MANAGEMENT PRACTICES
DATA DICTIONARY
CLEANING THE DATA
FREQUENCIES
LOGIC CHECKS
KEY POINTS TO REMEMBER
RESOURCES
Introduction
LEARNING OBJECTIVES
At the end of the training, you will be able to:
1. Create a data dictionary that includes, at a minimum:
a. Variable names
b. Variable descriptions or labels
c. Variable types
d. Response options and allowable values
2. Clean the data
a. Identify errors, including duplications, missing data, miscodes,
and outliers
b. Use statistical software to identify and correct errors
TARGET AUDIENCE
The workbook is designed for Field Epidemiology Training Program (FETP)
fellows who specialize in noncommunicable diseases (NCDs); however, you can
also complete the module if you work in infectious disease.
ICON GLOSSARY
The following icons are used in this workbook:
Image Type | Image Meaning
Stop | A point at which you should consult a mentor or wait for the facilitator to provide locally relevant information about the topic
Activity | An activity or exercise that you should complete
ACKNOWLEDGEMENTS
Many thanks to the following people from the Centers for Disease Control
and Prevention (CDC) who contributed to this module:
Some of the content of this module was taken from a training manual
developed by CDC's Division of Epidemiology and Surveillance Capacity
Development: Advanced Management and Analysis of Data Using Epi Info
for Windows: Risk Factors for Sexually Transmitted Infections in
Kuwadzana, Zimbabwe; 2006.
[Figure: Managing Data; Analyzing and Interpreting Large Datasets; Data into Action]
DATA DICTIONARY
If you are analyzing data that you did not collect, you must first familiarize
yourself with the dataset. You will create a data dictionary, also called a
codebook, to understand the meaning of the collected data. It should
describe how the data are arranged in the computer file and what the
various numbers and letters mean.
Whether or not you collected the data yourself, you should always use the
data dictionary during data analysis so that the meaning of a variable and
the coding used will never be in question. The data dictionary is the place
where you will look up which codes correspond to each possible response.
Variable Name
Identify each variable from the survey or questionnaire and give it a name.
Use the name to identify variables in the database and during the analysis.
Ideal features of a name:
Easily identifies the question on the data collection form (if one is
used) or type of information collected
Begins with a letter
Cannot end with a period
Can have special symbols or characters
Should be short, with a maximum length of 64 characters
Limits the use of symbols
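The mechanical rules in the list above (begins with a letter, does not end with a period, no more than 64 characters) can be checked automatically. The following is a minimal sketch in Python, not a feature of any particular statistical package; the function name check_variable_name is hypothetical, and the limits in your own software may differ.

```python
def check_variable_name(name, max_length=64):
    """Return a list of naming problems; an empty list means none were found."""
    if not name:
        return ["name is empty"]
    problems = []
    if not name[0].isalpha():
        problems.append("does not begin with a letter")
    if name.endswith("."):
        problems.append("ends with a period")
    if len(name) > max_length:
        problems.append("longer than %d characters" % max_length)
    return problems


# Hypothetical candidate names, for illustration only.
for candidate in ["MarStat", "1stVisit", "DOB."]:
    print(candidate, check_variable_name(candidate))
```

Whether a name clearly identifies the question on the data collection form still has to be judged by a person; the sketch covers only the rules that can be tested mechanically.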
Variable name | Description | Field type | Response options (1)
IDnum | ID number | Numeric | 0001-9999
DOB | Date of birth | Date (dd/mm/yyyy) | 01/01/1900-30/11/2009
CntryBth | Country of birth | Text | {up to 60 characters of text}
MarStat | Marital status | Numeric | 1 = Single; 2 = Married; 3 = Divorced; 4 = Widowed; 99 = Missing

(1) Some data dictionaries also include a fifth column for measures to record the type of
variables (scale, nominal, or ordinal).
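If you analyze data with a scripting language, the data dictionary can also be kept in a machine-readable form so that codes are looked up rather than remembered. The following is a minimal sketch in Python that uses the four example variables from the table above; the structure and the function name decode are assumptions made for illustration, not a standard format.

```python
# Entries copied from the example data dictionary above.
DATA_DICTIONARY = {
    "IDnum": {"description": "ID number", "field_type": "Numeric",
              "responses": "0001-9999"},
    "DOB": {"description": "Date of birth", "field_type": "Date (dd/mm/yyyy)",
            "responses": "01/01/1900-30/11/2009"},
    "CntryBth": {"description": "Country of birth", "field_type": "Text",
                 "responses": "up to 60 characters of text"},
    "MarStat": {"description": "Marital status", "field_type": "Numeric",
                "responses": {1: "Single", 2: "Married", 3: "Divorced",
                              4: "Widowed", 99: "Missing"}},
}


def decode(variable, code):
    """Return the label for a coded response, or None if the code is not listed."""
    responses = DATA_DICTIONARY[variable]["responses"]
    if isinstance(responses, dict):
        return responses.get(code)
    return None  # variable has no coded response options


print(decode("MarStat", 2))  # Married
print(decode("MarStat", 7))  # None (7 is not an allowable code)
```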
of the data. For example, a BMI variable can be created using the weight
and height variables.
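As an illustration of a derived variable, body mass index (BMI) is weight in kilograms divided by the square of height in meters. The sketch below assumes that height was recorded in centimeters and that missing values are stored as None; both are assumptions, and the function name body_mass_index is hypothetical.

```python
def body_mass_index(weight_kg, height_cm):
    """BMI = weight in kilograms divided by height in meters squared."""
    if weight_kg is None or height_cm is None or height_cm <= 0:
        return None  # leave the derived value missing if an input is missing or invalid
    height_m = height_cm / 100
    return weight_kg / (height_m ** 2)


print(round(body_mass_index(70, 175), 1))  # 22.9
```

Record any derived variable in the data dictionary along with the variables it was calculated from.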
Stop
Let the facilitator or mentor know you are ready for the group discussion.
Activity
Number | Variable name | Type | Label | Value | Measure
1 | SEQN | Numeric | Identification number | | Scale
2 | AGEY | Numeric | Age in years | 555 = Missing; 777 = Refused; 999 = Don't Know | Scale
3 | SEX | Numeric | Gender | 1 = Female; 2 = Male; 6 = Missing; 77 = Refused; 99 = Don't Know | Nominal
4 | EDU | Numeric | Educational level | 1 = Less than High School; 2 = High School Graduate; 3 = Some College; 4 = College Graduate; 5 = <25 years of age; 6 = Missing; 77 = Refused; 99 = Don't Know | Nominal
5 | ETH5C | Numeric | Racial/ethnic group | 1 = non-Hispanic white; 2 = non-Hispanic black; 3 = Hispanic; 4 = Other; 6 = Missing; 77 = Refused; 99 = Don't Know | Nominal
6 | HCP | Numeric | Current health care provider | 1 = Yes, one; 2 = Yes, more than one; 3 = No; 6 = Missing; 77 = Don't Know; 99 = Refused | Nominal
7 | BPTOLD1 | Numeric | | 1 = Yes; 2 = No; 6 = Missing; 77 = Refused; 99 = Don't Know | Nominal
8 | BPTOLD2 | Numeric | | 1 = Yes; 2 = No; 6 = Missing; 77 = Refused; 99 = Don't Know | Nominal
9 | BPMED | Numeric | Taking blood pressure medication | 1 = Yes; 2 = No; 6 = Missing; 77 = Refused; 99 = Don't Know | Nominal
10 | BPSYS | Numeric | Systolic blood pressure | 6666 = Missing; 9999 = Refused | Scale
11 | BPDIA | Numeric | Diastolic blood pressure | 6666 = Missing; 9999 = Refused | Scale
12 | BPHI | Numeric | BP >=140/90 | 1 = Yes; 2 = No; 6 = Missing | Nominal
13 | | | | |
14 | | | | |
15 | | | | |
16 | | | | |
17 | | | | |
18 | | | | |
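The allowable values listed in a data dictionary can be used to flag miscodes, that is, values that do not correspond to any listed code. The following is a minimal sketch in Python under stated assumptions: records are held as dictionaries, only three of the coded variables from the table above are checked, the sample records are invented, and the function name find_miscodes is hypothetical.

```python
# Allowable codes taken from the data dictionary table above.
ALLOWABLE = {
    "SEX": {1, 2, 6, 77, 99},
    "BPMED": {1, 2, 6, 77, 99},
    "BPHI": {1, 2, 6},
}


def find_miscodes(records):
    """Return (SEQN, variable, value) for every value not listed in ALLOWABLE."""
    flagged = []
    for record in records:
        for variable, allowed in ALLOWABLE.items():
            value = record.get(variable)
            if value is not None and value not in allowed:
                flagged.append((record["SEQN"], variable, value))
    return flagged


# Invented records, for illustration only.
sample = [
    {"SEQN": 1, "SEX": 1, "BPMED": 2, "BPHI": 2},
    {"SEQN": 2, "SEX": 3, "BPMED": 1, "BPHI": 1},
]
print(find_miscodes(sample))  # [(2, 'SEX', 3)]
```

A flagged value is only a candidate error. Check it against the original data source before changing anything.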
Activity
Take out the activity workbook and complete skill assessment #1.
Then continue reading in the participant workbook.
Tip
It is difficult to remember to return to a problem, so fix an error as soon as it
occurs (or as soon as you become aware of it).
This can be particularly important when you are entering data from the
field.
Use a table such as the one below to document changes made to the
dataset for missing, miscoded, and out-of-range values:
Variable name | Format | Problem | Record IDs affected and resolution
The following is an example of how you can use this table during data
cleaning:
Variable name | Format | Problem | Record IDs affected and resolution
DOB | dd/mm/yyyy | Missing data | 0012 - not collected; action: none
 | | | 0075 - unclear writing; action: left blank
 | | | 0103 - not entered; action: entered from questionnaire
DOB | dd/mm/yyyy | Improbable value | 0024 - month=15; data was mis-entered; action: corrected in database
Use a table such as the one below to document changes made to the
dataset for duplicate records:
Number | Primary ID | Secondary ID | Problem | Record IDs affected and resolution
1
SSN
duplicate
3125 - removed
Participant ID
number
duplicate
3241 - removed
Missing
duplicate
3278 - removed
Social
Security
Number
(SSN)
Tip
Because these edits permanently change the dataset, it is important to
make a working copy of the original dataset. Make changes to the
working copy only. If you make any mistakes, you can access the original
dataset and start over.
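If the dataset is stored as a single file, the working copy can be made with a short script. The following is a minimal sketch in Python, assuming a file-based dataset; the function name make_working_copy and the "_working" suffix are hypothetical choices.

```python
import shutil
from pathlib import Path


def make_working_copy(original_path):
    """Copy the original file to '<name>_working<ext>' and return the new path."""
    original = Path(original_path)
    working = original.with_name(original.stem + "_working" + original.suffix)
    if working.exists():
        # Refuse to overwrite an existing working copy that may hold earlier edits.
        raise FileExistsError(f"{working} already exists; not overwriting it")
    shutil.copy2(original, working)  # copies the contents and file metadata
    return working
```

For example, make_working_copy("survey.csv") (a hypothetical file name) would create survey_working.csv next to the original and leave the original untouched.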
Good record keeping and study organization will usually prevent these types
of errors from occurring. Sometimes duplications arise when multiple
databases are merged into one.
To check for duplicate records:
1. Identify how many records are in the database. Use your statistical
software to check the record count.
2. Determine if the number of records matches the number of
questionnaires.
3. If the number of records is more than the number of questionnaires, run
a frequency listing to look for multiple records with the same identifying
information (such as ID number or name).
4. If there are two records with the same ID number or name, select the
records and examine them to determine if they are identical (a duplicate
record) or whether an ID number or name was entered incorrectly.
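The steps above can be sketched in code. The following is a minimal illustration in Python, not a replacement for the frequency listing in your statistical software; the records are invented, and the field name IDnum and the function name check_duplicates are assumptions.

```python
from collections import defaultdict


def check_duplicates(records, id_field="IDnum"):
    """Group records by ID and classify every ID that appears more than once."""
    by_id = defaultdict(list)
    for record in records:
        by_id[record[id_field]].append(record)

    findings = {}
    for record_id, group in by_id.items():
        if len(group) > 1:
            identical = all(r == group[0] for r in group[1:])
            findings[record_id] = ("identical duplicate" if identical
                                   else "same ID, different data")
    return findings


# Invented records, for illustration only.
records = [
    {"IDnum": "0012", "age": 34},
    {"IDnum": "0012", "age": 34},  # the same questionnaire entered twice
    {"IDnum": "0075", "age": 51},
    {"IDnum": "0075", "age": 15},  # an ID was probably entered incorrectly
    {"IDnum": "0103", "age": 42},
]
print(len(records))               # step 1: record count
print(check_duplicates(records))  # steps 3 and 4: repeated IDs and their type
```

An ID flagged as "same ID, different data" needs to be checked against the questionnaires before either record is removed.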
Stop
Let the facilitator or mentor know you are ready for the group discussion.
Activity
Background:
For this exercise you will practice checking for duplicate records for the
hypertension case study.
Instructions:
1. Check for and identify duplicate records.
2. Correct any errors and document in the table below.
3. Let your facilitator know when you have completed the exercise.
Figure 2:
A cross-sectional survey was conducted during 2009-2010 among adults
aged 18 years and older (N = 993). The survey was conducted in all
provinces (or regions) in Country X. The purpose of the survey was to
provide estimates of the current health of the adults in the country as well
as health conditions in the country. Assessments of high blood pressure
were included in the survey. Data were collected by trained interviewers in
the homes of participants using paper-and-pencil questionnaires and
measurements. Responses were checked for completion and entered into
a database manually. The dataset needs to be reviewed for errors.
Use the table below to document the problem and resolution:
Number | Primary ID | Secondary ID (2) | Problem | Record IDs affected and resolution
the same variable. When missing data follow a pattern, the pattern may
provide clues to why the data are missing. (Please see example on the following
page.)
A large amount of missing data for a single subject may indicate that the
interview or questionnaire was terminated early. Or it might mean that the
person was lost to follow-up. Sometimes data are intentionally missing,
such as when a skip pattern on a questionnaire is used to avoid asking
subjects irrelevant questions.
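Counting missing values both by variable and by record helps reveal such patterns. The following is a minimal sketch in Python, assuming that missing values are stored as None or as an empty string; the records, the field names (id, q1, q2, q3), and the function name missing_summary are invented for illustration.

```python
def missing_summary(records, variables, missing_codes=(None, "")):
    """Count missing values for each variable and for each record."""
    per_variable = {v: 0 for v in variables}
    per_record = {}
    for record in records:
        count = 0
        for v in variables:
            if record.get(v) in missing_codes:
                per_variable[v] += 1
                count += 1
        per_record[record["id"]] = count
    return per_variable, per_record


# Invented records: record 3 has no answers after the first question.
records = [
    {"id": 1, "q1": 2, "q2": 1, "q3": 4},
    {"id": 2, "q1": 1, "q2": None, "q3": 3},
    {"id": 3, "q1": 2, "q2": None, "q3": None},
]
by_variable, by_record = missing_summary(records, ["q1", "q2", "q3"])
print(by_variable)  # {'q1': 0, 'q2': 2, 'q3': 1}
print(by_record)    # {1: 0, 2: 1, 3: 2}
```

If the dataset uses numeric codes for missing values (for example, 99 = Missing), pass those codes in missing_codes instead.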
Out-of-range values, known as outliers, are values that fall outside the
range of values of the majority of responses. You will most often detect
outliers by examining descriptive statistics for each variable. This includes
the minimum and maximum values, measures of central tendency, and
frequency distributions, which are represented graphically by histograms.
The frequency distribution of a continuous variable such as age might
include a few values that seem quite low (young) or high (old). You can sort
the variable (ascending or descending) to determine whether the values are
accurate (true outliers) or the result of miscoding.
For example, a survey of reproductive-age women may include respondents
who were less than 10 years of age or over 65 years of age; both could be
outliers because most women in the survey would be less than 45 years old.
The survey should not have included anyone above age 45.
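A range check like the one in this example can be sketched as follows. The ages are invented, the lower limit of 15 years is an assumption (the text gives only the upper limit of 45), and the function name out_of_range is hypothetical.

```python
def out_of_range(records, variable, low, high):
    """Return (id, value) pairs whose value lies outside the range low to high."""
    return [(r["id"], r[variable]) for r in records
            if r[variable] is not None and not (low <= r[variable] <= high)]


# Invented ages for a survey of reproductive-age women.
records = [{"id": 1, "age": 23}, {"id": 2, "age": 9},
           {"id": 3, "age": 38}, {"id": 4, "age": 67}]

ages = sorted(r["age"] for r in records)
print(ages[0], ages[-1])                     # minimum and maximum: 9 67
print(out_of_range(records, "age", 15, 45))  # [(2, 9), (4, 67)]
```

The flagged records still need to be compared with the original questionnaires to decide whether each value is a true outlier or a miscode.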
FREQUENCIES
In addition to using a graph (or histogram) to quickly detect errors, you can
examine the data in each variable by conducting a frequency distribution.
The statistical software program you use will have a frequency command
that allows you to select all variables. Review the individual variables and
look for values that are out-of-range or inconsistent with other data in the
record or where data are missing.
Gender | Frequency | Percent | Cum Percent
Female | 1070 | 47.8% | 47.8%
Male | 1170 | 52.2% | 100.0%
Total | 2240 | 100.0% | 100.0%
Note that there are only 2,240 responses rather than 2,242. This means
that two records had missing values on gender.
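The frequency table above can be reproduced with a few lines of code, which also shows where the gap between 2,240 and 2,242 comes from. The following is a minimal sketch in Python; the list of values is rebuilt from the counts in the table, missing values are assumed to be stored as None, and the function name frequency_table is hypothetical.

```python
from collections import Counter


def frequency_table(values):
    """Return rows of (value, frequency, percent, cumulative percent) and the total.

    Missing values (None) are excluded, as in the table above.
    """
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    rows, running = [], 0
    for value in sorted(counts):
        running += counts[value]
        rows.append((value, counts[value],
                     round(100 * counts[value] / total, 1),
                     round(100 * running / total, 1)))
    return rows, total


# Rebuilt from the table above: 1,070 female, 1,170 male, 2 missing.
gender = ["Female"] * 1070 + ["Male"] * 1170 + [None] * 2
rows, total = frequency_table(gender)
for row in rows:
    print(row)
print("Total", total, "of", len(gender), "records")
```

Comparing the table total with the number of records is a quick way to see how many values are missing for a variable.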
Correcting Miscodes
The simplest way to correct miscodes is to first look at the original data
source from the subject in question (such as the questionnaire) to determine
the true value. Then make the correction in the database. Remember to
write down the changes made.
If you do not have access to the original data source, recontact the subject
to confirm that the information you have is correct (or incorrect);
[Figure: scatter plots of weight in kilograms by age in years, with fitted values, in separate panels for females and males]
Handling Outliers
To determine whether an outlier is a true outlier or an error (e.g., data entry
error or miscode), look at the original data source from the subject in
question (such as the questionnaire). If it is an error, make the correction in
the database. Remember to write down the changes made.
When you have an outlier that you cannot resolve by looking at the
questionnaire, decide whether to verify the data or leave it as entered. This
will depend on the effort required, the importance of that particular variable,
and the overall size of the dataset. For a very small dataset and a key
variable, it is probably worth the effort to get it right. In another
circumstance, having a missing value for such a variable may be
acceptable.
Tip
It is never okay to change a value just because it does not seem valid.
LOGIC CHECKS
A logic check compares the responses to two different variables to
determine whether they are logically consistent. One type of logic check looks for
impossibilities (usually a typo or data misentry). An example of this is a
date of discharge for a given hospital stay that is earlier than the date of
admission for the same stay. Similarly, we often compare a calculated age
based on date of birth to stated age in years.
Another type of logic check is looking for inconsistencies, such as
comparing the hysterectomy (or prostate cancer) variable with the gender
variable. For example, if a question were asked about a diagnosis of
prostate cancer and the reply is marked 'yes', this would not be compatible
with sex = F. You cannot solve this without doing additional investigation.
Maybe the participant was female and the code for prostate cancer is
incorrect. Or maybe the participant did have prostate cancer but the sex is
miscoded.
A third type of logic check is ensuring that skip patterns have been
followed. For example, a skip pattern has been violated if a respondent
answers "never smoked cigarettes" and then answers that she started smoking in 2004.
Some statistical software programs can incorporate logic checks so that
improbable values are flagged for the investigator to examine.
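The three types of logic check described above can be sketched in code. The following is a minimal illustration in Python and does not show how any particular statistical package implements logic checks; the record, the field names, and the function name logic_check are invented for illustration.

```python
from datetime import date


def logic_check(record):
    """Return a list of logic problems found in one record."""
    problems = []
    # Impossibility: the discharge date is earlier than the admission date.
    if record["discharge"] < record["admission"]:
        problems.append("discharge date is earlier than admission date")
    # Inconsistency: prostate cancer diagnosis recorded for a female participant.
    if record["prostate_cancer"] == "yes" and record["sex"] == "F":
        problems.append("prostate cancer = yes but sex = F")
    # Skip pattern: a never-smoker should have no smoking start year.
    if record["ever_smoked"] == "never" and record["smoking_start_year"] is not None:
        problems.append("never smoked but smoking start year is recorded")
    return problems


# Invented record, for illustration only.
record = {
    "admission": date(2010, 3, 14), "discharge": date(2010, 3, 2),
    "prostate_cancer": "no", "sex": "F",
    "ever_smoked": "never", "smoking_start_year": 2004,
}
print(logic_check(record))
```

The check only flags records. Deciding which of the two conflicting values is wrong still requires going back to the original data source.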
Stop
Let the facilitator or mentor know you are ready for the group discussion.
Activity
Number | Variable name | Format | Problem | Record IDs affected and resolution
1 | SEX | Numeric | Miscoded (response = 3) | 51929 - changed to missing
2 | | | |
3 | | | |
4 | | | |
5 | | | |
6 | | | |
7 | | | |
8 | | | |
9 | | | |
Activity
Take out the activity workbook and complete skill assessment #2.
Resources
For more information on topics found in this workbook:
Centers for Disease Control and Prevention, Division of Epidemiology and
Surveillance Capacity Development. Advanced Management and Analysis
of Data Using Epi Info for Windows: Risk Factors for Sexually Transmitted
Infections in Kuwadzana, Zimbabwe; 2006.
Tulane University: Practical Analysis of Nutritional Data. Available at
http://www.tulane.edu/~panda2/Analysis2/datclean/dataclean.htm#1.