
Managing-Data PW Final 09252013

This document provides guidance on creating a data dictionary and cleaning data. It discusses the key components of a data dictionary including variable names, descriptions, types, and response options/values. Common sources of errors in epidemiological data are also reviewed such as duplicates, missing/miscoded values, and outliers. Methods for detecting and correcting these errors like frequencies and logic checks using statistical software are presented. The importance of documenting any data cleaning activities is emphasized.


PARTICIPANT WORKBOOK

Managing Data
Created: 2013

Managing Data. Atlanta, GA: Centers for Disease Control and Prevention (CDC), 2013.

MANAGING DATA

Managing Data
INTRODUCTION
LEARNING OBJECTIVES
ESTIMATED COMPLETION TIME
TARGET AUDIENCE
PRE-WORK AND PREREQUISITES
ABOUT THIS WORKBOOK AND THE ACTIVITY WORKBOOK
ICON GLOSSARY
ACKNOWLEDGEMENTS

SECTION 1: OVERVIEW OF DATA MANAGEMENT
DATA MANAGEMENT PRACTICES
DATA DICTIONARY
CLEANING THE DATA

SECTION 2: DATA DICTIONARY
COMPONENTS OF A DATA DICTIONARY
KEY POINTS TO REMEMBER

SECTION 3: CLEANING DATA
OVERVIEW
DOCUMENTING ERRORS AND CHANGES TO THE DATASET
COMMON SOURCES AND TYPES OF ERRORS IN EPIDEMIOLOGIC DATA
DETECTING AND CORRECTING DUPLICATE RECORDS
KEY POINTS TO REMEMBER
DETECTING AND CORRECTING MISSING, MISCODED, AND OUT-OF-RANGE VALUES
FREQUENCIES
LOGIC CHECKS
KEY POINTS TO REMEMBER

RESOURCES


Introduction
LEARNING OBJECTIVES
At the end of the training, you will be able to:
1. Create a data dictionary that includes, at a minimum:
a. Variable names
b. Variable descriptions or labels
c. Variable types
d. Response options and allowable values
2. Clean the data
a. Identify errors, including duplications, missing data, miscodes,
and outliers
b. Use statistical software to identify and correct errors

ESTIMATED COMPLETION TIME

The workbook should take between 6 and 7 hours to complete.

TARGET AUDIENCE
The workbook is designed for FETP fellows who specialize in NCDs;
however, you can also complete the module if you are working in infectious
disease.

PRE-WORK AND PREREQUISITES

Before participating in this training module, you must complete training in:
- Basic epidemiology and surveillance
- The statistical software program your FETP is using (e.g., SPSS, Epi
  Info)

ABOUT THIS WORKBOOK AND THE ACTIVITY WORKBOOK

You will read information about creating a data dictionary and cleaning data
in the Participant Workbook. To practice the skills and knowledge learned
you will complete three exercises. To apply what you have learned you will
refer to the Activity Workbook and create a data dictionary and clean data
for an NCD study in your country.


ICON GLOSSARY
The following icons are used in this workbook:
Image Type | Image Meaning
Stop | A point at which you should consult a mentor or wait for the facilitator to provide locally relevant information about the topic
Activity | An activity or exercise that you should complete
Light bulb | A key idea to note and remember, or supplemental information

ACKNOWLEDGEMENTS
Many thanks to the following people from the Centers for Disease Control
and Prevention (CDC) who contributed to this module:

Fleetwood Loustalot, PhD, FNP, Andrea Neiman, MPH, PhD (Division
for Heart Disease and Stroke Prevention), and Edward Gregg, PhD
(Division of Diabetes Translation), for creating the hypertension case
study.

Indu Ahluwalia (Senior Scientist, Division of Reproductive Health,
National Centers for Chronic Disease Prevention and Health Promotion),
and Richard Dicker, MD, MPH, from the Centers for Global Health,
Division of Global Health Protection, for their subject matter expertise
and for reviewing the training module.

Some of the content of this module was taken from a training manual
developed by CDC's Division of Epidemiology and Surveillance Capacity
Development: Advanced Management and Analysis of Data Using Epi Info
for Windows: Risk Factors for Sexually Transmitted Infections in
Kuwadzana, Zimbabwe; 2006.


Section 1: Overview of Data Management

DATA MANAGEMENT PRACTICES
In the Creating an Analysis Plan module you learned how to develop an
analysis plan -- creating blank templates, or table shells, to use during data
analysis. Before you begin data analysis, there are two additional tasks to
complete, which you will learn in this module:
- Creating a data dictionary
- Cleaning the data
[Figure: module sequence: Creating an Analysis Plan, Managing Data, Analyzing and Interpreting Large Datasets, Data into Action]

DATA DICTIONARY
If you are analyzing data that you did not collect, you must first familiarize
yourself with the dataset. You will create a data dictionary, also called a
codebook, to understand the meaning of the collected data. It should
describe how the data are arranged in the computer file and what the
various numbers and letters mean.
Whether or not you collected the data yourself, you should always use the
data dictionary during data analysis so that the meaning of a variable and
the coding used will never be in question. The data dictionary is the place
where you will look up which codes correspond to each possible response.

CLEANING THE DATA

Every dataset contains some errors. Cleaning data is the process you will
use to identify inaccurate, incomplete, or improbable data, and then correct
it when possible. Data cleaning is a two-step process that includes
detection and correction.


Section 2: Data Dictionary

COMPONENTS OF A DATA DICTIONARY
A data dictionary should include, at a minimum:
- Variable names;
- Variable descriptions or labels;
- Variable types; and
- Response options and codes used to represent the response options.
Some data dictionaries also include the column from the questionnaire
where the variable can be found.

Variable Name
Identify each variable from the survey or questionnaire and give it a name.
Use the name to identify variables in the database and during the analysis.
Ideal features of a name:
- Easily identifies the question on the data collection form (if one is
  used) or type of information collected
- Begins with a letter
- Cannot end with a period
- Can have special symbols or characters
- Should be short, with a maximum length of 64 characters
- Limits the use of symbols
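
The naming rules above can be checked mechanically before a variable is added to the database. The following is a minimal Python sketch, not part of the workbook: the function name `check_variable_name` is hypothetical, and treating anything other than letters, digits, and underscores as a "symbol" is an assumption made for illustration. Statistical packages have their own naming rules, so confirm them in your software's documentation.

```python
import re

MAX_NAME_LENGTH = 64  # maximum length stated in the list above


def check_variable_name(name):
    """Return a list of problems found in a proposed variable name."""
    if not name:
        return ["name is empty"]
    problems = []
    if not name[0].isalpha():
        problems.append("does not begin with a letter")
    if name.endswith("."):
        problems.append("ends with a period")
    if len(name) > MAX_NAME_LENGTH:
        problems.append("longer than 64 characters")
    # Assumption: letters, digits, and underscores are ordinary characters;
    # everything else is reported as a symbol so that its use can be limited.
    symbols = sorted(set(re.findall(r"[^A-Za-z0-9_]", name)))
    if symbols:
        problems.append("uses symbols: " + " ".join(symbols))
    return problems


for candidate in ["MarStat", "1stVisit", "BP.", "Cntry Bth"]:
    print(candidate, check_variable_name(candidate))
```

An empty list means the name passed every check; a name such as `BP.` is reported twice, once for ending with a period and once for using a symbol.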

Variable Description or Label

Provide a description (or label) of the variable that explains the variable
name.

Variable (Field) Type

Indicate the type of variable (field). Most common types are:
- Numeric
- Text/alpha
- Date

Response Options (or Values)

Some variables use actual values. Other variables use codes that can be
text or numeric. Identify the accepted response options or values for each
variable. For example, it is common to use numeric coding for nominal
response options, such as marital status (see example below), or codes such as 1
for "yes," 0 or 2 for "no," and 9 or 99 for "unknown," "don't know," or
"missing."
For open-ended text fields, where there are a large number of possible
responses or characters of text such as the country of birth, you can
indicate that the response can contain up to a certain number of characters.
The following is a basic example of the first few lines of a data dictionary.
Note that the first three examples are actual values and the fourth example
is coded. Often, we code nominal responses and use the actual value
for numerical responses. Also notice that for the variable DOB, which
represents the patient's date of birth, a date field type is used; any date
between the first of January, 1900, and the 30th of November, 2009 (the
last date of the study), could be a valid value for that variable. If you had
certain age restrictions for your study, you could alter the allowable values
so that only dates that match those restrictions could be valid values.
Variable Name | Description | Field type | Response Options [1]
IDnum | ID number | Numeric | 0001-9999
DOB | Date of birth | Date (dd/mm/yyyy) | 01/01/1900 to 30/11/2009
CntryBth | Country of birth | Text | {up to 60 characters of text}
MarStat | Marital status | Numeric | 1 = Single; 2 = Married; 3 = Divorced; 4 = Widowed; 99 = Missing

[1] Some data dictionaries also include a fifth column for measures to record the type of
variables (scale, nominal, or ordinal).

Some variables in the data dictionary are derived variables. These
variables are created from other variables to enable more detailed analysis
of the data. For example, a BMI variable can be created using the weight
and height variables.
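
To make the example concrete, the four dictionary entries and the derived BMI variable can be written down in code. This is a minimal Python sketch under stated assumptions, not the workbook's method: the structure of `DATA_DICTIONARY` and the helper names `check_value` and `derive_bmi` are hypothetical, numeric values are assumed to be stored as numbers, and weight and height are assumed to be recorded in kilograms and meters (the workbook does not state units).

```python
from datetime import date, datetime

# Entries copied from the example data dictionary; the layout is illustrative.
DATA_DICTIONARY = {
    "IDnum": {"description": "ID number", "type": "numeric",
              "min": 1, "max": 9999},
    "DOB": {"description": "Date of birth", "type": "date",
            "min": date(1900, 1, 1), "max": date(2009, 11, 30)},
    "CntryBth": {"description": "Country of birth", "type": "text",
                 "max_length": 60},
    "MarStat": {"description": "Marital status", "type": "numeric",
                "codes": {1: "Single", 2: "Married", 3: "Divorced",
                          4: "Widowed", 99: "Missing"}},
}


def check_value(variable, value):
    """Return True if the value is allowed by the data dictionary."""
    entry = DATA_DICTIONARY[variable]
    if entry["type"] == "text":
        return isinstance(value, str) and len(value) <= entry["max_length"]
    if entry["type"] == "date":
        try:
            parsed = datetime.strptime(value, "%d/%m/%Y").date()
        except ValueError:
            return False  # not a real dd/mm/yyyy date, e.g. month = 15
        return entry["min"] <= parsed <= entry["max"]
    if "codes" in entry:
        return value in entry["codes"]
    return entry["min"] <= value <= entry["max"]


def derive_bmi(weight_kg, height_m):
    """Derived variable: BMI = weight (kg) divided by height (m) squared."""
    if weight_kg is None or height_m is None or height_m <= 0:
        return None
    return weight_kg / height_m ** 2


print(check_value("DOB", "24/15/1980"))   # month 15 is not a valid date
print(check_value("MarStat", 5))          # 5 is not a listed code
print(round(derive_bmi(70, 1.75), 1))
```

Keeping the allowable values in one place means the same definitions can be reused for data entry checks and for cleaning.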

Stop

Let the facilitator or mentor know you are ready for the group discussion.

KEY POINTS TO REMEMBER

Use the space below to record any key points from the facilitator-led
discussion:


Activity

Practice Exercise #1 (Estimated Time: 30 minutes)

Background:
For this exercise you will work individually, in pairs, or in a small group to
create a data dictionary based on the information provided in Figure 1, a
handout provided by your facilitator, and the case study and questionnaire
from the Creating an Analysis Plan module.
Instructions:
1. Read the information in Figure 1.
2. Review the handout (questionnaire and blood pressure and
height/weight measurements).
3. Use the table below to create a data dictionary for
questions #13-18. (The first 12 lines have already been entered.)
Note: Selections included in the Measure column in the data dictionary
below are reflective of choices available in SPSS (i.e., Nominal, Ordinal,
and Scale). Additional columns may be used if the data dictionary is
created in a different format.
Ask your facilitator to review your work.
Figure 1: Hypertension case study
The Panel Members from the Ministry of Health have posed three questions
that need to be addressed. A recent survey was conducted in your country
that may provide the data to support your responses. After reviewing the
information about the survey (e.g., sample, methodology), you need to
review the questions in the survey that could be used for analysis.
Common demographic and descriptive questions and measurements are
frequently included in surveys. In addition, questions and measurements
about the primary outcome of interest (i.e., hypertension) would need to be
reviewed.


Number | Variable Name | Type | Label | Value | Measure
1 | SEQN | Numeric | Identification number | | Scale
2 | AGEY | Numeric | Age in years | 555 = Missing; 777 = Refused; 999 = Don't Know | Scale
3 | SEX | Numeric | Gender | 1 = Female; 2 = Male; 6 = Missing; 77 = Refused; 99 = Don't Know | Nominal
4 | EDU | Numeric | Educational level | 1 = Less than High School; 2 = High School Graduate; 3 = Some College; 4 = College Graduate; 5 = <25 years of age; 6 = Missing; 77 = Refused; 99 = Don't Know | Nominal
5 | ETH5C | Numeric | Racial/ethnic group | 1 = non-Hispanic white; 2 = non-Hispanic black; 3 = Hispanic; 4 = Other; 6 = Missing; 77 = Refused; 99 = Don't Know | Nominal
6 | HCP | Numeric | Current health care provider | 1 = Yes, one; 2 = Yes, more than one; 3 = No; 6 = Missing; 77 = Don't Know; 99 = Refused | Nominal
7 | BPTOLD1 | Numeric | Ever told you had high blood pressure | 1 = Yes; 2 = No; 6 = Missing; 77 = Refused; 99 = Don't Know | Nominal
8 | BPTOLD2 | Numeric | Told you had high blood pressure on 2+ occasions | 1 = Yes; 2 = No; 6 = Missing; 77 = Refused; 99 = Don't Know | Nominal
9 | BPMED | Numeric | Taking blood pressure medication | 1 = Yes; 2 = No; 6 = Missing; 77 = Refused; 99 = Don't Know | Nominal
10 | BPSYS | Numeric | Systolic blood pressure | 6666 = Missing; 9999 = Refused | Scale
11 | BPDIA | Numeric | Diastolic blood pressure | 6666 = Missing; 9999 = Refused | Scale
12 | BPHI | Numeric | BP >=140/90 | 1 = Yes; 2 = No; 6 = Missing | Nominal

Activity
Take out the activity workbook and complete skill assessment #1.
Then continue reading in the participant workbook.

Section 3: Cleaning Data

OVERVIEW
Few datasets are free of errors and missing values. It is important to review
the dataset to identify errors before beginning analysis. When you find
errors, correct them in the dataset and document the changes made.
Failure to correct errors may result in false analysis results and invalid
conclusions. Even after you clean the data you may find additional errors
during analysis. You will also correct (and document) those errors.
When you clean data, look at the distribution (frequency of values) for each
variable to:
- Assess whether the data were entered accurately and consistently
- Check for completeness of data or missing values
- Determine whether to create or collapse data categories
If more than one person collected or entered the data, you should
familiarize yourself with these aspects of the data before analyzing the data.

DOCUMENTING ERRORS AND CHANGES TO THE DATASET

A key principle of data management is to write down everything, such as:
- Changes to the dataset
- Decisions about how to assess certain fields
This documentation will ensure that you make consistent decisions and will
provide a reference for those who may have questions about your analysis.


Tip
It is difficult to remember to return to a problem, so fix an error as soon as it
occurs (or you become aware of it).
This can be particularly important when you are entering data from the
field.


Use a table such as the one below to document changes made to the
dataset for missing, miscoded, and out-of-range values:
Variable name | Format | Problem | Record IDs affected and resolution

The following is an example of how you can use this table during data
cleaning:
Variable name | Format | Problem | Record IDs affected and resolution
DOB | dd/mm/yyyy | Missing data | 0012 - not collected; action: none. 0075 - unclear writing; action: left blank. 0103 - not entered; action: entered from questionnaire.
DOB | dd/mm/yyyy | Improbable value | 0024 - month=15; data was mis-entered; action: corrected in database.

Use a table such as the one below to document changes made to the
dataset for duplicate records:
Number | Primary ID | Secondary ID | Problem | Record IDs affected and resolution
1

SSN

duplicate

3125 - removed

Participant ID
number

duplicate

3241 removed

Missing

duplicate

3278 - removed

Social
Security
Number
(SSN)


Tip
Because these edits permanently change the dataset it is important to
make a working copy of the original dataset. Make changes to the
working copy only. If you make any mistakes, you can access the original
database and start over.
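
The two habits described in this section, working on a copy and writing down every change, can be combined in a small script. The following is a minimal Python sketch rather than a prescribed procedure: the function names, the `_working` file suffix, and the CSV layout of the change log are assumptions chosen to mirror the documentation table above.

```python
import csv
import shutil
from pathlib import Path

LOG_HEADER = ["Variable name", "Format", "Problem", "Record ID", "Resolution"]


def make_working_copy(original_path):
    """Copy the original dataset and return the path of the working copy."""
    original = Path(original_path)
    working = original.with_name(original.stem + "_working" + original.suffix)
    shutil.copyfile(original, working)  # the original file is left untouched
    return working


def log_change(log_path, variable, data_format, problem, record_id, resolution):
    """Append one row to the change log, writing the header if the log is new."""
    log = Path(log_path)
    is_new = not log.exists()
    with log.open("a", newline="") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(LOG_HEADER)
        writer.writerow([variable, data_format, problem, record_id, resolution])
```

For example, after creating the working copy, the improbable date of birth from the table above could be recorded with `log_change("changes.csv", "DOB", "dd/mm/yyyy", "Improbable value", "0024", "month=15; corrected in database")`, where `changes.csv` is a hypothetical file name.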

COMMON SOURCES AND TYPES OF ERRORS IN EPIDEMIOLOGIC DATA

Some of the most common errors occur during the data collection phase.
Other sources of error are measurement errors, improper functioning of
measurement equipment, and interviewer mistakes (often due to inadequate
training). The respondents may also cause errors if they provide the
incorrect response. This can occur if they incorrectly read or interpret a
question, or if they intentionally provide a false answer. Errors can also
occur after the data have been collected, most often by data entry mistakes
or coding errors.
Common types of data errors are entering duplicate information, miscoding,
assignment of missing values, and inclusion of out-of-range values.

DETECTING AND CORRECTING DUPLICATE RECORDS

Duplicate records can occur for many reasons:
- Data entry errors in which the same case is accidentally entered more
  than once.
- Multiple cases share a common primary ID value but have different
  secondary ID values, such as family members who live in the same
  house.
- Multiple cases represent the same case but with different values for
  variables other than those that identify the case. For example, the same
  person or company makes multiple purchases of different products or at
  different times.
Duplications can arise in several ways; most often, they happen when the
same person's information is entered into the same database more than
once. Less often, a person might accidentally be enrolled in the study twice,
or be asked to complete the same interview or questionnaire twice.
Good record keeping and study organization will usually prevent these types
of errors from occurring. Sometimes duplications arise when multiple
databases are merged into one.
To check for duplicate records:
1. Identify how many records are in the database. Use your statistical
software to check the record count.
2. Determine if the number of records matches the number of
questionnaires.
3. If the number of records is more than the number of questionnaires, run
a frequency listing to look for multiple records with the same identifying
information (such as ID number or name).
4. If there are two records with the same ID number or name, select the
records and examine them to determine if they are identical (a duplicate
record) or whether an ID number or name was entered incorrectly.
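
The four steps above can be sketched in a few lines of code. This is an illustrative Python example, not output from SPSS or Epi Info: the records, the field names `IDnum` and `SEX`, and the questionnaire count are made up, and each record is assumed to be a dictionary of variable names and values.

```python
from collections import defaultdict


def find_duplicate_ids(records, id_field="IDnum"):
    """Classify every ID that appears in more than one record."""
    groups = defaultdict(list)
    for record in records:
        groups[record[id_field]].append(record)

    findings = {}
    for record_id, group in groups.items():
        if len(group) > 1:
            # Identical records suggest double entry; differing records
            # suggest that an ID was entered incorrectly.
            identical = all(record == group[0] for record in group[1:])
            findings[record_id] = (
                "identical duplicate" if identical else "same ID, different values"
            )
    return findings


records = [
    {"IDnum": 1, "SEX": 1},
    {"IDnum": 2, "SEX": 2},
    {"IDnum": 2, "SEX": 2},   # entered twice
    {"IDnum": 3, "SEX": 1},
    {"IDnum": 3, "SEX": 2},   # possibly a mistyped ID
]
number_of_questionnaires = 4  # hypothetical count of paper forms

# Steps 1 and 2: compare the record count with the number of questionnaires.
if len(records) > number_of_questionnaires:
    # Steps 3 and 4: list repeated IDs and examine whether they are identical.
    print(find_duplicate_ids(records))
```

The function only flags records; deciding which record is correct still requires going back to the questionnaires.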

Deleting Duplicate Records

When you have identified the correct record, remove the inaccurate or
incomplete duplicate record from the dataset so that it will not be a problem
in your analysis.

Stop

Let the facilitator or mentor know you are ready for the group discussion.


KEY POINTS TO REMEMBER

Use the space below to record any key points from the facilitator-led
discussion:


Activity

Practice Exercise #2 (Estimated time: 30 minutes)

Background:
For this exercise you will practice checking for duplicate records for the
hypertension case study.
Instructions:
1. Check for and identify duplicate records.
2. Correct any errors and document in the table below.
3. Let your facilitator know when you have completed the exercise.
Figure 2:
A cross-sectional survey was conducted during 2009-2010 among adults
aged 18 years and older (N = 993). The survey was conducted in all
provinces (or regions) in Country X. The purpose of the survey was to
provide estimates of the current health of the adults in the country as well
as health conditions in the country. Assessments of high blood pressure
were included in the survey. Data were collected by trained interviewers in
the homes of participants using paper and pencil questionnaires and
measurements. Responses were checked for completion and entered into
a database manually. The dataset needs to be reviewed for errors.
Use the table below to document the problem and resolution:


Number | Primary ID | Secondary ID [2] | Problem | Record IDs affected and resolution

DETECTING AND CORRECTING MISSING, MISCODED, AND OUT-OF-RANGE VALUES
Very few datasets are 100% complete or accurate. Usually there are a few
missing, miscoded or out-of-range values. Since several people can collect
and record survey data, errors can occur during the data collection stage, in
recoding forms, or during data entry. Values can be coded incorrectly or
data from one variable column can be mistakenly entered under an adjacent
column. It can be easy to detect errors in data entry or column shifts if you
have the completed survey forms available during data cleaning.

Sometimes missing data occur randomly and sometimes they occur in
patterns. For example, you may find that many records are missing data on
the same variable. When there are patterns to missing data, this may
provide clues to why the data are missing. (Please see the example below.)

[2] Many datasets have secondary IDs. This dataset does not.
A large amount of missing data for a single subject may indicate that the
interview or questionnaire was terminated early. Or it might mean that the
person was lost to follow-up. Sometimes data are intentionally missing,
such as a skip pattern on a questionnaire, to avoid asking subjects
irrelevant questions.
Out-of-range values, known as outliers, are values that fall outside the
range of values of the majority of responses. You will most often detect
outliers by examining descriptive statistics for each variable. This includes
the minimum and maximum values, measures of central tendency, and
frequency distributions, which are represented graphically by histograms.
The frequency distribution of a continuous variable such as age might
include a few values that seem quite low (young) or high (old). You can sort
the variable (ascending or descending) to determine whether the values are
accurate (true outliers) or the result of miscoding.
For example, a survey of reproductive-age women may include respondents
recorded as less than 10 years of age or over 65 years of age; both could be
outliers because most women in the survey would be less than 45 years old.
The survey should not have included anyone above age 45.
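
The screening described above (minimum, maximum, central tendency, then a closer look at extreme values) can be sketched as follows. This is an illustrative Python example with made-up ages; the function names are hypothetical, and the lower bound of 15 years is an assumption, since the text only states that nobody above age 45 should have been included.

```python
import statistics


def summarize(values):
    """Descriptive statistics used to screen a numeric variable."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


def out_of_range(records, variable, low, high, id_field="IDnum"):
    """Return (record ID, value) pairs whose value falls outside [low, high]."""
    return [
        (record[id_field], record[variable])
        for record in records
        if record[variable] is not None
        and not (low <= record[variable] <= high)
    ]


# Made-up ages from a survey of reproductive-age women.
women = [
    {"IDnum": 1, "age": 8},
    {"IDnum": 2, "age": 23},
    {"IDnum": 3, "age": 31},
    {"IDnum": 4, "age": 44},
    {"IDnum": 5, "age": 67},
]

print(summarize([w["age"] for w in women]))
print(out_of_range(women, "age", low=15, high=45))
```

The flagged record IDs tell you which questionnaires to pull; the code does not decide whether a value is a true outlier or a miscode.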


Example of patterns in missing data (missing high blood pressure
information for persons interviewed by Interviewer #3)
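
A pattern like the one in this example can be found by counting missing values within groups. The following is a minimal Python sketch with made-up records: the field name `interviewer` is hypothetical, `BPSYS` is borrowed from the hypertension data dictionary, and missing values are assumed to be stored as `None`.

```python
from collections import Counter


def missing_by_group(records, variable, group_field):
    """Count records with a missing value for the variable in each group."""
    return Counter(
        record[group_field] for record in records if record.get(variable) is None
    )


interviews = [
    {"interviewer": 1, "BPSYS": 118},
    {"interviewer": 2, "BPSYS": 142},
    {"interviewer": 3, "BPSYS": None},
    {"interviewer": 3, "BPSYS": None},
    {"interviewer": 1, "BPSYS": 127},
]

print(missing_by_group(interviews, "BPSYS", "interviewer"))
```

If nearly all of the missing values belong to one interviewer, the cause is more likely to be a problem with training or equipment than random nonresponse.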


FREQUENCIES
In addition to using a graph (or histogram) to quickly detect errors, you can
examine the data in each variable by conducting a frequency distribution.
The statistical software program you use will have a frequency command
that allows you to select all variables. Review the individual variables and
look for values that are out-of-range or inconsistent with other data in the
record or where data are missing.

Identifying Records with Missing Values

The best way to ensure the statistical software displays missing values in
your frequency distribution is to select a command to show missing values.
Alternatively, you can identify variables for which you expect to have a
certain number of responses and those for which you do not expect to see
missing values. For example, you might expect responses to all the basic
demographic questions from each person interviewed in the study. If you
interviewed 2,242 people, you would expect 2,242 responses to each
demographic question.
In the table below, when you look at the frequency for sex, you see the
following results:
Sex | Frequency | Percent | Cum Percent
Female | 1070 | 47.8% | 47.8%
Male | 1170 | 52.2% | 100%
Total | 2240 | 100.0% |

Note that there are only 2,240 responses rather than 2,242. This means
that two records had missing values on gender.
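
The same comparison can be reproduced with a frequency count that reports missing values as their own category. This is a minimal Python sketch, not the frequency command of any statistical package: the function name is hypothetical, missing values are assumed to be stored as `None` or an empty string, and the data are constructed to match the counts in the example.

```python
from collections import Counter


def frequency_table(values, missing_label="Missing"):
    """Map each value to (count, percent of all records), including missing."""
    counts = Counter(
        missing_label if value in (None, "") else value for value in values
    )
    total = len(values)
    return {
        value: (count, round(100 * count / total, 1))
        for value, count in counts.items()
    }


# 2,242 interviews: 1,070 female, 1,170 male, 2 with no value entered for sex.
sex = ["Female"] * 1070 + ["Male"] * 1170 + [None] * 2

table = frequency_table(sex)
print(table)
print("Expected:", 2242, "Found:", sum(count for count, _ in table.values()))
```

Because the percentages here use all 2,242 records as the denominator, they differ slightly from the table above, which is based on the 2,240 non-missing responses.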

Correcting Missing Data

As a first step, find the questionnaires that are missing a value for this
variable and determine if the data were missing from the questionnaire or
were not entered. Use the appropriate software commands to select all
records with the variable name (e.g., Sex) equal to Missing; identify the
records to review. Return to the questionnaires and determine if the correct
information is available.
If the missing data are not the result of a data entry error, you can try to
obtain the information by contacting the study participant(s); however, this
is often not possible. There are other approaches to dealing with missing
data, as described below.

Handling Missing Values in Analysis

Some investigators use complete case analysis when the amount of missing
data is small (less than 10% for each variable). This method deletes
records with missing data from the analysis so that the analysis dataset
includes only records with complete data.
Other investigators only use the data that are available. Records with
missing values for just a few variables are not included in the analyses
involving those variables; they are included in analyses of variables for
which values are available. For example, if a value for sex is missing from
a record, you may choose to exclude that record for analysis of the sex
variable.
Another method is not to analyze the variables that have a large amount of
missing data. Of course, if the variable is important for your analysis, and is
related to your hypothesis, then do not use this method!
For datasets with larger amounts of missing data (more than 10% for a
variable), you can use imputation techniques. Imputation is a method for
assigning values for missing data by making statistical inferences from
similar records with known values. Consult a statistician before undertaking
imputation.
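
The first two approaches, and the 10% threshold used to choose between them, can be sketched as follows. This is an illustrative Python example with made-up records: the function names are hypothetical, and missing values are assumed to be stored as `None`, so coded missing values (such as 99) would need to be converted first. Imputation is deliberately not shown.

```python
def percent_missing(records, variable):
    """Percentage of records with no value (None) for the variable."""
    missing = sum(1 for record in records if record.get(variable) is None)
    return 100 * missing / len(records)


def complete_cases(records, variables):
    """Complete case analysis: keep records with a value for every variable."""
    return [
        record for record in records
        if all(record.get(variable) is not None for variable in variables)
    ]


def available_cases(records, variable):
    """Available case analysis: keep records with a value for this variable."""
    return [record for record in records if record.get(variable) is not None]


survey = [
    {"SEX": 1, "AGEY": 30},
    {"SEX": None, "AGEY": 41},
    {"SEX": 2, "AGEY": None},
    {"SEX": 2, "AGEY": 52},
]

print(percent_missing(survey, "SEX"))
print(len(complete_cases(survey, ["SEX", "AGEY"])))
print(len(available_cases(survey, "AGEY")))
```

Complete case analysis drops a record from every analysis, while available case analysis drops it only from analyses that involve the missing variable, so the number of records can differ from one table to the next.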

Identifying Records with Miscoded Values

You can avoid miscoded values if you use a data entry screen with
allowable values for text variables or range checks for numeric variables.
By reducing the opportunity to enter data incorrectly you will reduce the
need to check for miscoded values.

Correcting Miscodes
The simplest way to correct miscodes is to first look at the original data
source from the subject in question (such as the questionnaire) to determine
the true value. Then make the correction in the database. Remember to
write down the changes made.
If you do not have access to the original data source, recontact the subject
to confirm that the information you have is correct (or incorrect);
unfortunately, this option is often not feasible. Another common approach is
to recode the value as missing (999) and deal with it the same way you
managed other missing data.
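
Recoding a miscoded value as missing can be sketched as follows. This is a minimal Python example under stated assumptions: the function name is hypothetical, the allowable codes for SEX and record ID 51929 are taken from the hypertension data dictionary and Practice Exercise #3, and the missing code is passed in because it differs between variables (999 in the text above, 6 for SEX in that dictionary). The function changes the records it is given, so it should be run on the working copy only.

```python
def recode_miscodes(records, variable, allowed_codes, missing_code,
                    id_field="SEQN"):
    """Recode values outside allowed_codes as missing; return the changes."""
    changes = []
    for record in records:
        value = record[variable]
        if value not in allowed_codes:
            # Keep (record ID, old value, new value) for the change log.
            changes.append((record[id_field], value, missing_code))
            record[variable] = missing_code
    return changes


working_copy = [
    {"SEQN": 51928, "SEX": 1},
    {"SEQN": 51929, "SEX": 3},   # 3 is not an allowable code for SEX
    {"SEQN": 51930, "SEX": 2},
]

changes = recode_miscodes(
    working_copy, "SEX", allowed_codes={1, 2, 6, 77, 99}, missing_code=6
)
print(changes)
```

The returned list contains exactly the information needed for the documentation table: which record changed, what the value was, and what it became.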

Identifying Records with Out-of-Range Values

Some variables may contain values that seem out-of-range compared to the
responses from the other participants in the study. These are often
numerical values that may have been incorrectly coded. When you run
frequencies on the variables you should notice these out-of-range values or
outliers; however, some data errors only appear when you compare two
variables. A scatterplot shows the value of one variable on the
X axis and the value of the other variable on the Y axis. The points that
stray from the bulk of the scatterplot points represent the outliers. Many
statistical software programs have this functionality.
The scatterplot below shows the weights of the Jordan BRFSS participants
by age and sex. There appear to be two extreme weight values
among the women.
[Figure: "Weight of 2004 Jordan BRFSS Participants by Height and Sex"; panels: Female, Male; y-axis: Weight in Kilograms; x-axis: Age in Years; fitted values shown]

Handling Outliers
To determine whether an outlier is a true outlier or an error (e.g., data entry
error or miscode), look at the original data source from the subject in
question (such as the questionnaire). If it is an error, make the correction in
the database. Remember to write down the changes made.


When you have an outlier that you cannot resolve by looking at the
questionnaire, decide whether to verify the data or leave it as entered. This
will depend on the effort required, the importance of that particular variable,
and the overall size of the dataset. For a very small dataset and a key
variable, it is probably worth the effort to get it right. In another
circumstance, having a missing value for such a variable may be
acceptable.

Tip
It is never okay to change a value just because it does not seem valid.


LOGIC CHECKS
A logic check compares the responses to two different variables to
determine whether they are consistent. One type of logic check looks for
impossibilities (usually a typo or data misentry). An example of this is a
date of discharge for a given hospital stay that is earlier than the date of
admission for the same stay. Similarly, we often compare a calculated age
based on date of birth to stated age in years.
Another type of logic check is looking for inconsistencies, such as
comparing the hysterectomy (or prostate cancer) variable with the gender
variable. For example, if a question were asked about a diagnosis of
prostate cancer and the reply is marked 'yes', this would not be compatible
with sex = F. You cannot solve this without doing additional investigation.
Maybe the participant was female and the code for prostate cancer is
incorrect. Or maybe the participant did have prostate cancer but the sex is
miscoded.
A third type of logic check is ensuring that skip patterns have been
followed. For example, a respondent answers "never smoked cigarettes,"
then answers that she started smoking in 2004.
Some statistical software programs can incorporate logic checks so that
improbable values are flagged for the investigator to examine.
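
The impossibility and inconsistency checks described above can be written as simple comparisons between two fields of the same record. The following is a minimal Python sketch: all field names (`admission_date`, `discharge_date`, `dob`, `interview_date`, `stated_age`, `sex`, `prostate_cancer`) are hypothetical, and the record is made up. The function only flags problems; as noted above, resolving them requires additional investigation.

```python
from datetime import date


def age_on(birth_date, reference_date):
    """Age in completed years on the reference date."""
    had_birthday = (reference_date.month, reference_date.day) >= (
        birth_date.month, birth_date.day)
    return reference_date.year - birth_date.year - (0 if had_birthday else 1)


def logic_check(record):
    """Return descriptions of the logical problems found in one record."""
    problems = []
    if record["discharge_date"] < record["admission_date"]:
        problems.append("discharge date is earlier than admission date")
    if age_on(record["dob"], record["interview_date"]) != record["stated_age"]:
        problems.append("stated age does not match age calculated from DOB")
    if record["sex"] == "F" and record["prostate_cancer"] == "yes":
        problems.append("prostate cancer reported for a female participant")
    return problems


suspect = {
    "admission_date": date(2009, 2, 10),
    "discharge_date": date(2009, 2, 3),
    "dob": date(1970, 5, 20),
    "interview_date": date(2009, 3, 1),
    "stated_age": 40,
    "sex": "F",
    "prostate_cancer": "yes",
}

for problem in logic_check(suspect):
    print(problem)
```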


For example, refer to Question 7.1 in the Jordan BRFSS questionnaire
below. It asks if the participant has smoked at least 100 cigarettes in his or
her lifetime. The answer choices are 1 for Yes and 2 for No. The next
question (and several more that follow) should only be answered by
respondents who answered Yes to question 7.1. When you perform a logic
check, you find that one respondent (# 2018) answered No to being a
smoker (question 7.1=2) and Yes to currently smokes cigarettes (question
7.2=1). You would need to investigate this inconsistency by checking
questionnaire #2018.
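A skip-pattern check for this example could look like the Python sketch below. The field names q7_1 and q7_2 and the use of None for a skipped question are assumptions. Record 2018 mirrors the respondent described above, and the other two records are hypothetical.

```python
# Codes follow the text: 1 = Yes, 2 = No. None means the question was skipped.
responses = [
    {"id": 2017, "q7_1": 1, "q7_2": 1},
    {"id": 2018, "q7_1": 2, "q7_2": 1},     # answered 7.2 despite No on 7.1
    {"id": 2019, "q7_1": 2, "q7_2": None},  # skip pattern followed
]

def violates_skip_pattern(rec: dict) -> bool:
    """Question 7.2 should be blank unless question 7.1 was answered Yes (1)."""
    return rec["q7_1"] != 1 and rec["q7_2"] is not None

# IDs of the questionnaires to pull and review by hand.
skip_violations = [r["id"] for r in responses if violates_skip_pattern(r)]
print(skip_violations)
```

The resulting list tells you which paper questionnaires to check. It does not say whether question 7.1 or question 7.2 holds the error.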

Stop

Let the facilitator or mentor know you are ready for the group discussion.

KEY POINTS TO REMEMBER

Use the space below to record any key points from the facilitator-led
discussion:

Activity

Practice Exercise #3 (Estimated time: 45 minutes)

Background:
For this exercise you will detect and correct errors for the hypertension
case study dataset.
Instructions:
1. Check for missing values in basic demographic variables, such as age
and sex, by running frequencies.
2. Assuming you cannot contact the study participants, describe how
you will resolve the missing data in the table below (next to #1).
3. Check for miscodes by running frequencies and logic checks.
4. Correct the miscodes by referring to the questionnaire.
5. Document in the table below. (Number 2 has already been filled in.)
6. Make a scatterplot to identify any out-of-range values (outliers).
7. Correct the outliers by referring to the questionnaire.
8. Document in the table below.
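The frequency checks in steps 1 and 3 are normally run in your statistical software. As a rough sketch of the idea in Python, the snippet below tabulates every distinct value of SEX and lists the records that are missing or miscoded. The set of valid codes is an assumption (the data dictionary is the authority), record 51929 mirrors the example row in the exercise table, and the remaining records are hypothetical.

```python
from collections import Counter

# Assumed valid codes for SEX; confirm against the data dictionary.
VALID_SEX_CODES = {1, 2}

# Record 51929 mirrors the example row in the table; the rest are hypothetical.
records = [
    {"id": 51927, "SEX": 1},
    {"id": 51928, "SEX": 2},
    {"id": 51929, "SEX": 3},     # miscode: not a valid response option
    {"id": 51930, "SEX": None},  # missing
]

# Frequency table: each distinct value, including missing, with its count.
frequencies = Counter(r["SEX"] for r in records)
for value, count in sorted(frequencies.items(), key=lambda kv: str(kv[0])):
    print(value, count)

# Record IDs to document and resolve.
missing_ids = [r["id"] for r in records if r["SEX"] is None]
miscoded_ids = [r["id"] for r in records
                if r["SEX"] is not None and r["SEX"] not in VALID_SEX_CODES]
print("missing:", missing_ids)
print("miscoded:", miscoded_ids)
```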

Number  Variable name  Format   Problem                  Record IDs affected and resolution
1
2       SEX            Numeric  Miscoded (response = 3)  51929 changed to missing
3
4
5
6
7
8
9

Activity
Take out the activity workbook and complete skill assessment #2.

Resources
For more information on topics found in this workbook:
Centers for Disease Control and Prevention, Division of Epidemiology and
Surveillance Capacity Development. Advanced Management and Analysis
of Data Using Epi Info for Windows: Risk Factors for Sexually Transmitted
Infections in Kuwadzana, Zimbabwe; 2006.
Tulane University: Practical Analysis of Nutritional Data. Available at
http://www.tulane.edu/~panda2/Analysis2/datclean/dataclean.htm#1.


