Exercises
b) How would you plan such a study? Here you will need to describe your study
design, who should be included in the study, how you would measure your exposure
and your outcome (you may have to google this), etc.
The main idea behind this exercise is to spend some time on discussing different
aspects of the planning of such a study, so don’t rush it!
A
Which of these designs can be used to investigate the research question above? Give
pros and cons of each design.
B
Choose one of these designs, describe briefly what you would do and argue why you
find this most appropriate. Note that there are several possible answers here.
Exercise 3: Entering data + descriptive analysis of
continuous data (Friday 10.11)
Getting started…
Start the computer and log on to your VMware Horizon Client. Once inside, choose
Statistics fullscreen (Statistikk fullskjerm). Start Stata from the Start menu.
Alternatively, if you have downloaded Stata to your own computer, just start it from
the Start menu.
A
The following data contain the weight (in kilograms) of 20 students:
Girls:
50 75 70 74 68 83 65 66 65 53
Boys:
65 75 84 55 73 95 72 94 67 65
The first thing to do is to define the two variables Sex and Weight.
Click Data – Create or change – Create new variable on the menu line.
Give a Variable name (Sex) and choose "Fill with missing data" and click OK.
Choose the Data Editor window (Edit mode) from the Stata menu.
You can now enter your data.
When you enter Sex, you may code girl = 1 and boy = 2.
I have also included an ID variable to identify each subject. It takes the values
1-20. This is not very important in this case, but it is a good habit and can prove
very useful when you deal with larger data sets.
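The data can also be entered with commands instead of the Data Editor. A minimal sketch, assuming the variable names ID, Sex and Weight and the label name sexl used in this exercise; the last lines define and attach the value labels for Sex:

```stata
clear
* One row per student: ID, Sex (1 = girl, 2 = boy), Weight in kg
input ID Sex Weight
 1 1 50
 2 1 75
 3 1 70
 4 1 74
 5 1 68
 6 1 83
 7 1 65
 8 1 66
 9 1 65
10 1 53
11 2 65
12 2 75
13 2 84
14 2 55
15 2 73
16 2 95
17 2 72
18 2 94
19 2 67
20 2 65
end

* Define the value label and attach it to Sex
label define sexl 1 "girl" 2 "boy"
label values Sex sexl

* Check the result
list in 1/5
tabulate Sex
```

Typing the data as a short do-file has the advantage that the data entry can be rerun and checked later.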
Value labels:
We would like to be able to link the correct gender to the coding of the variable 'Sex'.
In the Data Editor window, right-click "Sex", choose Data – Value labels – Manage
value labels.
In the next window, click "Create label".
Next, give a label name (e.g. sexl) and give labels to the values '1' and '2' (girl and
boy).
Click OK
Finally, we need to link the labels to our variable 'Sex'. We do that by right-clicking
"Sex" again, choose Data – Value labels – Attach label to variable Sex. Choose 'sexl'.
B
So we have two variables: Sex which is categorical and Weight which is continuous.
We are mainly interested in the weight data, and will use sex to group the data.
On the menu line, choose Statistics – Summaries, tables, and tests – Summary and
descriptive statistics – Summary statistics.
Choose Weight under Variables, by clicking the small arrow to the right. Click also
Display additional statistics under Options.
Choose by/if/in on top, click Repeat command by groups, and choose Sex as group.
Click OK.
We would also like to create a histogram of the weight data.
On the menu line, choose Graphics – histogram. Choose Weight as Variable and
choose Percent under Y axis. Go to By on top, and click Draw subgraph …. Choose
Sex and click OK.
In the table of descriptive statistics, which values do you know the meaning of? Do
the boys and girls appear to be different?
C
You will often need to compare observed distributions with the theoretical normal
distribution. For that purpose, we add a normal curve to the histograms: Redo the
histogram as above, but in addition, choose Density plots on top and click Add
normal-density plot.
D
Try to create a Box plot. Choose Graphics on the menu line and go from there. What
can we learn from the Box plot? (Use the Help graph box for a description of Box
plots – move down to Description.) Can you identify any ”suspicious” observations?
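The menu steps in B to D correspond to a handful of commands. The sketch below rebuilds the 20 weights in a compact way so that it runs on its own; if the data from A are already in memory, the data-building lines at the top can be skipped.

```stata
clear
set obs 20
* Weights in the order listed in A: girls first, then boys
generate Weight = real(word("50 75 70 74 68 83 65 66 65 53 65 75 84 55 73 95 72 94 67 65", _n))
* 1 = girl (first ten rows), 2 = boy (last ten rows)
generate Sex = cond(_n <= 10, 1, 2)

* B: detailed summary statistics and histograms, separately by sex
bysort Sex: summarize Weight, detail
histogram Weight, percent by(Sex)

* C: the same histograms with a normal curve added
histogram Weight, percent normal by(Sex)

* D: box plots, one per sex
graph box Weight, over(Sex)
```

The detail option is the command counterpart of "Display additional statistics" in the menu.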
This is a real data file, condensed from a study that was conducted by a group of
graduate students in Educational Psychology. The study was designed to explore the
factors that impact on respondents' psychological adjustment and wellbeing. The
survey contained a variety of validated scales measuring constructs that the extensive
literature on stress and coping suggests influence people's experience of stress. These
scales are, however, not included in the current data file. The survey was distributed
to members of the general public in Melbourne, Australia and surrounding districts.
The final sample size was 439, consisting of 42 percent males and 58 percent females,
with ages ranging from 18 to 82 (mean=37.4).
The variable source denotes each person’s main source of stress. To see the coding of
this variable, open the Data Editor window (Edit mode) from the Stata menu, right-
click source and choose Data – Value labels – Manage value labels and click on the
relevant 'plus'-sign.
We wish to assess important sources of stress, and whether there are gender differences.
A
We will first evaluate the common frequency distribution. Choose Statistics –
Summaries, tables, and tests – Frequency tables – One-way table. Choose source
under ‘Categorical variable’ and click OK.
It is also important to be aware of missing values. Redo the procedure above, but click
"Treat missing values like other values" in the panel.
Next, we want to create a Bar chart of the same data. Choose Graphics – Bar chart and
tick "Graph of percent of frequencies within categories". Under “Categories”, tick
Group 1 and add source as grouping variable. Alternatively, you may run the command
graph bar, over(source). Here, the labels for the different sources of stress appear a
bit messy. We can fix that by using the graph editor (in the graph, choose File – Start
graph editor): click the legend (the messy part below the graph) and choose
'Label angle 45°' on top.
B
It is interesting to do the analyses separately for men and women. We can use the
following procedure:
Statistics – Summaries, tables, and tests – Frequency tables – Two-way tables with
measures of association. Choose source on the rows and sex on the columns. Tick the
"Within-column relative frequencies" box (why?). Click OK.
We can also ask for a separate Bar chart for each gender with the command
graph bar, over(source) by(sex). In the menu, do as above, but also click “By” on the
top line, tick “Draw subgraphs …” and choose sex under “Variables”.
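For reference, the menu steps in A and B map onto the commands below. This is a sketch that assumes the survey data are already open in Stata and that the variables are named source and sex as in the text.

```stata
* A: frequency table of main source of stress, without and with missing values
tabulate source
tabulate source, missing

* Bar chart of percentages, with the category labels rotated 45 degrees
graph bar, over(source, label(angle(45)))

* B: two-way table with column percentages, and one bar chart per gender
tabulate source sex, column
graph bar, over(source, label(angle(45))) by(sex)
```

The label(angle(45)) suboption does the same job as rotating the labels in the graph editor, but it is saved with the command.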
C
You will find a variable named age which contains the age of all the subjects.
For practical data analysis, we will categorize age into three groups: 18-25 years,
26-40 years and over 40 years.
This can be done by the following commands (make sure you understand what is
being done):
generate agegr = .
replace agegr = 1 if (age >= 18) & (age <=25)
replace agegr = 2 if (age >= 26) & (age <=40)
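* Stata treats missing values as larger than any number, so exclude them explicitly
replace agegr = 3 if (age > 40) & !missing(age)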
You may also add a Value label to each numerical category to show what each value
(1, 2, 3) represents, as we did in Exercise 3.
You can check the recoding by making a frequency table as in question B), choosing
age as row variable and agegr as column variable. Alternatively, run the command
tabulate age agegr.
Evaluate the frequency table: Was the coding done as you intended?
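To see why such a check matters, the sketch below applies the recoding to a few made-up ages (not taken from the survey), including the boundary values and one missing value, and tabulates the result. The condition !missing(age) in the last replace is there because Stata treats a missing value as larger than any number, so age > 40 alone would place people with unknown age in group 3. The label name agegrl is hypothetical.

```stata
clear
* Made-up ages for illustration only; the dot is a missing value
input age
18
25
26
40
41
82
.
end

generate agegr = .
replace agegr = 1 if (age >= 18) & (age <= 25)
replace agegr = 2 if (age >= 26) & (age <= 40)
replace agegr = 3 if (age > 40) & !missing(age)

* Optional value labels for the three groups
label define agegrl 1 "18-25" 2 "26-40" 3 "over 40"
label values agegr agegrl

* Every age should fall in exactly one group; the missing age stays missing
tabulate age agegr, missing
```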
An open question for discussion: What do you think about this type of measure?
The discussion can be guided by the following points:
i) What does it mean to have 3.1 cm pain?
ii) If one person ticks 3.1 cm pain and another person ticks 4.5 cm pain,
what can we say about the level of pain in these two persons?
iii) If one person ticks 3.1 cm pain and then the next day ticks 4.5 cm, what
can we say about the change in pain?
iv) Is a reduction from 2 cm to 1 cm pain the same as a reduction from 6 cm
to 5 cm?
What do we think about validity and reliability?
Exercise 6: More descriptive statistics (Friday 10.11 /
Monday 13.11)
A
Open the data set BIRTH.DTA in Stata.
Create a histogram of the children’s birth weight, that is, of the variable bwt. Find
mean and median, as well as the minimum and maximum values. Find the standard
deviation.
Go to Graphics – Histogram and choose the bwt variable. Alternatively, use the
command histogram bwt.
B
Make histograms and find mean and standard deviation for bwt for smoking and
nonsmoking mothers, respectively. Repeat the procedures under A, but by smk (use
the menu or use commands). The commands will be
histogram bwt, by(smk)
and
by smk, sort : summarize bwt, detail
Does it seem like smoking during pregnancy affects the children’s birth weight?
C
Make histograms and find mean and standard deviation for bwt for the different
values of ht. Does it seem like hypertension affects the birth weight? How many of
the mothers had "History of hypertension"?
Exercise 7 (Monday 13.11): Measure of central
tendency
(Solve this exercise both with and without use of Stata)
2,11,4,5,3
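Assuming the task is to find the mean and the median of these five numbers (the heading suggests so, but this is an assumption), the calculation could look like this. By hand: sorted, the values are 2, 3, 4, 5, 11, so the median is the middle value 4, while the mean is (2 + 11 + 4 + 5 + 3) / 5 = 5. In Stata:

```stata
clear
input x
2
11
4
5
3
end

* detail adds percentiles; the 50th percentile is the median
summarize x, detail
display "mean = " r(mean) "   median = " r(p50)
```

The single large value 11 pulls the mean above the median.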
We are interested in the expected birth weight among smokers and non-smokers.
Calculate the mean birth weight with corresponding 95% confidence intervals for
smokers and non-smokers by going to Statistics – Summaries, tables, and tests –
Summary and descriptive statistics – Means. Choose bwt under ‘Variables’, click
if/in/over on top and tick ‘Group over subpopulations’. Include smk as Group
variable. Alternatively, run the command mean bwt, over(smk).
Comment on the confidence intervals. Do you understand how they are calculated
(look at mean and standard error)?
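One way to see where such an interval comes from is to build it by hand for one group: mean ± t × standard error, with standard error = SD / √n. The sketch assumes BIRTH.DTA is in the working directory and that smk == 1 marks the smokers; ci means requires Stata 14 or later. The interval printed by mean bwt, over(smk) can differ slightly in the last digits.

```stata
use "BIRTH.DTA", clear

* Mean, SD and n for the smokers (assumed coding: smk == 1)
summarize bwt if smk == 1
scalar ci_se = r(sd) / sqrt(r(N))
scalar ci_t  = invttail(r(N) - 1, 0.025)
scalar ci_lo = r(mean) - ci_t * ci_se
scalar ci_hi = r(mean) + ci_t * ci_se
display "SE = " ci_se "   95% CI: " ci_lo " to " ci_hi

* The same t-based interval from Stata's own command
ci means bwt if smk == 1
```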
Go to Statistics – Summaries, tables, and tests – Summary and descriptive statistics –
Proportions and move the relevant variable (low) to ‘Variables’ to estimate the
probability of having a baby with low birth weight. Alternatively, run the command
proportion low. Make sure you understand how the confidence interval is calculated
(see formula in today’s lecture).
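As a worked example, the sketch below assumes that the lecture formula is the usual normal approximation, p ± 1.96 × √(p(1 − p)/n); that is an assumption. It uses the counts from the low by smk table shown in this handout (29 + 30 = 59 low birth weight babies among 115 + 74 = 189 mothers).

```stata
* Counts taken from the low-by-smk table in this handout
scalar n_all = 189
scalar n_low = 59

* Estimated proportion and its standard error
scalar p_hat = n_low / n_all
scalar se_p  = sqrt(p_hat * (1 - p_hat) / n_all)

* Normal-approximation 95% confidence limits
scalar ci_lo = p_hat - invnormal(0.975) * se_p
scalar ci_hi = p_hat + invnormal(0.975) * se_p

display "p = " %6.4f p_hat "   SE = " %6.4f se_p
display "95% CI: " %6.4f ci_lo " to " %6.4f ci_hi
```

This gives a proportion of about 0.31 with an interval from roughly 0.25 to 0.38. The interval printed by proportion low can differ somewhat, depending on which method your Stata version uses by default.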
Use the data in BIRTH.DTA to set up a Table 1 for a study where you are interested
in comparing smoking and non-smoking mothers with regard to the risk of low
birthweight babies. The idea is to describe the two groups (smokers and non-smokers)
with regard to a number of other characteristics that might influence the risk of low
birthweight. Use what we have learnt about descriptive statistics to pick relevant
summary measures, and use commands from the previous exercises to compute. You
may use the following table as a starting point:
                                          Smokers (n = )   Non-smokers (n = )
Age
Ethnicity
Hypertension
Weight
No. visits to physician, 1st trimester
A
Is birth weight associated with the mother’s smoking habits? Formulate hypotheses
and perform an independent samples t-test comparing smokers and non-smokers! Try
to find your way through the menu (Start at Statistics – Summaries, tables, and tests –
Classical tests of hypotheses …), or just run the following command:
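A sketch of one command that fits this description, assuming BIRTH.DTA is in the working directory and smk takes exactly two values. The equal-variance t-test is Stata's default; the unequal option does not assume equal variances.

```stata
use "BIRTH.DTA", clear

* Independent samples t-test of birth weight, smokers vs non-smokers
ttest bwt, by(smk)

* Same comparison without assuming equal variances in the two groups
ttest bwt, by(smk) unequal
```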
Formulate a conclusion!
Several aspects are relevant for this discussion. Tip: sample, representativeness,
assumptions in the model, e.g., if the data are normally distributed in each group. You
may have a look at histograms or normal probability plots (run e.g.
Menu: Graphics – Distributional graphs – Normal quantile plot. Choose bwt as
Variable, go to if/in on top and specify ‘smk==1’ or ‘smk==0’.
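The command counterpart of this menu choice is qnorm. A short sketch, assuming BIRTH.DTA is in the working directory and smk is coded 0 and 1 as in the menu instructions:

```stata
use "BIRTH.DTA", clear

* Normal quantile plots of birth weight, one group at a time
qnorm bwt if smk == 1
qnorm bwt if smk == 0

* Histograms with a normal curve, side by side for the two groups
histogram bwt, normal by(smk)
```

Points that follow the straight line in the quantile plot are consistent with a normal distribution in that group.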
e) How do you conclude? Emphasize not only the result of the hypothesis test,
but also the estimated changes and differences.
f) Repeat the same analysis for average pain.
g) How do you conclude about the effect of the educational program, based on
what you have now seen?
A
Does the mean number of doctor visits in the first trimester differ from the mean
number of visits in the third trimester?
Again, you may try to find your way through the menu (you are supposed to run a
paired samples t-test with the variables fvt and ttv), or just run the command:
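A sketch of such a command, assuming BIRTH.DTA is in the working directory. Stata's paired test is written with the two variables on either side of ==:

```stata
use "BIRTH.DTA", clear

* Paired samples t-test: first-trimester vs third-trimester visits
ttest fvt == ttv
```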
What is the conclusion here? Why is it appropriate to use the paired samples test in
this case?
You may also have a look at normality here by creating a new variable that contains
the difference between fvt and ttv:
generate diff = fvt - ttv
A
Make a cross table of the association between low and smk. Does smoking seem to
affect the probability of giving birth to an underweight child (<2500g)?
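Result:

 smoking   |           low
 status    | bwt > 2500  bwt < 2500 |    Total
-----------+------------------------+---------
Non-smoker |        86          29  |      115
           |     74.78       25.22  |   100.00
-----------+------------------------+---------
Smoker     |        44          30  |       74
           |     59.46       40.54  |   100.00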
Why was it appropriate to compute the percentages by row, that is, within each
category of the variable smk, in this case?
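The table can be reproduced from the four counts alone with Stata's immediate command tabi, which is handy for checking percentages by hand. A small sketch using the counts above (on the data file itself, the corresponding command would be tabulate smk low, row):

```stata
* Rows: non-smoker, smoker; columns: not low, low birth weight
tabi 86 29 \ 44 30, row

* The same row percentages computed directly
display "Non-smokers with low birth weight: " %5.2f (100*29/115) " %"
display "Smokers with low birth weight:     " %5.2f (100*30/74) " %"
```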
B
Make a cross table of the association between low and ht.
What seems to be more important in predicting low birth weight - smk or ht?
C
Create a cross table of the association between low and ethn. Comments?
A
To answer this, we will create a cross table in Stata and run a chi-square test:
Choose Statistics – Summaries, tables, and tests – Frequency tables – Two-way tables
with measures of association.
Choose sex under Rows and smoke under Columns. Go to Cell contents and indicate
how you want the percentages to be created in this table. Tick Pearson's chi-squared
under Test statistics.
Conclusion?
B
Are the assumptions behind the chi-square test met? Repeat the procedure above, but
tick Expected frequencies under Cell contents.
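The corresponding command is tabulate with the chi2 and expected options. Since the survey file is not reproduced here, the sketch below shows the mechanics on the low by smk counts from BIRTH.DTA given in this handout; the commented line is what one would presumably run on the survey data, assuming the variables are named sex and smoke as in the text. A common rule of thumb (an assumption here, check the lecture notes) is that all expected frequencies should be at least 5.

```stata
* On the survey data (not run here):
*   tabulate sex smoke, row chi2 expected

* Same options on a 2 x 2 table of counts: non-smoker / smoker by not low / low
tabi 86 29 \ 44 30, chi2 expected
display "chi2 = " %5.2f r(chi2) "   p = " %5.3f r(p)
```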
C
What is the odds ratio of smoking for men vs women? What does this mean?
Calculate also the odds ratio by hand to make sure you understand the calculation.
Notice: This Stata procedure will only work if the relevant variables are coded zero
and one!
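The hand calculation follows the same pattern for any 2 × 2 table: the odds in one group divided by the odds in the other. As an illustration with numbers that appear in this handout, the sketch uses the low by smk table from BIRTH.DTA (not the sex and smoke data of this exercise), so the result is the odds ratio of low birth weight for smokers versus non-smokers.

```stata
* Odds of low birth weight in each group: low / not low
scalar odds_smk   = 30 / 44
scalar odds_nonsm = 29 / 86
scalar or_hand    = odds_smk / odds_nonsm
display "Odds ratio by hand: " %6.3f or_hand

* Immediate command; the four numbers are, in order:
* exposed cases, unexposed cases, exposed noncases, unexposed noncases
cci 30 29 44 86
```

Both give an odds ratio of about 2.02: the odds of a low birth weight baby are roughly twice as high among the smokers in this table.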
Graphics – Twoway graphs. Click ‘Create’, and then you will see that ‘Scatter’ is the
default. You just have to specify the x-variable and the y-variable. After that, click
‘Accept’ and OK.
Alternatively:
twoway (scatter nabopuls minpuls)
Calculate the Pearson correlation coefficient between nabopuls and minpuls. This can
again be done through the menu by Statistics – Summaries, tables, and tests –
Summary and descriptive statistics – Pairwise correlations. Specify the variables in
question and tick 'Print significance level for each entry'.
The easier solution is to run the code: pwcorr minpuls nabopuls, sig.
What is the correlation? Can you conclude that there is good agreement between the
measures?
Exercise 18 (Monday 20.11): Simple linear regression
In the table below, you will find data describing the relationship between age and
blood pressure of 20 healthy adults. Enter the data:
Age BP
20 120
43 128
63 141
26 126
53 134
31 128
58 136
46 132
58 140
70 144
46 128
53 136
70 146
20 124
63 143
43 130
26 124
19 121
31 126
23 123
Find the correlation between age and blood pressure and test if it is significant. See
commands above.
Perform also a linear regression analysis with blood pressure as dependent variable
and age as independent variable.
Statistics – Linear models and related – Linear regression. Specify the dependent (Y)
and the independent (X) variables.
Alternatively, write the command reg BP Age (if you name the variables like this …).
Read off the 95% confidence interval for the regression parameter. Find also the
squared correlation coefficient between age and blood pressure. What does it mean?
What is the expected blood pressure for a person at age 40? For a person at age 80?
Comment.
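Put together, the steps of this exercise could look as follows. This is a sketch that assumes the variable names Age and BP suggested above, and that the first column of the table is age and the second is blood pressure.

```stata
clear
input Age BP
20 120
43 128
63 141
26 126
53 134
31 128
58 136
46 132
58 140
70 144
46 128
53 136
70 146
20 124
63 143
43 130
26 124
19 121
31 126
23 123
end

* Correlation with its p-value
pwcorr Age BP, sig

* Simple linear regression: blood pressure on age
regress BP Age

* Expected blood pressure at age 40 and at age 80
display "Age 40: " %5.1f (_b[_cons] + _b[Age]*40)
display "Age 80: " %5.1f (_b[_cons] + _b[Age]*80)
```

The ages in the data run from 19 to 70, so the value for age 80 is an extrapolation beyond what was observed.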
Exercise 19 (Monday 20.11): Simple linear regression
We will study the association between heart rate (HR) and intake of oxygen (VO2) in
38 persons. The data are given in the file OXYGEN.DTA.
Create a scatter plot of the association between heart rate and intake of oxygen. Use
heart rate (HR) on the X-axis (see commands in Exercise 15).
Compute the regression coefficient for the same association (with VO2 as the
dependent variable). Statistics – Linear models and related – Linear regression.
Specify the dependent (Y) and the independent (X) variables.
Is the association statistically significant? Give also a 95% confidence interval for the
coefficient.
Some of the subjects are wearing a mask, which might influence the measurements.
Create the scatterplot from A over again, but use different markers for those with and
without a mask. Superimpose the regression lines for each of the two groups (with and
without mask) separately. What do you find?
This is a bit tricky in Stata, so we will just give the relevant command (or one version
of it):
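One possible version, as a sketch: it assumes OXYGEN.DTA is in the working directory and that MASK is coded 0 for no mask and 1 for mask (check the coding with tabulate MASK and adjust if needed).

```stata
use "OXYGEN.DTA", clear

* Check how MASK is coded before using it in if-conditions
tabulate MASK

* Separate markers and fitted lines for the two groups, overlaid in one graph
twoway (scatter VO2 HR if MASK==0) (scatter VO2 HR if MASK==1) (lfit VO2 HR if MASK==0) (lfit VO2 HR if MASK==1)
```

Each pair of parentheses is one layer of the graph: two scatter layers for the markers and two lfit layers for the fitted lines.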
Run also the linear regression analysis over again, with MASK as an additional
independent variable. What happens to the estimated effect of HR and its 95%
confidence interval? In the menu, just add MASK to the list of independent variables.
Exercise 20 (Wednesday 22.11): Linear regression
and confounding
In the lectures, we have been working with data from the Oslo health study
(HSCL_EXER.DTA). In these data we have information about physical activity, and
we would like to investigate the association between physical activity and mental
health, as measured by the HSCL10 score. We have collapsed the original physical
activity question into three categories; those who don’t exercise, those who exercise
moderately and those who exercise a lot, coded 1, 2 and 3. This is the variable
NewEx1. In this exercise we will learn how to deal with categorical data in a linear
regression analysis.
We will run a linear regression model with the HSCL10 score as dependent variable
and physical activity as independent variable.
In the menu, do the following: Statistics – Linear models and related – Linear
regression. Specify the dependent variable (Y). For independent variables, click the
small box to the right (with three dots on it). Select ‘Factor variable’ as Type of
variable, and select the variable NewEx1 under Variable 1. Click Add to varlist and
OK.
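Written as a command, this could look like the sketch below. The outcome name hscl10 is hypothetical, since the handout does not give the variable name of the HSCL10 score; replace it with the name used in HSCL_EXER.DTA.

```stata
use "HSCL_EXER.DTA", clear

* i.NewEx1 enters physical activity as a categorical (factor) variable,
* with the lowest category (1) as reference
* (hscl10 is a placeholder name for the HSCL10 score)
regress hscl10 i.NewEx1

* To use another reference category, e.g. category 3:
regress hscl10 ib3.NewEx1
```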
Notice the i. notation that allows Stata to treat the variable NewEx1 as categorical.

      NewEx1 |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
           2 |  -.0674394    .1285768   -0.52    0.601    -.321003    .1861242
           3 |  -.1336821    .1415499   -0.94    0.346   -.4128297    .1454655

What happens is that the lowest category (category 1) of the variable NewEx1 is chosen
as reference category, and the two other categories are compared to this one. Give an
interpretation of the findings. Focus on the estimated coefficients.
We have seen that there are gender differences with regard to physical activity, and
we also know there are gender differences when it comes to the mental health score.
This leads us to think that gender might act as a confounder in the mental health –
physical activity relationship. Add gender to the model and see what happens to the
estimated coefficients.
Exercise 21 (Wednesday 22.11): Non-parametric tests
We will be using NAUSEA.DTA. The data are taken from a small study on patients
receiving chemotherapy. In addition, they were randomized to receive either an active
antiemetic treatment or placebo (data taken from “Practical Statistics for Medical
Research” by D. Altman). The variable Nausea gives measurements of nausea on a 100 mm
self-assessment scale (0 being the best, 100 being the worst; remember our discussions
in Exercise 5).
We are interested in whether there is any difference between the two treatment groups.
First, have a look at the distribution of nausea in the two groups, e.g. by a histogram.
If you do this by the menu (Statistics – Summaries, tables, and tests – Non-parametric
tests of hypotheses – Wilcoxon rank-sum test), you will see that you can also ask for
the probability of values from one group being larger than from the other group. Try
this, and try also to reverse the order of the groups. To achieve this, you will have to
recode your variable. Trt is coded 1 for active treatment and 2 for placebo, so the
simplest way of reversing the order is to recode the placebo group to 0. You can
achieve this by running
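A sketch of commands that fit this description, assuming NAUSEA.DTA is in the working directory and the variables are named Nausea and Trt as in the text:

```stata
use "NAUSEA.DTA", clear

* Distribution of nausea in the two groups
histogram Nausea, by(Trt)

* Wilcoxon rank-sum test; porder adds the probability that a value from
* the first group is larger than a value from the second group
ranksum Nausea, by(Trt) porder

* Reverse the group order by recoding placebo from 2 to 0, then repeat
recode Trt (2 = 0)
ranksum Nausea, by(Trt) porder
```

The test result itself does not depend on the order of the groups; only the reported probability switches direction.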
Again, we use BIRTH.DTA.
In Exercise 14, we used a paired-samples t-test to investigate whether the mean
number of doctor visits in the first trimester differs from the mean number of visits in
the third trimester. The analysis gave a highly significant difference, with a p-value <
0.001. Now, repeat this analysis with a non-parametric test (Statistics – Summaries,
tables, and tests – Non-parametric tests of hypotheses – Wilcoxon matched-pairs
signed-rank test). Place one variable (e.g. fvt) in the ‘Variable’ box and the other
variable (ttv) in the ‘Expression’ box. Alternatively, run
signrank fvt = ttv (notice the similarity with the command in Exercise 14).