Chapter 2

The document provides an overview of Chapter 2, which covers categorical data analysis. It will discuss rates, association, Simpson's Paradox, and confounders over 7 units. The first unit introduces categorical variables and interpreting tables and plots from one categorical variable. An example dataset on kidney stone treatments is presented to demonstrate categorical data analysis techniques.

Uploaded by

kayle1535

CHAPTER 2: Categorical Data Analysis

Welcome to GEA1000. My name is Chong Kai. My colleague, Shi Ming and I will be
bringing you through Chapter 2, which deals with categorical data.

Overview
1A. Rates I
1B. Rates II
2A. Association I
2B. Association II
3A. Simpson’s Paradox I
3B. Simpson’s Paradox II
4. Confounders

Let us go through an overview of this chapter. This chapter will have a total of 7
units. In general, we will be going through rates, association, a phenomenon known
as Simpson’s Paradox, as well as a new concept known as confounders.

Unit 1A: Rates I
By the end of this unit you should be able to
do the following:
1. Identify a categorical variable.
2. Understand and interpret tables and
plots created from 1 categorical variable.

In Unit 1A, we will give a gentle introduction to categorical data.

By the end of the unit, you need to be able to do 2 things.


1. Identify what a categorical variable is.
2. Understand and interpret tables and plots created from 1 categorical variable.

Next, let’s jump into the unit proper.

RECAP

Types of variables

Categorical variables
• Ordinal: Categories come with some natural ordering and numbers are often used to
represent the ordering. E.g.: Happiness level
• Nominal: No intrinsic ordering for the variables. E.g.: Nationality

Numerical variables
• Continuous: One that can take on all possible numerical values in a given range or
interval. E.g.: Time
• Discrete: One where possible values of the variable form a set of numbers with
“gaps”. E.g.: Module credits

As a recap, there are 2 main types of variables, categorical variables and numerical
variables.

Do revise Chapter 1 if you need to.

We will be focusing on categorical variables for this chapter.

THE PROBLEM

Now, suppose you are a doctor in a urology department, and a patient comes in
seeking treatment for kidney stones. To decide which treatment is best for your
patient, you dig out previous records of kidney stone treatments.

A SNAPSHOT OF THE DATA

Size     Gender   Treatment   Outcome
Large    Male     X           Success
Large    Male     X           Success
Small    Male     Y           Success
Large    Male     Y           Failure
Small    Male     X           Success
Large    Male     Y           Success

After asking for the medical records, you were given this dataset to analyse. Note
that there are actually 1050 observations, but only the first 6 are shown here due to
the slide’s capacity.

We will be using this dataset for the rest of this chapter, so let us familiarize
ourselves with it.

In this dataset, we have 4 variables.


1. The size of the kidney stones. They can be either small or large.
2. The gender of the patients. They can be either male or female.
3. The treatment the patient underwent. They took up either Treatment X or
Treatment Y.
4. The outcome of the treatment. The treatment can be either a success or a failure.

APPLYING THE PPDAC CYCLE
• In general, are the treatments helping the patients?
• What to measure: Variable “Outcome” tells us if the treatment was a success or not
• Sort the data
• Plot graphs, tables of the “Outcome” variable

The overall question we want to tackle is a broad one. We want to use the dataset to
give us some insights on how to treat patients better. To help us frame the issue, we
can try applying the PPDAC cycle to this context.

Firstly, think about the problem, or question we want to answer. We can try
narrowing it down to a simple question for a start. For example, a good starting
question would be to ask if these treatments we are giving the patient are successful
in general. This is a simpler question, and breaks down the problem into easier parts.
Remember that the cycle is a continuous process, and we don’t have to tackle all the
questions in one fell swoop. Asking questions, looking at the data, and drawing
conclusions might lead to more questions of their own, perhaps ones that we didn’t
even think about at the start. Looking at the data may reveal interesting trends, and
we may ask more questions about those trends.

Now that we have established the problem, let’s plan on how to answer it. In this
case, we are not conducting the experiment, so we do not have to worry about how
to measure the variable with instruments. Instead, we look at the data that has
already been collected and plan on which variable to focus on. We know from our
data that there is a variable called “Outcome”. This variable tells us whether the
outcome of the treatment is successful or not. Thus, this would be our main focus.

We are now ready to analyse the data. Let’s look at some ways we can generate
tables and charts that will give us useful information.

ANALYSING 1 CATEGORICAL VARIABLE - TABLE

Categories of the “Outcome” variable   Count   Rate                               Percentage
Success                                831     rate(Success) = 831/1050 = 0.791   0.791 × 100% = 79.1%
Failure                                219     rate(Failure) = 219/1050 = 0.209   0.209 × 100% = 20.9%
Total                                  1050    1050/1050 = 1                      1 × 100% = 100%

Looking through the data, we counted 831 “Success”s and 219 “Failure”s. We can
hence conclude that we should generally recommend our patient to go for
treatment, since there are more successful treatments than failed ones.

Alternatively, we can calculate the rate of successful treatments, which is 831/1050
(i.e. successful treatments / all treatments) = 0.791, or 79.1%. Here, we can see that a
majority of the treatments are successful, since rate(Success) > rate(Failure).

We will be using rates for much of this chapter. Intuitively, we can think of a rate as a
fraction, proportion, or a percentage. This is useful for understanding some of its
properties. For example, a rate always lies between 0% and 100% inclusive, that is,
0% ≤ rate(X) ≤ 100%, or 0 ≤ rate(X) ≤ 1.
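
The rate calculation above can be sketched in a few lines of Python. This is only an illustration and not part of the course material; the names `outcome_counts` and `rates_from_counts` are made up for this example.

```python
# Counts of the "Outcome" variable, taken from the table in this unit.
outcome_counts = {"Success": 831, "Failure": 219}


def rates_from_counts(counts):
    """Return each category's rate: its count divided by the total count."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


outcome_rates = rates_from_counts(outcome_counts)

for category, rate in outcome_rates.items():
    # A rate can be reported as a proportion or as a percentage.
    print(f"rate({category}) = {rate:.3f} = {rate:.1%}")
```

Every rate produced this way lies between 0 and 1, and the rates of all categories of one variable add up to 1.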

Analysing 1 categorical variable - Plot
[Figure: Dodged bar plot (left) and stacked bar plot (right) for “Outcome”. x-axis: Outcome; y-axis: Counts. Success: 831; Failure: 219.]

Apart from looking at a table of numbers, we can also use data visualization to help
us arrive at the same conclusion! Since we are dealing with “Outcome”, which is a
categorical variable and we have calculated the counts for “Outcome”, we can
visualize these counts in the form of a bar plot!

The x-axis indicates the variable “Outcome”, whereas the y-axis shows the number of
“Success”s and “Failure”s counted in the variable “Outcome”.

The bar plot on the left is known as a dodged bar plot, and we can see that the levels
“Success” and “Failure” of the “Outcome” variable are beside each other.
The bar plot on the right is known as a stacked bar plot. It is named as such because the
levels “Success” and “Failure” are stacked on top of each other.

From these plots, we can see that there are many more successes than failures.
Hence, we might recommend our patient to go for some form of treatment
based on the best/limited information we have.
Note that we are plotting absolute numbers over here, which is different from what
we are going to plot in the next slide.

Analysing 1 categorical variable - Plot
[Figure: 100% stacked bar plot for “Outcome”. x-axis: Outcome; y-axis: Percentage. Success: 79.1%; Failure: 20.9%.]

Another way to visualize the problem would be to show the proportions of
successes/failures out of the total number of treatments. This gives us a clearer idea
of the difference in proportions. Notice that in this case, the y-axis is different.
We have normalized the y-axis so that the whole bar represents 100%.

Since the levels are stacked, and the values are in terms of percentages, this is also
known as a “100% Stacked bar plot”.

Again, we see that the successful treatments take up a large portion of the bar
plot, and hence, we would recommend that our patient should go for some form of
treatment at the very least.

A side note: The success of the treatment is defined as having the stones eliminated
or reduced significantly. Hence, a failure just means that the stones could not be
eliminated or reduced significantly. Kidney stone treatments cause little morbidity
and mortality in general. Our conclusion might be different if the treatments were
high-stakes in the first place.

Conclusion
Table and bar plots gave us the same conclusion:
• 79% success
• 21% failure
• Should go for treatment

To recap, we used 2 methods to summarize the variable “Outcome”. By drawing a
table and using visualization techniques, we concluded that our patient should go for
treatment.

Summary
We have learned:
• Use of tables and plots to summarize a categorical variable
• Calculation of rates

In summary, this unit is about the exploration of 1 categorical variable. We have
looked at how we can explore the variable “Outcome” by using tables and plots as
forms of summary.

We have also introduced the idea of rates, as a way to compare the successes and
failures. We will continue to use rates as we explore more than 1 categorical variable
in future units.

Unit 1B: Rates II
By the end of this unit you should be able to
do the following:
1. Understand and interpret tables and
plots created from 2 categorical
variables.
2. Calculate marginal, conditional and joint
rates.

In Unit 1B, we will be diving deeper into what we have discussed in Unit 1A.

By the end of the unit, you need to be able to do 2 things.


1. Understand and interpret tables and plots created from 2 categorical variables.
2. Calculate marginal, conditional and joint rates.

Let us jump into the unit proper.

WHICH TREATMENT TO CHOOSE?

Size     Gender   Treatment   Outcome
Large    Male     X           Success
Large    Male     X           Success
Small    Male     Y           Success
Large    Male     Y           Failure
Small    Male     X           Success
Large    Male     Y           Success

Now that we have seen that the treatments are generally successful, we feel the
need to recommend them. However, we are faced with another choice.

Recall that there are actually 2 types of treatment: Treatment X and Treatment Y.
How do we decide which one to recommend? Is one better than the other?

Can we use the data to help us make the decision again?

PPDAC CYCLE – A NEW QUESTION
• Which treatment is better?
• Key variable of interest: Treatment variable
• Sort the data
• Plot graphs, tables of the “Treatment” and “Outcome” variable

In the previous unit, we have discovered that the treatments are generally
successful. That is good news, and we recommend our patient to seek treatment.

Now that we have answered our first simple question, what more can we ask? Well,
naturally, after discovering that the treatments are successful in general, we would
like to ask if there are specific treatments that are working much better than the
rest. Thus, this leads to the next iteration of the PPDAC cycle. It’s a cycle because the
question we started off with gave us a conclusion, and now that conclusion leads us
to think of new questions. The question we want to know now is: Which treatment is
better? This is a natural build up after realising that the treatments are successful in
general.

The planning step is quite similar, just that the variable of interest is now treatment
type.

For our analysis, although we have turned our attention to the “Treatment” variable,
we cannot ignore the variable “Outcome”. This is because we want to know how the
treatment type relates to the outcome. Hence, we are going to be analysing two
categorical variables here: “Treatment” and “Outcome”.

2 x 2 Table

                Outcome
Treatment       Success   Failure   Row Total
X               542       158       700
Y               289       61        350
Column Total    831       219       1050

We will first draw up what is known as a 2x2 contingency table to summarize the
variables “Outcome” and “Treatment”. Traditionally, we would put the dependent
variable, which is “Outcome” in this case, on the columns of the table, and the other
variable “Treatment” on the rows of the table.

Recall that there were 831 successful treatments, of which 542 were from
Treatment X and 289 were from Treatment Y. Focusing on the column
“Success”, we can put 831 under “Column Total”, and 542 and 289 under “X” and “Y”
respectively.
Similarly, there were 219 unsuccessful treatments. 158 of them were from
Treatment X, whereas 61 of them were from Treatment Y.

Focusing on the row totals, we also notice that there are 700 patients who went for
Treatment X, and 350 patients who went for Treatment Y.

Note that the column total values for the Success and Failure columns should add up
to the same value as the sum of the 2 row total values, as they are nothing but the
total number of observations in our dataset.
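
As a rough sketch, the 2x2 table and its totals could be held in a small Python structure. The layout (a dictionary named `table`, keyed by treatment and then by outcome) is an assumption made for this illustration only.

```python
# Cell counts from the 2x2 table: table[treatment][outcome].
table = {
    "X": {"Success": 542, "Failure": 158},
    "Y": {"Success": 289, "Failure": 61},
}

# Row totals: number of patients who underwent each treatment.
row_totals = {treatment: sum(cells.values()) for treatment, cells in table.items()}

# Column totals: number of patients with each outcome.
column_totals = {
    outcome: sum(table[treatment][outcome] for treatment in table)
    for outcome in ("Success", "Failure")
}

grand_total = sum(row_totals.values())

# Row totals and column totals both count every patient exactly once,
# so they must add up to the same number of observations.
assert sum(column_totals.values()) == grand_total

print(row_totals)     # {'X': 700, 'Y': 350}
print(column_totals)  # {'Success': 831, 'Failure': 219}
print(grand_total)    # 1050
```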

Marginal rates / proportions / percentages

                Outcome
Treatment       Success   Failure   Row Total
X               542       158       700
Y               289       61        350
Column Total    831       219       1050

• What proportion of the total number of patients underwent Treatment Y?
• rate(Y) = 350/1050 = 1/3 = 33 1/3%
• What proportion of the total number of patients had a successful treatment?
• rate(Success) = 831/1050 = 0.791 = 79.1%
• Calculations above are called marginal rates / proportions / percentages.

Now, before we jump into analysing our data, we will first look at calculating some
rates.

Suppose I ask you this question: “What proportion of the total number of patients
underwent Treatment Y?”
You can answer that by dividing 350 by the total number of
observations, which is 1050. Hence, rate(Y) = 1/3 or 33 1/3%.

Similarly, suppose I ask you for the rate of successful treatments. You know that you
have to divide 831 by 1050. And now, we know that rate(Success) = 0.791 or
79.1%.

Notice that we used 2 numbers in the margin of the table that relate to just one of
the categorical variables each time. These calculations are therefore known as
marginal rates.
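
A short Python sketch of the two marginal rates follows. It uses only numbers from the margins of the table; the variable names are illustrative, and `Fraction` is used just to show the exact value 1/3.

```python
from fractions import Fraction

# Numbers from the margins of the 2x2 table.
row_totals = {"X": 700, "Y": 350}
column_totals = {"Success": 831, "Failure": 219}
grand_total = 1050

# Marginal rate of Treatment Y: a row total over the grand total.
rate_y = Fraction(row_totals["Y"], grand_total)

# Marginal rate of Success: a column total over the grand total.
rate_success = column_totals["Success"] / grand_total

print(rate_y)                  # 1/3
print(f"{float(rate_y):.1%}")  # 33.3%
print(f"{rate_success:.1%}")   # 79.1%
```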

Conditional rates / proportions / percentages

                Outcome
Treatment       Success   Failure   Row Total
X               542       158       700
Y               289       61        350
Column Total    831       219       1050

• If we focus on patients who underwent Treatment X, what proportion of them had a
successful treatment?
• rate(Success given X) = 542/700 = 0.774 = 77.4%
• Calculation above is known as a conditional proportion / percentage.
• An even shorter way of writing this is to use a vertical bar in place of given:
rate(Success | X)

Next, let’s answer another type of question.

Suppose now that we want to focus on patients who had undergone Treatment X,
and I ask, what proportion of them have had a successful treatment?

We know that we have to zoom in on patients who took Treatment X, and see how
many of them had a successful treatment. Referring back to the table, we see that
there were 700 patients who chose Treatment X, and 542 of them had a successful
treatment.
Hence, the answer we are looking for is 542 / 700 = 0.774 or 77.4%.

For this question, notice that we have to focus on the language. Our starting point is
that the patients have chosen Treatment X. This information sets the condition for
calculating the rate. Once the condition is set, we focus on the Treatment X
population.

The rate we have just calculated is an example of a conditional rate. In general, a
conditional rate is one that is based on a given condition. Here, the given condition is
that the patient chose Treatment X.

Note that instead of writing “Success given X”, we can replace “given” with a vertical
bar and write rate(Success | X). This is just a shorter way of writing given.
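
The idea of restricting attention to one group before dividing can be sketched as a small Python function. The function name `conditional_rate` and the dictionary layout are assumptions for this illustration.

```python
# Cell counts from the 2x2 table: table[treatment][outcome].
table = {
    "X": {"Success": 542, "Failure": 158},
    "Y": {"Success": 289, "Failure": 61},
}


def conditional_rate(table, outcome, treatment):
    """Compute rate(outcome | treatment).

    The condition (the treatment) fixes the row we look at, so the
    denominator is that row's total rather than all observations.
    """
    row = table[treatment]
    return row[outcome] / sum(row.values())


rate_success_given_x = conditional_rate(table, "Success", "X")
print(f"rate(Success | X) = {rate_success_given_x:.3f}")  # 0.774
```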

Joint rates / proportions / percentages

                Outcome
Treatment       Success   Failure   Row Total
X               542       158       700
Y               289       61        350
Column Total    831       219       1050

• What is the proportion of patients who chose Treatment Y and had a failure?
• rate(Y and failure) = 61/1050 = 0.0581 = 5.81%
• NOT a conditional rate.
• Calculation is known as a joint rate / proportion / percentage.

Lastly, what if I ask the following question: “What is the proportion of patients who
chose Treatment Y and had an unsuccessful treatment?”

We can go to the table and immediately identify that there are 61 such cases.
Hence, the rate of patients who chose Treatment Y and had a failure is 61/1050,
which is the same as 0.0581 or 5.81%.

Note that this is NOT a conditional rate, as we are looking at all 1050 observations as
our total.

This calculation is also known as a joint rate.
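
To make the contrast with a conditional rate concrete, here is a minimal Python sketch. Both rates use the same cell count (61), and only the denominator differs. The variable names are illustrative.

```python
# Cell counts from the 2x2 table: table[treatment][outcome].
table = {
    "X": {"Success": 542, "Failure": 158},
    "Y": {"Success": 289, "Failure": 61},
}

grand_total = sum(sum(row.values()) for row in table.values())  # 1050

# Joint rate: patients with Treatment Y AND Failure, out of ALL patients.
joint_y_and_failure = table["Y"]["Failure"] / grand_total

# Conditional rate, for comparison: Failure out of Treatment Y patients only.
failure_given_y = table["Y"]["Failure"] / sum(table["Y"].values())

print(f"rate(Y and Failure) = {joint_y_and_failure:.2%}")  # 5.81%
print(f"rate(Failure | Y)   = {failure_given_y:.1%}")      # 17.4%
```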

Which treatment is better?

                Outcome
Treatment       Success   Failure   Row Total
X               542       158       700
Y               289       61        350
Column Total    831       219       1050

• Treatment X has 542 successful cases.
• Treatment Y has 289 successful cases.
• “We should recommend Treatment X!”?
• More patients choosing Treatment X as compared to Y.

Now, you must be wondering, why did we go through the trouble of understanding
how to calculate rates? Can’t we just look at the absolute numbers and compare?

Fine! Let’s look at absolute numbers then. For Treatment X, there are 542 successful
treatments. For Treatment Y, there are 289 successful treatments. This must mean
that Treatment X is obviously better than Treatment Y!

Before running back to our patient and giving a recommendation for Treatment X,
let’s sit back and observe the table again. Notice how there are a lot more patients
going for Treatment X than Treatment Y? Could Treatment X have more successful
cases simply because more patients have chosen it? Or could it be that
Treatment X is actually better than Treatment Y?

Making it fair!
• Fair comparison
• Given that I pick some treatment, what is the rate of success?
• Compare success rate of Treatments X and Y
• Treatment Y is better!

• rate(Success | X) = 542/700 = 0.774 = 77.4%
• rate(Success | Y) = 289/350 = 0.826 = 82.6%
• For Treatment X, roughly 77 out of 100 patients had a successful treatment.
• For Treatment Y, roughly 83 out of 100 patients had a successful treatment.

You would immediately notice that things are not fair. To make a fair comparison
between Treatment X and Treatment Y, we need some form of normalization. Finding
the rate of success is a form of normalisation.

Now, instead of looking at how many successful cases there are in Treatment X and Y,
we can look at the rate of success in Treatment X and Y.
In other words, what we are interested in are the following:
1. What is the rate of success given that we are looking at Treatment X?
2. What is the rate of success given that we are looking at Treatment Y?

As a recap, these rates are the conditional rates we have introduced previously.

To find the rate of success given treatment X, we first have to focus on the patients
who chose Treatment X. It turns out that out of the 700 patients who chose
Treatment X, 542 of them had a successful treatment. Hence, the answer we are
looking for is 542 / 700, which is 0.774, or 77.4%.

Similarly, to find the rate of success given treatment Y, we have to focus on the
patients who chose Treatment Y. It turns out that out of the 350 patients who chose
Treatment Y, 289 of them had a successful treatment. Hence, the answer we are
looking for is 289 / 350, which is 0.826 or 82.6%.

To help you understand better, we can look at the rates from another angle. The
rates are telling us the following:
For Treatment X, roughly 77 out of 100 patients had a successful treatment. For
Treatment Y, roughly 83 out of 100 patients had a successful treatment.

Refer back to the table from the previous slide to make sure you are familiar with
these calculations.

Since we identified that the rate of success for Treatment Y is higher than the rate of
success for Treatment X, we can now say that Treatment Y is better than Treatment
X. Notice how the conclusion will be different if we used absolute numbers instead?
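
The difference between comparing counts and comparing rates can be seen in a short Python sketch. It is only an illustration; the names `best_by_count` and `best_by_rate` are made up here.

```python
# Cell counts from the 2x2 table: table[treatment][outcome].
table = {
    "X": {"Success": 542, "Failure": 158},
    "Y": {"Success": 289, "Failure": 61},
}


def success_rate(row):
    """rate(Success | treatment) for one row of the table."""
    return row["Success"] / sum(row.values())


for treatment, row in table.items():
    print(
        f"Treatment {treatment}: {row['Success']} successes "
        f"out of {sum(row.values())} patients, "
        f"success rate {success_rate(row):.1%}"
    )

# Comparing absolute counts favours the treatment with more patients.
best_by_count = max(table, key=lambda t: table[t]["Success"])

# Comparing rates normalises for the number of patients per treatment.
best_by_rate = max(table, key=lambda t: success_rate(table[t]))

print(best_by_count)  # X
print(best_by_rate)   # Y
```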

Table with row percentages

                Outcome
Treatment       Success (row %)   Failure (row %)   Row Total (row %)
X               542 (77.4%)       158 (22.6%)       700 (100%)
Y               289 (82.6%)       61 (17.4%)        350 (100%)
Column Total    831 (79.1%)       219 (20.9%)       1050 (100%)

Since we are looking at rates instead of absolute numbers, we can add the
percentages to our existing table. Note that we have calculated the rates /
percentages across the rows, hence, the percentages are also known as row
percentages.
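
As a sketch of how the row percentages could be produced, the Python snippet below divides every cell by its own row total. The tuple layout `(successes, failures)` and the helper name `row_percentages` are assumptions made for this example.

```python
# Each row holds (successes, failures) for one treatment.
counts = {
    "X": (542, 158),
    "Y": (289, 61),
}

# Add the "Column Total" row by summing each column.
counts["Column Total"] = tuple(sum(column) for column in zip(*counts.values()))


def row_percentages(row):
    """Express each cell as a percentage of its own row total."""
    total = sum(row)
    return [100 * value / total for value in row]


for label, row in counts.items():
    success_pct, failure_pct = row_percentages(row)
    print(
        f"{label:<13}"
        f"{row[0]:>5} ({success_pct:.1f}%)  "
        f"{row[1]:>4} ({failure_pct:.1f}%)  "
        f"{sum(row):>5} (100%)"
    )
```

Because each row is divided by its own total, the percentages in every row add up to 100%.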

Analysing 2 categorical variables - plot
[Figure: Dodged bar plot (left) and stacked bar plot (right) for “Outcome” by “Treatment”. x-axis: Treatment (X, Y); y-axis: Counts. Treatment X: 542 Success, 158 Failure. Treatment Y: 289 Success, 61 Failure.]

Lastly, we will be looking at some plots to end off the unit.

We can look at dodged and stacked bar plots for the counts of “Outcome”, broken
down by Treatment. These numbers are the ones from the 2 x 2 table you have
looked at previously.
Again, we can see that both bar plots tell us that there are a lot more successful
treatments in Treatment X than in Treatment Y. However, it is also easy to see that
these 2 treatments have very different numbers of patients to begin with.

Analysing 2 categorical variables - plot
[Figure: 100% stacked bar plot for “Outcome” by “Treatment”. x-axis: Treatment (X, Y); y-axis: Percentage. Treatment X: 77.4% Success, 22.6% Failure. Treatment Y: 82.6% Success, 17.4% Failure.]

Instead of plotting the absolute numbers, we can plot the rate of success for each
treatment. From the 100% stacked bar plot, we can immediately see that Treatment
Y has a higher rate of success as compared to Treatment X.

As a recap, we should be using rates to compare. Both the table and the 100%
stacked bar plot tell us that Treatment Y is a better treatment.

A more precise way of saying this is that Treatment Y is positively associated with the
success of the treatment.

We will discuss more on association in the next unit.

Summary
We have learnt how to analyse 2 categorical variables from the perspective of:
• Tables – 2x2 table
• Plots – Bar plots / 100% stacked bar plots

In summary, we have learned how to analyse 2 categorical variables.

Looking at tables and plots makes it easier for us to see the differences
between the categories.
These have led us to conclude that there is a relationship between the outcome and
the treatment type.

Looking back, we have utilised these tools to help us answer our questions and even
ask new ones, and this is in line with the PPDAC cycle.

Unit 2A: Association I
By the end of this unit, you should be able to do the following:
1. Understand and apply association
2. Understand and apply symmetry rule

Let us look at Unit 2A.

By the end of the unit, you should be able to

1. Understand and apply association.
2. Understand and apply the symmetry rule.

Used rates to conclude that Treatment Y is better than Treatment X.
• Relationship between the type of treatment and the outcome of the treatment
• “Treatment Y is positively associated with the success of the treatment.” Tend to see
Treatment Y and successful treatments go hand in hand.
• “Treatment X is negatively associated with the success of the treatment.” Tend to see
Treatment X and unsuccessful treatments go hand in hand.

Caution: Association, not causation!
• Associative relationship between the 2 variables
• Not sure if success of treatment is due to the treatment or not

Continuation from Unit 1

Suppose we guess that the type of treatment involved does not affect the outcome
of the treatment. In this case, we can say that the type of treatment is not related to
the outcome of the treatment.

However, this was not the case as we have concluded in Unit 1. We mentioned that
we see a higher success rate with Treatment Y than with Treatment X. Because of
this difference, we say that there is a relationship between the type of treatment and
the outcome of the treatment.

More specifically, we say that Treatment Y is positively associated with the success of
the treatment. This means that we tend to see that Treatment Y and successful
treatments go hand in hand.
On the other hand, we say that Treatment X is negatively associated with the success
of the treatment. This is because we tend to see that Treatment X and failed
treatments go hand in hand.

Note that we are using the word “associated” over here. We do not actually know if
the outcome of the treatment was entirely due to the treatment. This is because the
data was derived from an observational study. Hence, it might be erroneous for us to
say that the type of treatment and the outcome of the treatment have a causal
relationship. Moving forward, it is important to note that we will be focusing on
associative relationships rather than causal ones.

Is there an association?
Suppose we have A and B as characteristics in a population. We shall assume that some
people have A, and some do not have A (labelled as NA). We assume the same about B.

Association absent
• rate(A | B) = rate(A | NB)
• Rate of A is not affected by the presence or absence of B.
• A and B are not associated.

Association present
• rate(A | B) ≠ rate(A | NB)
• rate(A | B) > rate(A | NB): Presence of A when B is present is stronger than when B is
absent. Positive association between A and B.
• rate(A | B) < rate(A | NB): Presence of A when B is present is weaker than when B is
absent. Negative association between A and B.

Now, let’s find out how to identify an association.

Suppose we have 2 characteristics in a population, A and B. We shall assume that
there are some people with A and some people without A (which is denoted by NA).
In other words, the population can only be split into 2 groups with reference to A.
The same is applied to B.

If the rate of A given B is the same as the rate of A given NB, then it means that the
rate of A is not affected by the presence or absence of B. Hence, there is no
association between A and B.

However, if the rate of A given B is not the same as the rate of A given NB, then there
are 2 situations we can break this down into.

1. The rate of A given B is more than the rate of A given NB. This means that the
presence of A when B is present, is stronger than when B is absent. Hence, we
say that there is a positive association between A and B.
2. The rate of A given B is less than the rate of A given NB. This means that the
presence of A when B is present, is weaker than when B is absent. Hence, we say
that there is a negative association between A and B.
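
The three cases above can be sketched as a small Python function. The function name `association` and the use of exact fractions (to avoid rounding issues when testing for equality) are choices made for this illustration, not part of the course material.

```python
from fractions import Fraction


def association(rate_a_given_b, rate_a_given_nb):
    """Classify the association between A and B from two conditional rates."""
    if rate_a_given_b > rate_a_given_nb:
        return "positive"
    if rate_a_given_b < rate_a_given_nb:
        return "negative"
    return "none"


# Example with the kidney stone data, taking A = Success and B = Treatment X,
# so that NB = Treatment Y.
success_given_x = Fraction(542, 700)
success_given_y = Fraction(289, 350)

print(association(success_given_x, success_given_y))  # negative

# Taking B = Treatment Y instead swaps the two rates.
print(association(success_given_y, success_given_x))  # positive

# Equal rates mean no association.
print(association(Fraction(1, 2), Fraction(2, 4)))     # none
```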

Linking back to our data set

Checking for association between 2 variables
• Outcome of treatment
• A: Success
• NA: Failure
• Treatment
• B: Treatment X
• NB: Treatment Y

Compare
• rate(A | B) = rate(Success | X) = 0.774
• rate(A | NB) = rate(Success | Y) = 0.826

Conclusion
• rate(A | B) < rate(A | NB)
• Presence of A is weaker when B is present.
• Lower rate of successful treatments when we see Treatment X: Treatment X
is negatively associated with a successful treatment.
• Higher rate of successful treatments when we see Treatment Y: Treatment Y
is positively associated with a successful treatment.

Now, that was quite a bit of information to take in, so let’s link back to our dataset on
hand so that you can digest the information.

Recall that the 2 variables we were interested in were the outcome of the treatment
and the treatment type. With reference to the outcome of the treatment, you can
split the patients into 2 groups. Group A, which is the group of patients with
successful treatments, and Group NA, which is the group of patients with
unsuccessful treatments. Similarly, we can split the variable “Treatment” into 2
groups. Group B, which is the group of patients who chose Treatment X, and Group
NB, which is the group of patients who chose Treatment Y.

Next, let’s compute and compare the rate of A given B and the rate of A given NB.
This means that we want to calculate the rate of successful treatments given
Treatment X, and the rate of successful treatments given Treatment Y. From previous
calculations, we know that the rates are 0.774 and 0.826 respectively.

Notice that the rate of A given B is smaller than the rate of A given NB. Hence, the
presence of A when B is present is weaker than when B is absent. This means that
the rate of successful treatments is lower for Treatment X than for
Treatment Y. Hence, we say that Treatment X is negatively associated with a successful
treatment.

In the same vein, since the rate of successful treatments is higher for
Treatment Y than for Treatment X, we say that Treatment Y is positively
associated with a successful treatment.

2 rules that govern rates
Suppose we have A and B as characteristics in a population.
We shall assume that some people have A, and some do not
have A (labelled as NA). We assume the same about B.
• Symmetry rule
• Basic rule on rates (to be discussed in Unit 2B: Association II)

Now, we shall look at 2 rules regarding rates. Again, we have 2 population
characteristics A and B. It is assumed that some people have A, and some people do
not have A (labelled as NA). Similarly, some people have B, and some people do not
have B (labelled as NB).

The first rule we will be looking at is known as the symmetry rule.
The second rule is known as the basic rule on rates. This will be discussed in Unit 2B.

We will look at the symmetry rule in the coming slides.

Symmetry Rule

rate(A | B) > rate(A | NB) ⇔ rate(B | A) > rate(B | NA).

rate(A | B) < rate(A | NB) ⇔ rate(B | A) < rate(B | NA).

rate(A | B) = rate(A | NB) ⇔ rate(B | A) = rate(B | NA).

The symmetry rule has 3 parts to it.

Firstly, it states that the rate of A given B is more than the rate of A given NB if and
only if the rate of B given A is more than the rate of B given NA.

Secondly, the rate of A given B is less than the rate of A given NB if and only if the
rate of B given A is less than the rate of B given NA.

Lastly, the rate of A given B is equal to the rate of A given NB if and only if the rate of
B given A is equal to the rate of B given NA.

We represent “if and only if” with the bidirectional arrow ⇔. It means that the two
relationships hold together. So, showing that one side is true implies that the other
side is also true. Let us go through this in more detail in the coming slides.
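
As a numerical check of the symmetry rule on the kidney stone data (one example, not a proof), the Python sketch below takes A = Success and B = Treatment X and compares both pairs of rates. The variable names and the `compare` helper are illustrative assumptions.

```python
from fractions import Fraction

# Cell counts, with A = Success, NA = Failure, B = Treatment X, NB = Treatment Y.
a_and_b, na_and_b = 542, 158    # Treatment X: successes, failures
a_and_nb, na_and_nb = 289, 61   # Treatment Y: successes, failures

# Condition on the treatment (rows of the table).
rate_a_given_b = Fraction(a_and_b, a_and_b + na_and_b)       # 542/700
rate_a_given_nb = Fraction(a_and_nb, a_and_nb + na_and_nb)   # 289/350

# Condition on the outcome (columns of the table).
rate_b_given_a = Fraction(a_and_b, a_and_b + a_and_nb)       # 542/831
rate_b_given_na = Fraction(na_and_b, na_and_b + na_and_nb)   # 158/219


def compare(left, right):
    """Return the comparison sign between two rates."""
    if left < right:
        return "<"
    if left > right:
        return ">"
    return "="


# The symmetry rule says both comparisons must have the same sign.
print("rate(A | B)", compare(rate_a_given_b, rate_a_given_nb), "rate(A | NB)")
print("rate(B | A)", compare(rate_b_given_a, rate_b_given_na), "rate(B | NA)")
```

Here both comparisons come out as “<”, which matches the second part of the rule.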

rate(A | B) > rate(A | NB) ⇔ rate(B | A) > rate(B | NA)

1. rate(A | B) > rate(A | NB) → rate(B | A) > rate(B | NA)
2. rate(B | A) > rate(B | NA) → rate(A | B) > rate(A | NB)

Let us use the first statement as an example, namely, the rate of A given B is more
than the rate of A given NB if and only if the rate of B given A is more than the rate of
B given NA.

With the use of the phrase “if and only if” or the bidirectional arrow, we can split the
statements into 2 equations.

The 1st equation reads: If the rate of A given B is more than the rate of A given NB,
then it implies that the rate of B given A is more than the rate of B given NA.
The 2nd equation reads: If the rate of B given A is more than the rate of B given NA,
then it implies that the rate of A given B is more than the rate of A given NB.

Note that equation 1 is just reading the original statement forward, while equation 2
is just reading the original statement backward.

We will take a look at equation 1 first.

rate(A | B) > rate(A | NB) → rate(B | A) > rate(B | NA)

1. Rate of A given B is more than rate of A given NB.
2. Positive association between A and B.
3. More likely to see A when B is present as compared to when B is absent.
4. Also more likely to see B when A is present as compared to when A is absent.
5. Rate of B given A is more than rate of B given NA.

What the 1st equation is saying is this: Suppose we know that the rate of A given B is
more than the rate of A given NB, then we can also say that the rate of B given A is
more than the rate of B given NA.

But why can we say that? Let us go through an intuitive explanation.

If we know that the rate of A given B is more than the rate of A given NB, then we
are saying that there is a positive association between A and B. Recall that this
means that we are more likely to see A when B is present as compared to when B is
absent.

In other words, this also means that we are more likely to see B when A is present,
as compared to when A is absent. Hence, the rate of B given A should be more than
the rate of B given NA. Again, this is nothing but saying that A and B are positively
associated.

Next, we are going to look at equation 2.

rate(B | A) > rate(B | NA) → rate(A | B) > rate(A | NB)

1. Rate of B given A is more than rate of B given NA.
2. Positive association between B and A.
3. More likely to see B when A is present as compared to when A is absent.
4. Also more likely to see A when B is present as compared to when B is absent.
5. Rate of A given B is more than rate of A given NB.

What the 2nd equation is saying is this: Suppose we know that the rate of B given A is
more than the rate of B given NA, then we can also say that the rate of A given B is
more than the rate of A given NB.

Again, the explanation to why we can say this is similar.

If we know that the rate of B given A is more than the rate of B given NA, then we
are saying that there is a positive association between B and A. Recall that this
means that we are more likely to see B when A is present as compared to when A is
absent.

In other words, this also means that we are more likely to see A when B is present,
as compared to when B is absent. Hence, the rate of A given B should be more than
the rate of A given NB. Again, this is nothing but saying that A and B are positively
associated.

1. rate(A | B) > rate(A | NB) → rate(B | A) > rate(B | NA)

2. rate(B | A) > rate(B | NA) → rate(A | B) > rate(A | NB)

rate(A | B) > rate(A | NB) ⇔ rate(B | A) > rate(B | NA)

Since the statement can be read from left to right (Equation 1) and from right to left
(Equation 2), we really only need the original statement and the bidirectional arrow.

To summarise, what we are saying is this:


1. If the rate of A given B is more than the rate of A given NB, then we know that
the rate of B given A is more than the rate of B given NA.
2. If the rate of B given A is more than the rate of B given NA, then we know that
the rate of A given B is more than the rate of A given NB.

This is the symmetry rule. Do go through the statement for “less than” and the
statement for “equal to”, and make sure that they make sense to you.
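One way to convince yourself that all three parts of the symmetry rule hold is to test them on many 2x2 tables of counts. The sketch below is not part of the original unit: the cell names a, b, c, d and the helper functions are assumptions made purely for illustration. It checks every table whose four counts lie between 1 and 8 and confirms that the two comparisons always point the same way.

```python
from fractions import Fraction
from itertools import product


def compare(x, y):
    """Return '>', '<' or '=' depending on how x compares with y."""
    if x > y:
        return ">"
    if x < y:
        return "<"
    return "="


def symmetry_holds(a, b, c, d):
    """Check the symmetry rule for one 2x2 table of counts.

    Hypothetical cell layout (not from the slides):
        a = A and B,    b = A and NB,
        c = NA and B,   d = NA and NB.
    """
    rate_a_given_b = Fraction(a, a + c)
    rate_a_given_nb = Fraction(b, b + d)
    rate_b_given_a = Fraction(a, a + b)
    rate_b_given_na = Fraction(c, c + d)
    left = compare(rate_a_given_b, rate_a_given_nb)
    right = compare(rate_b_given_a, rate_b_given_na)
    return left == right


# Try every table whose four counts are between 1 and 8.
all_agree = all(symmetry_holds(a, b, c, d)
                for a, b, c, d in product(range(1, 9), repeat=4))
print(all_agree)
```

The reason the check never fails is that both comparisons reduce to the same question: whether a × d is more than, less than, or equal to b × c.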

Consequence of the symmetry rule
To identify if there is any association, check for either:
1. rate(A | B) ≠ rate(A | NB) OR
2. rate(B | A) ≠ rate(B | NA)
rate(Success | X) < rate(Success | Y):
Negative association between successful treatments and Treatment X

Check:
rate(X | Success) < rate(X | Failure)

Next, let’s dive deeper into the consequence of the symmetry rule.

Recall that to identify association, we mentioned that we want to check if the rate of
A given B is different from the rate of A given NB. If they are different, then there is
an association.

However, since we have the symmetry rule, we can also check if the rate of B given A
is different from the rate of B given NA.

Hence, to check for association, we can check that the rate of A given B is not the
same as the rate of A given NB, OR that the rate of B given A is not the same as the
rate of B given NA.

From our example, we calculated that the rate of successful treatments given
treatment X is less than the rate of successful treatments given Treatment Y. Hence,
we know that there is a negative association between successful treatments and
Treatment X.

Now, it is your turn to check if the rate of Treatment X given successful treatments is
indeed less than the rate of Treatment X given unsuccessful treatments. You should
also end up with the same conclusion that successful treatments and Treatment X
are negatively associated.
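If you would like to confirm your answer, the check can be sketched in a few lines of Python. The counts are the successes and failures per treatment from the kidney stone table used in this chapter; the variable names are chosen here for illustration only.

```python
# Counts taken from the kidney stone table used in this chapter.
success = {"X": 542, "Y": 289}
failure = {"X": 158, "Y": 61}

# rate(X | Success): among successful treatments, the share that used X.
rate_x_given_success = success["X"] / (success["X"] + success["Y"])
# rate(X | Failure): among unsuccessful treatments, the share that used X.
rate_x_given_failure = failure["X"] / (failure["X"] + failure["Y"])

print(round(rate_x_given_success, 3))               # 0.652
print(round(rate_x_given_failure, 3))               # 0.721
print(rate_x_given_success < rate_x_given_failure)  # True
```

Since rate(X | Success) is less than rate(X | Failure), this second pair of rates agrees with the first: successful treatments and Treatment X are negatively associated.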

Summary
We have learned:
• How to identify association
• Symmetry rule and its consequence on identifying association

To summarise, we have first talked about associations. There are three types:
positive, negative, and no association. We can identify each scenario, by comparing a
pair of rates.

Next, we have discussed the symmetry rule. It gives us flexibility when
identifying association: we can compare either rate(A | B) with rate(A | NB), or
rate(B | A) with rate(B | NA), and reach the same conclusion.

Identifying association is an important step in understanding how the variables in our
dataset are related to one another. This is an important tool when we analyse any
data set in the future.

Unit 2B: Association II

By the end of this unit, you should be able to do the following:
1. Understand and apply the basic rule on rates.

Let us look at Unit 2B.

By the end of the unit, you should be able to

1. Understand and apply the basic rule on rates

BASIC RULE ON RATES
The overall rate(A) will always lie between
rate(A | B) and rate(A | NB).

The second rule we are going to look at is the basic rule on rates.

This rule dictates that the overall rate(A) will always lie between the 2 conditional
rates, rate of A given B and rate of A given NB.

Do refer back to the start of Unit 2A if you have forgotten what the notations stand
for.

Consequences of the basic rule on rates
1. The closer rate(B) is to 100%, the closer rate(A) is to
rate(A | B).
2. If rate(B) = 50%, then
rate(A) = [rate(A | B) + rate(A | NB)] / 2.
3. If rate(A | B) = rate(A | NB), then
rate(A) = rate(A | B) = rate(A | NB).

Let’s move on to the consequences of the basic rule on rates.

The first consequence is that the closer the rate of B is to 100%, the closer the rate
of A is to the rate of A given B.

The second consequence is that if the rate of B is 50%, then the rate of A is exactly
halfway between the rate of A given B and the rate of A given NB.

Lastly, if the rate of A given B is the same as the rate of A given NB, then the rate of A
is also the same as these 2 rates.

The rule and its consequences may not be immediately intuitive, so
let’s look at some analogies in the coming slides.
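Before the analogies, it may help to see why the rule works. The slides do not state a formula, so treat the following as an assumption of this sketch: the overall rate can be written as a weighted average, rate(A) = rate(B) × rate(A | B) + rate(NB) × rate(A | NB), with rate(NB) = 1 − rate(B). The function name and the example rates below are hypothetical and chosen only for illustration.

```python
def overall_rate(rate_b, rate_a_given_b, rate_a_given_nb):
    """Overall rate(A) as a weighted average of the two conditional rates.

    Assumption for this sketch: rate(NB) = 1 - rate(B), i.e. every unit is
    either in B or in NB.
    """
    return rate_b * rate_a_given_b + (1 - rate_b) * rate_a_given_nb


# Hypothetical conditional rates, chosen only for illustration.
high, low = 0.9, 0.2

# Basic rule: the overall rate lies between the two conditional rates.
mixed = overall_rate(0.3, high, low)
print(low <= mixed <= high)

# Consequence 1: the closer rate(B) is to 100%, the closer rate(A) is to rate(A | B).
gaps = [abs(overall_rate(rb, high, low) - high) for rb in (0.5, 0.7, 0.9, 0.99)]
print(gaps == sorted(gaps, reverse=True))

# Consequence 2: rate(B) = 50% gives the midpoint of the two conditional rates.
print(abs(overall_rate(0.5, high, low) - (high + low) / 2) < 1e-12)

# Consequence 3: equal conditional rates give the same overall rate.
print(abs(overall_rate(0.3, low, low) - low) < 1e-12)
```

Each line prints True. Because the two weights are non-negative and add up to 1, the result can never fall outside the range spanned by the two conditional rates.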

1. The closer rate(B) is to 100%,
the closer rate(A) is to rate(A | B).
• 2 cups of bubble tea
• Let A be the level of sweetness
• Represented by the colour “Green” in the cup.
• Let B / NB be the cups: “Cup 1” vs. “Cup 2”

Cup 1
Size: Large cup
Sweetness: 90%

Cup 2
Size: Small cup
Sweetness: 20%

Final cup
Size: Cup 1 + Cup 2
Sweetness: In between 20% and 90%, but closer to Cup 1

We will first look at the rule and the 1st consequence.

Suppose we have 2 cups of bubble tea and there are 2 variables we are interested in.
One being the level of sweetness, which we can represent as A, and the other being
the cup we are looking at. B represents “Cup 1” and NB represents “Cup 2”.

Let’s say Cup 1 is a large cup with 90% sweetness, and Cup 2 is a small cup with 20%
sweetness. Do take note that these cups are filled to the brim and the colour green
represents the sugar level or the level of sweetness of the cup. It is not an indication
of how full the cup is.

Next, we are going to mix these 2 cups together, and we know that the size of the
final cup on the right is basically the size of Cup 1 plus Cup 2. What about the overall
sweetness then?

Intuitively, you should be guessing that the overall sweetness has to be between 20%
and 90%, and that the overall sweetness is going to be closer to the sweetness of
Cup 1, since it takes up a majority of the final cup.

1. The closer rate(B) is to 100%,
the closer rate(A) is to rate(A | B).

Sweetness in the final cup is between Sweetness | Cup 1 and Sweetness | Cup 2.
→ Overall rate(A) to be between rate(A | B) and rate(A | NB).

Cup 1 takes up most of the final cup. Expect sweetness of the final cup to be nearer
to the sweetness of Cup 1.
→ Overall rate(A) to be closer to rate(A | B) if B takes up a majority of the overall.

To reiterate, the overall sweetness in the final cup is going to be between the
sweetness from Cup 1 and the sweetness from Cup 2.

Since Cup 1 is larger and takes up most of the final cup, the overall sweetness of the
final cup should be closer to the sweetness of Cup 1.

To generalize, all we are saying is that the overall rate of A should always be in
between the conditional rates: rate of A given B and rate of A given NB.
Additionally, we also learn that the overall rate of A will be closer to rate of A given B,
if B is indeed the majority.

2. If rate(B) = 50%, then
rate(A) = [rate(A | B) + rate(A | NB)] / 2

Cup 1
Size: Small cup
Sweetness: 20%

Cup 2
Size: Small cup
Sweetness: 90%

Final cup
Size: Cup 1 + Cup 2
Sweetness: Exactly in between 20% and 90% = (20% + 90%) / 2 = 55%

Next, we will talk about the 2nd consequence.

Similarly, we will be mixing 2 cups, but this time round, we will be mixing 2 small
cups. Since the sizes of the cups are the same, we would expect that the overall
sweetness in the final cup should be exactly halfway between the sweetness of the 2
small cups. This happens because Cup 1, which is represented by B, takes up 50% of the
final cup. Likewise, Cup 2, which is represented by NB, takes up 50% of the final cup.

Note that you can replace the 2 small cups with 2 large cups, and the result will still
hold. What is important here is that the sizes of the cups are the same. In other
words, the 2 cups must take up an equal proportion of the final cup for this
consequence to take effect.

3. If rate(A | B) = rate(A | NB), then
rate(A) = rate(A | B) = rate(A | NB).

Cup 1
Size: Small / Large cup
Sweetness: 20%

Cup 2
Size: Small / Large cup
Sweetness: 20%

Final cup
Size: 2 cups added together
Sweetness: Exactly 20%

Lastly, let’s look at the 3rd consequence.

We have the last scenario, where the sizes of the 2 cups can be different, but the
sweetness in these 2 cups is the same. Then, adding them together will not change
the overall level of sweetness at all.

For example, if I have a small cup with 20% level of sweetness, and I mix it with a
large cup with 20% level of sweetness, then the overall sweetness in the final cup
will remain at 20%.

Linking back to Consequences 2 and 3
• If cups are of the same size, the sweetness of the final cup will be exactly
halfway between the sweetness levels of the original cups.
• If rate(B) = 50%, overall rate of A will be exactly in between the rate of A given B
and the rate of A given NB.
• If sweetness is the same for both cups, the sweetness of the final cup will also be
the same, regardless of the sizes of the original cups.
• If rate(A | B) = rate(A | NB), then rate(A) is the same as the 2 rates.

You should now find consequences (2) and (3) understandable.

Let’s go through the 2nd consequence again. If the cups are of the same size, then the
level of sweetness after mixing will be exactly the average of the two original levels
of sweetness, that is, half of their sum.
Generalizing this, we now know that if the rate of B is 50%, then the overall rate of A
will be exactly in between the rate of A given B and the rate of A given NB.

For the last consequence, note that the level of sweetness must be the same for
both cups. It does not matter what the sizes of the original cups are.
Hence, even if we mix 2 cups of different sizes together, the level of sweetness
should not change.
Generalizing this, we have the following: if the rate of A given B is the same as the
rate of A given NB, then the overall rate of A is the same as the 2 conditional rates.
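The three bubble tea scenarios can be sketched with a small calculation. The slides only describe the cups as "large" and "small", so the volumes of 700 ml and 300 ml below are hypothetical, as is the function name. The sketch assumes that sweetness mixes in proportion to volume.

```python
def mixed_sweetness(volume_1, sweetness_1, volume_2, sweetness_2):
    """Sweetness (in %) after pouring two full cups into one final cup."""
    total_volume = volume_1 + volume_2
    return (volume_1 * sweetness_1 + volume_2 * sweetness_2) / total_volume


# Hypothetical cup volumes in ml; the slides only say "large" and "small".
LARGE, SMALL = 700, 300

# Consequence 1: large cup at 90%, small cup at 20%.
print(mixed_sweetness(LARGE, 90, SMALL, 20))   # 69.0

# Consequence 2: two small cups at 20% and 90%.
print(mixed_sweetness(SMALL, 20, SMALL, 90))   # 55.0

# Consequence 3: different sizes, same sweetness of 20%.
print(mixed_sweetness(SMALL, 20, LARGE, 20))   # 20.0
```

With these assumed volumes, the first mix lands at 69%, which is between 20% and 90% and closer to the large cup. The other two results, 55% and 20%, do not depend on the volumes chosen, as long as the cups in the second scenario are the same size.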

Linking back to data set at hand

Overall rate of successful treatments
• rate(Success) = 0.79

Groups: Treatment X and Treatment Y
• rate(Success | X) = 0.774
• rate(Success | Y) = 0.826
• rate(Success) in between the conditional rates

Overall rate of success closer to rate(Success | X)
• Treatment X takes up a majority of the treatments.
• rate(X) = 700 / 1050 = 0.667 = 66 2/3 %
• Follows consequence (1)

Before closing off the unit, let’s look back to our kidney stones data set. Recall that
we are interested in the overall rate of successful treatments and it worked out to be
0.79.

We then split the population into 2 groups, namely, patients who went for Treatment
X, and patients who went for Treatment Y. After some calculation, we know that the
rate of successful treatments in the respective groups are 0.774 and 0.826. Notice
that the overall rate of successful treatments is indeed in between the 2 conditional
rates.

Additionally, notice that the overall rate of successful treatments is nearer to the rate
of successful treatments given that the patients have chosen Treatment X. This is
because Treatment X takes up a majority of the treatments. To be precise, it took up
66 2/3% of all the treatments.

Hence, the above shows that the basic rule on rates is indeed working.
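The numbers on this slide can be reproduced from the counts in the kidney stone data set. The sketch below also recomputes the overall rate as a weighted average of the two conditional rates; that weighted-average form is an assumption of this sketch rather than a formula stated on the slides, and the variable names are illustrative.

```python
# Counts from the kidney stone data set: successes and totals per treatment.
success = {"X": 542, "Y": 289}
total = {"X": 700, "Y": 350}

rate_x = total["X"] / (total["X"] + total["Y"])   # share of treatments that were X
rate_success_x = success["X"] / total["X"]
rate_success_y = success["Y"] / total["Y"]
rate_success = (success["X"] + success["Y"]) / (total["X"] + total["Y"])

# Weighted average of the two conditional rates (an assumption of this sketch,
# not a formula stated on the slides).
weighted = rate_x * rate_success_x + (1 - rate_x) * rate_success_y

print(round(rate_success_x, 3), round(rate_success_y, 3))  # 0.774 0.826
print(round(rate_success, 3), round(weighted, 3))          # 0.791 0.791
print(rate_success_x < rate_success < rate_success_y)      # True
```

The overall rate sits between the two conditional rates and is nearer to rate(Success | X), in line with Treatment X making up about two thirds of all treatments.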

Summary
We have learned:
• What is the basic rule on rates
• The consequences of basic rule on rates

In this unit, we have learnt about the basic rule on rates, an interesting mathematical
relationship between overall and conditional rates.

This is useful because it allows us to say something about the conditional rates when
we have information on the overall rate, and vice versa.

Often, we might not have all the information we want. Hence, making best use of
the limited information we have is an important skill when analysing the data.

Unit 3A:
Simpson’s Paradox I
By the end of this unit you should be able to
do the following:
1. Identify Simpson’s paradox
2. Analyse it using the slicing method

This unit is about an interesting phenomenon known as the Yule-Simpson paradox,
or simply Simpson’s paradox. We will use our example to demonstrate this
phenomenon.

By the end of this unit, you should be able to do the following:

1. Identify a Simpson’s paradox
2. Analyse it using the slicing method

PPDAC CYCLE – A RECAP

Are the treatments helping?
Yes. In general, there is a high rate of success.

More specifically, which treatment is better?
Treatment Y is positively associated to success rate.

ANYTHING ELSE?

Let’s recap on what we have achieved so far. Remember that we are role playing as
doctors in the urology department, and our main research question is that we want
to know which treatment, X or Y, we should be giving to the patients.

So far, we have made use of the PPDAC cycle to frame our approach. We have gone
through several rounds of this cycle, starting with a question which leads to a
conclusion, which leads us to ask a further question. We started off by looking at the
data set and found out that as a whole, the treatments are successful. After that, we
also found out that Treatment Y is positively associated with success, which
suggests that Treatment Y is better than Treatment X.

With that conclusion, are we done with our analysis? We have come a long way in
answering our research question and it sounds as if we have the answer we want.
Should we start sending all our patients for treatment Y? Before we do that however,
maybe it’s a good idea to take a look once more at the data just to see if there’s
anything we missed, that might be useful.

WHAT ABOUT OTHER VARIABLES?

Size    Gender   Treatment   Outcome
Large   Male     X           Success
Large   Male     X           Success
Small   Male     Y           Success
Large   Male     Y           Failure
Small   Male     X           Success
Large   Male     Y           Success

Looking at the dataset, we notice there are a few more variables such as gender and
stone size which we have been ignoring. We were ignoring them because we were
interested in only which treatment to use. But could these other variables be useful
to us? Maybe we can try investigating to see if they can provide us with any other
useful information.

For this example, let’s try exploring just the stone size variable to see if it can give us
any insights.

Exploring the “stone size” variable
What would be a useful visualisation?

In general, when exploring a variable, we should think about doing some simple
plots or summary statistics. This is part of EDA. This will help us see any immediately
noticeable patterns, giving us insights into the data and expose interesting trends we
might have missed. So it’s a good idea to do this.

Now, selecting the right graph to use can be quite tricky at times. There may be a lot
of choices, and we need to think about which one best suits our needs. In this
example, since we are dealing with a binary categorical variable and we are interested
in associations, it’s a good idea to do a stacked bar chart.

Analysing 2 categorical variables - Plot

[Chart: 100% stacked bar plot for “Outcome” by “Treatment”, all stones. x-axis: Treatment (X, Y); y-axis: Percentage. X: 77.4% Success, 22.6% Failure. Y: 82.6% Success, 17.4% Failure.]

All stones   Success   Failure   Total
X            542       158       700
Y            289       61        350
Total        831       219       1050

Overall, Treatment Y is better.

Firstly, let’s try plotting the overall stacked bar charts for treatment X and treatment
Y across all stone types. We will actually obtain the same chart as the one shown in a
previous unit. This is accompanied by the table on the right, which shows the
number of successes and failures for each treatment type. Looking at the chart, we
can see that there is a higher rate of success for treatment Y as compared to
treatment X. Thus, Treatment Y is POSITIVELY associated to success. This makes
sense. It is what we have concluded earlier.

Notice that we did not take the “stone size” variable into account when plotting
these charts. In other words, we didn’t care if they were large or small stones, we
simply counted the number of successes and number of failures across all stone
types. Thus, this chart actually shows the overall success rates of X and Y.

What if we tried to separate the data according to small and large stones? Will we
still see a similar pattern?

Plot across large stones only

[Chart: 100% stacked bar plot for “Outcome” by “Treatment” for large stones. x-axis: Treatment (X, Y); y-axis: Percentage. X: 72.4% Success, 27.6% Failure. Y: 68.8% Success, 31.3% Failure.]

Large stones   Success   Failure   Total
X              381       145       526
Y              55        25        80
Total          436       170       606

rate(Success | X) > rate(Success | Y)

Across large stones, Treatment X is better.

Let’s try filtering the data for the large stones only. This means we are only counting
the successes and failures of the large stone cases. If we plot the same graph as
earlier, this is the chart we will obtain. Note that the numbers in the chart may not
add to 100% due to rounding off.

Take a close look at the graphs. Is this result surprising? Which treatment is now
positively associated to success?

We notice that the green part is now higher for treatment X than Y. In other words,
there’s a higher proportion of successes in X than Y, meaning that the
rate(success | X) > rate(success | Y).

Therefore, Treatment X is positively associated to success (when looking at the data
for large stones only). This suggests that Treatment X is better than Treatment Y in
treating large kidney stones. This is the opposite of the overall association that we
just saw in the previous slide, which is surprising.

By the way, as a sidenote, we can get the values shown in the chart by directly
calculating the rates off the table. Let’s try this exercise.

Exercise

Large stones   Success   Failure   Total
X              381       145       526
Y              55        25        80
Total          436       170       606

rate(Success | X) = 381 / 526 = 0.724
rate(Success | Y) = 55 / 80 = 0.688

rate(Success | X) > rate(Success | Y)

Treatment X is positively associated to success.
Across large stones, Treatment X is better.

To find the rate of success given Treatment X, we take the number of Treatment X
cases that were successful, which is 381, and divide it by the total number of cases
treated with Treatment X, which is 526. This gives us 0.724, or 72.4%. The rate of
success given Treatment Y is calculated similarly.

We then compare the rates and observe that rate(success | X) > rate(success | Y),
which shows that Treatment X is positively associated to success.

If you need further practice, you can try calculating the rates for the later examples,
and check if you’ve managed to get the same value as shown in the corresponding
bar chart.
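For readers who prefer to check their working by computer, the same calculation can be sketched for all three tables at once. The counts are copied from the tables in this unit; the dictionary layout and function name are choices made here for illustration.

```python
# Success and failure counts per treatment, copied from the tables in this unit.
tables = {
    "All stones":   {"X": (542, 158), "Y": (289, 61)},
    "Large stones": {"X": (381, 145), "Y": (55, 25)},
    "Small stones": {"X": (161, 13),  "Y": (234, 36)},
}


def success_rate(successes, failures):
    """rate(Success | treatment) = successes / total treatments."""
    return successes / (successes + failures)


rates = {
    group: {t: success_rate(*counts) for t, counts in by_treatment.items()}
    for group, by_treatment in tables.items()
}

for group, by_treatment in rates.items():
    shown = ", ".join(f"{t}: {r:.1%}" for t, r in by_treatment.items())
    print(f"{group} -> {shown}")
```

The printed percentages should match the values shown on the corresponding bar charts, up to rounding.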

Plot across small stones only

[Chart: 100% stacked bar plot for “Outcome” by “Treatment” for small stones. x-axis: Treatment (X, Y); y-axis: Percentage. X: 92.5% Success, 7.5% Failure. Y: 86.7% Success, 13.3% Failure.]

Small stones   Success   Failure   Total
X              161       13        174
Y              234       36        270
Total          395       49        444

Across small stones, Treatment X is better.

Coming back to our graphs, let’s now try plotting across small stones only. Looking at
the charts, we again observe that Treatment X is positively associated with success,
because the success rate for Treatment X (92.5%) is higher than that for Treatment Y
(86.7%).

So filtering for small stones only again leads to an opposite result from that of the
overall association.

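Analysing 3 categorical variables - plot

[Chart: 100% stacked bar plot for "Outcome" by "Treatment", sliced by stone size. x-axis: Treatment (X, Y) within Large and Small stones; y-axis: Percentage. Large stones: X 72.4% Success, 27.6% Failure; Y 68.8% Success, 31.3% Failure. Small stones: X 92.5% Success, 7.5% Failure; Y 86.7% Success, 13.3% Failure.]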

For convenience, we can combine the previous two charts side by side, showing that
X is positively associated with success within both the large stone subgroup and the
small stone subgroup.
This type of chart is sometimes referred to as a sliced bar graph. It is a way of
comparing across 3 categorical variables. In this case, the three variables are stone
size, success, and treatment type.

A paradox on our hands

Is X or Y better?
• Overall, Treatment Y is better.
• Across large stones, Treatment X is better.
• Across small stones, Treatment X is better.

In summary, we can see that although Treatment Y is better overall, if we focus only
on the success rates amongst large stones, Treatment X is better than Treatment Y.
Similarly, when we focus on the small stones, Treatment X is better than Treatment Y.
What a strange phenomenon!

This phenomenon is known as Simpson’s Paradox. It is a phenomenon in which a
trend appears in several groups of data but disappears or reverses when the groups
are combined. In a practical sense, we return to the research question that we
really want to know the answer to: Which treatment is better? Which one should we
be using to treat the patients?

Pause here for a moment and think about which treatment you would use… we will
continue our discussion in the next unit.
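While you think about it, the reversal itself can be reproduced with a short sketch. It uses the successes and totals from the large stone and small stone tables and adds them up to get the overall table. The helper name better_treatment is hypothetical, and "better" here only means "has the higher success rate in that table".

```python
# (successes, total treatments) per treatment within each stone-size subgroup.
subgroups = {
    "Large": {"X": (381, 526), "Y": (55, 80)},
    "Small": {"X": (161, 174), "Y": (234, 270)},
}


def better_treatment(counts):
    """Return the treatment with the higher success rate in one table."""
    return max(counts, key=lambda t: counts[t][0] / counts[t][1])


# Combine the subgroups to get the overall table.
overall = {
    t: (sum(subgroups[g][t][0] for g in subgroups),
        sum(subgroups[g][t][1] for g in subgroups))
    for t in ("X", "Y")
}

overall_winner = better_treatment(overall)
subgroup_winners = {g: better_treatment(c) for g, c in subgroups.items()}
reversed_everywhere = all(w != overall_winner for w in subgroup_winners.values())

print(overall_winner)       # Y
print(subgroup_winners)     # {'Large': 'X', 'Small': 'X'}
print(reversed_everywhere)  # True
```

The same counts give one answer when combined and the opposite answer within every subgroup, which is exactly the pattern described above.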

Unit 3B:
Simpson’s Paradox II
By the end of this unit you should be able to
do the following:
1. Explain a Simpson’s paradox

In this unit we will continue our discussion of the Simpson’s paradox.

A paradox on our hands

Is X or Y better?
• Overall, Treatment Y is better.
• Across large stones, Treatment X is better.
• Across small stones, Treatment X is better.

Previously, we found that Treatment Y was positively associated with success overall,
but individually across large and small stones, Treatment X was positively associated
with success. This was an example of a Simpson’s paradox, a phenomenon whereby
the direction of association gets reversed when the groups are combined.

As a sidenote, in this example there are only two subgroups, namely small stones
and large stones. In examples where there are more than two subgroups, we will call
it a Simpson’s paradox as long as a majority of the individual subgroup associations
are opposite to the overall association. For example, if there happen to be three
subgroups, as long as 2 out of 3 of them are opposite to the overall, we will call that
scenario a Simpson’s paradox.
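The working definition above can be written as a small check. This is a minimal sketch, assuming each association has already been labelled "positive" or "negative"; the function name and the three-subgroup examples are hypothetical and are not taken from any data set in this chapter.

```python
def is_simpsons_paradox(overall_direction, subgroup_directions):
    """Apply the working definition from this unit.

    overall_direction must be "positive" or "negative". The scenario counts
    as a Simpson's paradox when a majority of the subgroup directions are
    the opposite of the overall direction.
    """
    opposite = {"positive": "negative", "negative": "positive"}[overall_direction]
    reversed_count = sum(1 for d in subgroup_directions if d == opposite)
    return reversed_count > len(subgroup_directions) / 2


# Kidney stones: Treatment X and success are negatively associated overall,
# but positively associated within both large and small stones.
print(is_simpsons_paradox("negative", ["positive", "positive"]))              # True

# Hypothetical three-subgroup scenarios.
print(is_simpsons_paradox("positive", ["negative", "negative", "positive"]))  # True
print(is_simpsons_paradox("positive", ["negative", "positive", "positive"]))  # False
```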

Coming back to our example, we are left wondering about which treatment was
actually better for the patients. Surprisingly, there is a reasonable answer to this
question.

A paradox explained

• Across large stones, Treatment X is better.
• Across small stones, Treatment X is better.

Here’s one way to think about it.


Suppose that a patient turns up at your clinic, and it turns out that he has a large
kidney stone. What treatment would you give? According to what we found from the
data, you should be giving him treatment X, since X is positively associated with
success.
Now, suppose another patient turns up with a small kidney stone. What treatment
would you give? Again, you should be giving him treatment X for the same reason as
above.

So the answer is that you should be giving Treatment X in every case, which settles
the practical question.
How then do we explain why Treatment Y seems better overall?

Analysing 3 categorical variables - Table

(Successful = successful treatments; Total = total number of treatments; Rate = rate(Success) in %)

     Large stones                Small stones                Total (Large + Small)
     Successful  Total  Rate     Successful  Total  Rate     Successful  Total  Rate
X    381         526    72.4%    161         174    92.5%    542         700    77.4%
Y    55          80     68.8%    234         270    86.7%    289         350    82.6%

To help us explain why the overall success rate of Y is higher, let’s delve a little
deeper into the table of numbers. Here, we have rearranged the table to include the
calculated rates. This table shows us the overall number of successful treatments,
the number of total treatments, and the success rates from the graphs earlier. This is
without taking stone size into account.

If we account for stone size by further dividing the numbers into their subgroups, this
is the table we obtain. It is divided into the large and small kidney stones, shaded in
red and blue respectively. This method of subgroup analysis is called “slicing”.

Let’s look at the rates highlighted in yellow. We can see the Simpson’s paradox
occurring, whereby the rate of success amongst small and large stones is better
individually for treatment X, but the overall rate is better for Y.

Analysing 3 categorical variables - Table

(Successful = successful treatments; Total = total number of treatments; Rate = rate(Success) in %)

     Large stones                Small stones                Total (Large + Small)
     Successful  Total  Rate     Successful  Total  Rate     Successful  Total  Rate
X    381         526    72.4%    161         174    92.5%    542         700    77.4%
Y    55          80     68.8%    234         270    86.7%    289         350    82.6%

Now, let’s look at the numbers highlighted in yellow. We see that Treatment X has
been used to treat mostly patients with large stones (526 large vs 174 small). Thus,
from the basic rule on rates, we know that the overall success rate of X will lie
closer to 72.4% than to 92.5%. As it turns out, the overall success rate of X is 77.4%.

Now, let’s look at the numbers highlighted in green. Treatment Y, on the other hand,
has been used to treat mainly patients with small stones (80 large vs 270 small),
which means the overall success rate of Y will lie closer to its success rate across
small stones, which is 86.7%. As it turns out, the overall success rate of Y is 82.6%.

Because the overall success rate of X is pulled towards 72.4% and the overall success
rate of Y is pulled towards 86.7%, the overall success rate of X ends up lower
than the overall success rate of Y.
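The pull described above can be made concrete with a short calculation. The sketch below takes the counts from the sliced table and rebuilds each treatment's overall success rate as a weighted average of its two subgroup rates, with the share of large stones as the weight. The weighted-average form is an assumption of this sketch rather than a formula given on the slides, and the variable names are illustrative.

```python
# (successes, total treatments) by treatment and stone size, from the sliced table.
data = {
    "X": {"Large": (381, 526), "Small": (161, 174)},
    "Y": {"Large": (55, 80), "Small": (234, 270)},
}

summary = {}
for treatment, by_size in data.items():
    total = sum(n for _, n in by_size.values())
    share_large = by_size["Large"][1] / total   # rate(Large | treatment)
    rate_large = by_size["Large"][0] / by_size["Large"][1]
    rate_small = by_size["Small"][0] / by_size["Small"][1]
    # Weighted average of the two subgroup rates; the weights are the shares
    # of large and small stones among that treatment's patients.
    overall = share_large * rate_large + (1 - share_large) * rate_small
    summary[treatment] = (share_large, overall)
    print(f"{treatment}: {share_large:.1%} large stones, overall success {overall:.1%}")
```

Roughly three quarters of Treatment X cases involve large stones, against less than a quarter for Treatment Y, so the two overall rates are dominated by different subgroups.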

Analysing 3 categorical variables - Table

(Successful = successful treatments; Total = total number of treatments; Rate = rate(Success) in %)

     Large stones                Small stones                Total (Large + Small)
     Successful  Total  Rate     Successful  Total  Rate     Successful  Total  Rate
X    381         526    72.4%    161         174    92.5%    542         700    77.4%
Y    55          80     68.8%    234         270    86.7%    289         350    82.6%

Finally, we notice another interesting thing when looking at the success rates for
each stone size. The success rates for large stones range from 68.8% to 72.4%,
whereas the success rates for small stones range from 86.7% to 92.5%. We see
that in general, the large stones have a lower rate of success than the small stones.
In other words, the large stones are more difficult to treat.

We now have all the pieces of information to make sense of the paradox. We can
summarise it in the following way: Treatment X is in fact better than treatment Y, but
because people have been using treatment X to treat the more difficult cases, this
lowers the overall success rate of treatment X.

ANALOGY

It’s a bit like trying to compare the overall rates of success between a doctor who
treats mostly very simple cuts and abrasions, and another doctor who mostly
performs difficult heart surgeries. Comparing their overall success rates is not a fair
comparison, and we should take the difficulty of the cases into account.

In our example, we want to compare the success rates between Treatments X and Y,
but we should take stone size into account, because the large stones are more
difficult to treat than the small stones. We did that by slicing the data into the small
and large stone subgroups, which revealed to us that treatment X was actually better
than treatment Y.

View from Association

• Treatment X and success: rate(success | X) < rate(success | Y). Negative association.
• Treatment X and large stones: rate(large stone | X) > rate(large stone | Y). Positive association.
• Large stones and success: rate(success | large stone) < rate(success | small stone). Negative association.

We can understand the relationship between the three variables by thinking in terms of
association. The main relationship we were interested in was between the treatment
type and success, in which we found that X was negatively associated with success
rate.

However, this negative association can be explained by the positive association X has
with large stones, which in turn are negatively associated to success. In other words,
Treatment X appeared to be worse than Y because it was used to treat a lot of
difficult cases. We will work through the calculations in the next unit.
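As a preview, the two supporting associations can be checked directly from the sliced table. This is only a sketch using the counts shown earlier in this unit; the helper names are hypothetical.

```python
# (successes, total treatments) by treatment and stone size.
data = {
    "X": {"Large": (381, 526), "Small": (161, 174)},
    "Y": {"Large": (55, 80), "Small": (234, 270)},
}


def total(treatment):
    """Total number of treatments of one type, across both stone sizes."""
    return sum(n for _, n in data[treatment].values())


def success_rate_for(size):
    """rate(success | stone size), pooling both treatments."""
    successes = sum(data[t][size][0] for t in data)
    treated = sum(data[t][size][1] for t in data)
    return successes / treated


# Treatment and stone size: rate(large stone | X) vs rate(large stone | Y).
rate_large_given_x = data["X"]["Large"][1] / total("X")
rate_large_given_y = data["Y"]["Large"][1] / total("Y")

# Stone size and success: rate(success | large) vs rate(success | small).
rate_success_large = success_rate_for("Large")
rate_success_small = success_rate_for("Small")

print(round(rate_large_given_x, 3), round(rate_large_given_y, 3))  # 0.751 0.229
print(round(rate_success_large, 3), round(rate_success_small, 3))  # 0.719 0.89
```

The first pair of rates shows that Treatment X is positively associated with large stones, and the second pair shows that large stones are negatively associated with success.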

Do note here that we still say that X is overall negatively associated with success
rate, despite knowing that X is actually better for treating patients. As a reminder,
when two variables are negatively associated, it just means that when one increases,
the other tends to decrease. We are NOT saying that one variable caused, or led to,
the other one decreasing. So coming back to this example, we are not saying that
treatment X led to lower rate of success, which we know based off the Simpson’s
paradox, is not true.

[Diagram: stone size, the confounding variable, is linked by an association to Treatment X and by an association to the outcome of the treatment.]

Simpson’s paradox ⇒ confounder

Confounder ⇏ Simpson’s paradox

We can see that stone size was a variable that was associated to the other two
variables whose relationship we are investigating, thus affecting the conclusion of
our study. There is a special name given to such variables, and they are known as
“confounders”. More on confounders will be discussed in the next unit.

For now, we will note that when a Simpson’s paradox occurs, it implies that there is
definitely a confounding variable present. This however, does not mean that a
confounder necessarily leads to a Simpson’s paradox.

Summary
We learnt how to analyse 3 categorical variables from the perspective of:
• Tables – slicing by subgroups
• Graphs – sliced bar graph

This unit was all about the Simpson’s paradox. But we can apply what we have learnt
in general to 3 categorical variables. We looked at the problem from different
perspectives.

From a graphical viewpoint, we learnt about the sliced bar graph as a method of
visualising 3 categorical variables.

From a table of numbers perspective, we learnt about how to slice the data by
subgroups to compare rates. This is like an extension of the 2x2 contingency table
from previous units.

With that, we conclude our discussion on Simpson’s paradox.

Unit 4
Confounders
By the end of this unit you should be able to do the following:
1. Define a confounder (i.e. confounding variable)
2. Identify possible confounding variables in a study

In this unit, we will continue our discussion of confounding variables. By the
end of this unit you should be able to:
1. Define a confounder
2. Identify possible confounding variables in a study

Introduction

Have you ever taken one of those surveys before, and felt as if you were wasting
precious time answering lots of background questions on yourself such as your age,
gender, and other personal information? I mean, why can’t surveys just ask the main
question they are interested in and be done with it?

For example, if researchers wanted to know if drinking coffee helps people score
better at math tests, can’t they just collect information on those two variables:
coffee consumption and test score? Why do they still want to know all kinds of
background information, such as your age or your gender, or even your household
income? What is this information even used for?

[Diagram: stone size, the confounding variable, is linked by an association to
each of the two main variables, treatment X and success.]

Definition:
A confounder is a third variable that is associated to both the independent
and dependent variable whose relationship we are investigating

To give us a sense of how such variables can provide useful information, we continue
from our earlier unit on Simpson’s paradox. Recall that in the previous unit we found
out that stone size was a confounder. This confounding variable distorted our
conclusion about the relationship between treatment type and success.

We will define a confounder as a third variable that is associated to the other two
variables whose relationship we are investigating. Note that we do not specify the
direction of association here. As long as the variable is associated in some way to the
main variables, we will call it a confounder.

Let’s work through the calculations that we left out in the previous unit. Firstly, we
will show that stone size is associated to treatment type.

Stone size associated to treatment type

         Large   Small   Total
X          526     174     700
Y           80     270     350
Total      606     444    1050

rate(Large | X) = 526/700 = 0.751
rate(Large | Y) = 80/350 = 0.229

Since 0.751 > 0.229, large stones are positively associated to treatment X.

This is the same way of calculating rates and proving association as we have learnt in
the previous units.

First, we draw up an appropriate 2x2 table of the two variables we are interested in.
In this case, treatment type and stone size.
Next, we calculate the rate of the large stones amongst treatment X, which is the
number of large stones treated by X/total number treated by X = 526/700 = 0.751, or
75%.
We then compare this to the rate of large stones amongst treatment Y, which is the
number of large stones treated by Y/total number treated by Y = 80/350 = 0.229, or
23%.
This means that there is a higher proportion of large stones being assigned to
treatment X as compared to treatment Y. Thus,
Large stones are positively associated to treatment X.

This is just one method of proving association. You can use the symmetry rule to
compare an appropriate pair of rates instead. You may also choose to make use of
software to do the calculation, or show association by a graphical comparison.
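
The rate comparison above can also be done in software. The following is a minimal
Python sketch that uses only the counts from the table; the function and variable
names are illustrative and not part of the course material.

```python
# Counts from the 2x2 table of treatment type against stone size.
table = {
    "X": {"Large": 526, "Small": 174},
    "Y": {"Large": 80, "Small": 270},
}


def rate(table, row, column):
    """Return rate(column | row): the share of `row` that falls in `column`."""
    row_total = sum(table[row].values())
    return table[row][column] / row_total


rate_large_given_x = rate(table, "X", "Large")  # 526 / 700
rate_large_given_y = rate(table, "Y", "Large")  # 80 / 350

# Compare the two rates to decide the direction of the association.
if rate_large_given_x > rate_large_given_y:
    direction = "positive"
elif rate_large_given_x < rate_large_given_y:
    direction = "negative"
else:
    direction = "none"

print(f"rate(Large | X) = {rate_large_given_x:.3f}")
print(f"rate(Large | Y) = {rate_large_given_y:.3f}")
print(f"Association between large stones and treatment X: {direction}")
```

Keeping the counts in a nested dictionary means the same `rate` function can be
reused for any other 2x2 table by changing the row and column labels.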

Stone size associated to success

         Success   Failure   Total
Large        436       170     606
Small        395        49     444
Total        831       219    1050

rate(Success | Large) = 436/606 = 0.719
rate(Success | Small) = 395/444 = 0.890

Since 0.719 < 0.890, large stones are negatively associated to success.

Next we will show that stone size is associated to success.

From the 2x2 table, we can calculate the rate of success given large stones is 0.719,
which is lower than the rate of success given small stones, which is 0.890. In other
words:
Large stones are negatively associated with success.

―― View from Association ――

Treatment X and success (negative association):
rate(success | X) < rate(success | Y)

Treatment X and large stones (positive association):
rate(large stone | X) > rate(large stone | Y)

Large stones and success (negative association):
rate(success | large stone) < rate(success | small stone)

Thus, we end up with this diagram.


We have shown that stone size is associated with treatment type, by comparing
rates. And we have shown that stone size is associated with success.
Therefore, stone size is a confounder, when we are investigating the relationship
between treatment type and success.
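
The two checks can be combined into one short sketch. It assumes that any
difference in rates counts as an association, in line with the definition used in
this unit; the helper name `direction_of_association` is our own.

```python
def direction_of_association(count_a, total_a, count_b, total_b):
    """Compare the rate count_a/total_a with the rate count_b/total_b."""
    rate_a = count_a / total_a
    rate_b = count_b / total_b
    if rate_a > rate_b:
        return "positive"
    if rate_a < rate_b:
        return "negative"
    return "none"


# rate(large stone | X) = 526/700 versus rate(large stone | Y) = 80/350
size_vs_treatment = direction_of_association(526, 700, 80, 350)

# rate(success | large stone) = 436/606 versus rate(success | small stone) = 395/444
size_vs_success = direction_of_association(436, 606, 395, 444)

# A confounder must be associated with BOTH main variables.
is_confounder = size_vs_treatment != "none" and size_vs_success != "none"

print(size_vs_treatment, size_vs_success, is_confounder)
```

With real data two rates are almost never exactly equal, so how large a difference
should count as an association is a judgment that this sketch does not address.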

Recall:

Size    Treatment Type   Outcome
Large   X                Success
Large   X                Success
Small   Y                Success
Large   Y                Failure
Small   X                Success
Large   Y                Success

After slicing, Treatment X is better.

Stone size is a confounder, and we used slicing to control for this confounder, as
covered in the previous unit. This helped us to finally conclude that treatment X was
better, after taking stone size into account.

Luckily for us at that time, the data set did contain information on the stone size for
each patient in that study, so we were able to slice the data by stone size and
compare the success rates within subgroups.

Now, can you imagine if we were not so lucky and didn’t have data on the
stone size? If the entire stone size column were deleted, would we
still be able to use slicing to control for the confounder?

Size    Treatment Type   Outcome
Large   X                Success
Large   X                Success
Small   Y                Success
Large   Y                Failure
Small   X                Success
Large   Y                Success

DO WE STILL OBSERVE SIMPSON’S PARADOX?
No, Treatment Y is better.

We have to measure a variable in order to check if it is a confounder!

If the stone size column were removed, we would not be able to divide the data
into large and small stone subgroups, and therefore slicing would not work. Without
the data on stone size, we are unable to compare the individual subgroup rates the
way we did in the previous unit, and this means we would not have discovered that
stone size was a confounder, or that a Simpson’s paradox was present. Therefore, we
would have concluded that Treatment Y was better.

This leads us to an obvious but important learning point when it comes to designing
our studies:
In order to determine if a variable is a confounder or not, we have to first measure
and collect data on it. If we don’t measure it, we won’t know if it is a confounder or
not. And so, this is one reason why many studies try to collect background
information on their study subjects.

However, if we think about it further, this doesn’t completely guarantee that there
are no confounders left. Let me explain in more detail.

THE PROBLEM
• We must measure a variable in order to check if it is a confounder
• We need to collect data on lots of variables
• This is not feasible (costly, difficult to analyse)

???

The problem we are about to discuss arises mainly for non-randomised studies, such
as observational studies. You may refer back to the previous chapter if you’ve
forgotten what these are.

Let’s summarise what we have learnt. We learnt that in order to check if a variable is
a confounder, we have to first measure it. This would suggest to us that when we
conduct studies, we must collect data on lots of different variables that we suspect
to be confounders. And in fact, even variables that we do not suspect to be
confounders could still be at risk of being confounders. Now, collecting information
on variables is costly in practice, and even if we do manage to collect all the
information we need, the analysis will be difficult as the data needs to be sliced
multiple times and this would cause the sample to be spread too thin across
subgroups.
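
To see why repeated slicing spreads the sample thin, consider a rough calculation.
The sketch below assumes, purely for illustration, that every extra background
variable has two categories and that patients are spread evenly across subgroups;
only the total of 1050 patients comes from the kidney stone example.

```python
total_patients = 1050   # total from the kidney stone example
treatment_groups = 2    # treatments X and Y

# Assumption for illustration: each sliced variable has two categories,
# and patients are spread evenly across all subgroups.
for sliced_variables in range(0, 7):
    subgroups = treatment_groups * 2 ** sliced_variables
    average_size = total_patients / subgroups
    print(f"{sliced_variables} sliced variable(s): "
          f"{subgroups} subgroups, about {average_size:.0f} patients each")
```

Under these assumptions, slicing by six variables already gives 128 subgroups of
about 8 patients each, which is too few to compare rates with any confidence.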

Thus we are stuck with the problem that we can never be fully sure that all
confounders have been measured and controlled for. In other words, we will never
be sure if our conclusion was truly correct. Such studies offer only a limited
conclusion, as we will never know if there are hidden confounders that are affecting
the result. This is also the reason why observational studies are said to only be able
to provide evidence of “Association” and not “Causation”, which was also mentioned
in a previous chapter.

THE PROBLEM
• We must measure a variable in order to check if it is a confounder
• We need to collect data on lots of variables
• This is not feasible (costly, difficult to analyse)

RANDOMISATION

Rather than frantically measuring and slicing data, is there a better way to deal with
this problem?

Another way to address this problem is to do randomised assignment.


Instead of trying to check for confounders one by one, randomised assignment
attempts to work as a general solution across all confounders. Let us examine how
this process works in more detail.

[Diagram: the View from Association diagram, with the positive association between
treatment X and large stones, rate(large stone | X) > rate(large stone | Y),
replaced by “No association”.]

The effect of randomly assigning stone size to treatment type

Fundamentally, a confounder arises from association, which in turn arises from
unequal proportions across groups. In our kidney stone example, stone size was a
confounder partly because the large stones were disproportionately allocated to
treatment X. If we had instead allocated them in the same proportions to treatments X
and Y, then there would no longer be any association between stone size and
treatment type. Thus, stone size would no longer be a confounder. Remember that a
confounder must be associated to both main variables, so removing just one
association is enough.

As we have learnt in chapter 1, random assignment tends to give us roughly equal
proportions across the two groups, thus making it a good solution, especially for
confounders that we did not measure.
However, there are still times where we may be unlucky and end up with a
disproportionate allocation even after random assignment.
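
A small simulation can illustrate how random assignment evens out proportions. The
sketch below takes the 606 large-stone and 444 small-stone patients from our
example and randomly splits them into two equal treatment groups; the equal split,
the seed and the variable names are assumptions made for illustration.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable; the value is arbitrary

# 606 large-stone and 444 small-stone patients, as in the example.
patients = ["Large"] * 606 + ["Small"] * 444

# Random assignment: shuffle, then give the first half X and the rest Y.
random.shuffle(patients)
half = len(patients) // 2
group_x = patients[:half]
group_y = patients[half:]

rate_large_x = group_x.count("Large") / len(group_x)
rate_large_y = group_y.count("Large") / len(group_y)

print(f"rate(Large | X) = {rate_large_x:.3f}")
print(f"rate(Large | Y) = {rate_large_y:.3f}")
```

Both rates should come out close to 606/1050 ≈ 0.577, far from the 0.751 versus
0.229 seen in the observational data. Changing the seed changes the exact figures,
and an unlucky split is still possible.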

Randomisation is not always possible

I want
Treatment X!

Randomisation is not always possible in studies. In our kidney stone example, it’s
difficult to force patients to undergo a treatment that they are unwilling to undergo.
People usually have the right to choose which treatment group they want to be in,
and this makes the assignment process non-random, which in this case makes the
study an observational study. In our example, this resulted in patients with large
kidney stones going for treatment X due to their preference.

So in practice, there may be ethical reasons or other constraints that prevent us
from doing random assignment. And so we have no choice but to rely on the next
best thing, which is the method of slicing for suspected confounders.

Summary

Main variables: treatment type (X) and success
Confounding variable: stone size (prove using association)

Proving Association
rate(A | B) ≠ rate(A | NB)
OR
rate(B | A) ≠ rate(B | NA)
OR
[graphical comparison]

Let’s summarise what we have learnt in this chapter. In unit 3, we continued our
investigation of the relationship between the main variables we were interested in,
which are treatment type and success.

We learnt that “stone size” led to an interesting phenomenon known as “Simpson’s
paradox”, which caused us to amend our view and to conclude that treatment X was
actually better. A Simpson’s paradox implies that stone size is a confounder, but we
can also directly prove that stone size is a confounder, by proving that it is
associated to both of the main variables. Now there are a few methods of proving
association, as we have learnt in units 1 and 2. We can do it by mathematical
calculation of rates, or by the symmetry rule we can compare any equivalent set of
rates, or we can even use the computer to generate graphs. Essentially, all the above
listed methods involve the comparison of rates.
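
As an illustration of the symmetry rule, the sketch below compares both pairs of
rates for stone size and treatment type, using the counts from the earlier table.
Here A stands for "large stone" and B for "treatment X"; this labelling and the
variable names are our own choices.

```python
# Counts from the table of treatment type against stone size.
large_x, small_x = 526, 174   # patients given treatment X
large_y, small_y = 80, 270    # patients given treatment Y

# Pair 1: rate(A | B) versus rate(A | NB), large stones within each treatment.
rate_large_given_x = large_x / (large_x + small_x)   # 526 / 700
rate_large_given_y = large_y / (large_y + small_y)   # 80 / 350

# Pair 2: rate(B | A) versus rate(B | NA), treatment X within each stone size.
rate_x_given_large = large_x / (large_x + large_y)   # 526 / 606
rate_x_given_small = small_x / (small_x + small_y)   # 174 / 444

print(f"rate(Large | X) = {rate_large_given_x:.3f}, "
      f"rate(Large | Y) = {rate_large_given_y:.3f}")
print(f"rate(X | Large) = {rate_x_given_large:.3f}, "
      f"rate(X | Small) = {rate_x_given_small:.3f}")

# Check whether both comparisons point the same way.
same_direction = (rate_large_given_x > rate_large_given_y) == (
    rate_x_given_large > rate_x_given_small
)
print("Same direction:", same_direction)
```

Either pair of rates leads to the same conclusion here: large stones and
treatment X are positively associated.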

After we have used association to prove the existence of confounders, we finally
discussed how it would impact the design of studies in general. We talked about the
difficulty of controlling for confounders, and we linked these concepts back to topics
discussed in chapter 1. And with that, we hopefully have come to understand how to
design studies better.

Chapter 2 end
We learnt how to analyse categorical variables from the perspective of:
• Tables
• Graphs

This concludes the series on categorical variables. We hope you have found these
methods useful in analysing one, two, and three categorical variables. Remember
that at the most basic level, it’s really all about rates.
