
Chapter 12. A Primer to Biostatistics for Busy Clinicians
Michael Glick, D.M.D., and Barbara L. Greenberg, Ph.D., M.Sc.

In This Chapter:

Research Design and Clinical Interpretation


• Experimental Trial
• Observational Studies
Measures of Association
• Mean Difference
• Standardized Mean Difference
• Absolute Risk
• Relative Risk
• Odds Ratio
• Absolute Risk Reduction and Relative Risk Reduction
• Hazard Ratio
Hypothesis and Significance Testing
Confidence Intervals (CIs)
• How to Interpret a CI
Probability and the Normal Curve
• Standard Deviation and Standard Error
Sample Size Considerations
• Why Is Sample Size Important?

Introduction

Scientific literacy is about an understanding of appropriate use of statistics and statistical concepts, as well as recognition of incorrect use.1 In today’s world of rapidly communicated health information, among and by both health care professionals and the lay public, lack of understanding of statistical concepts is troubling and has even been equated to scientific illiteracy.2

Statistics has its jargon, a language that enables data to be translated into useful information and
knowledge that can be communicated among health care professionals and between health care
professionals and patients. Statistics also provides evidence that can inform patient care. It is
important to realize that many statistical terms are the same as everyday words, but their
connotation may be different. Successfully navigating the professional literature requires an
understanding of basic statistical concepts. Although some of these concepts have previously
been addressed in the oral health literature,3 this chapter will provide a primer on commonly
used statistical concepts and relevant research study design issues.

Research Design and Clinical Interpretation


Applied epidemiologic and clinical research can broadly be divided into experimental research, in
which exposure is assigned to a participant, and observational research, in which exposure is not
assigned but is instead “observed” as being present or absent (Figure 12.1).4 If there is a
comparison group in an observational study, it is characterized as analytical, and when no
comparison group is included, as descriptive. The appropriate research design is a function of
the question being asked and logistics. In some instances, it is constrained by available data
and/or resources.

Exposure is a term used to describe a factor that is thought to be associated with or predictive of
an outcome, such as a disease or a condition. For example, examining the association between
sugar (the exposure) and the risk of developing caries (the outcome) may be the aim of a study.

Figure 12.1. Algorithm for Classification of Types of Clinical Research

Adapted from Grimes DA, Schulz KF. “An overview of clinical research: the lay of the land.” The Lancet 2002;359(9300):57-61.

Experimental Trial
A randomized controlled trial (RCT) is considered the gold standard for answering questions of
therapy (that is, determining the magnitude of the beneficial and harmful effects of health care
interventions) and is the most rigorous study design. The hallmark of an RCT is the random
allocation or assignment of study participants to treatment, intervention, or exposure groups. The
main purpose of randomization (that is, randomly allocating trial participants) is to minimize
selection bias on the part of the investigator (see Chapters 3 and 13). In addition, randomization
increases comparability of the treatment groups for variables we can measure, as well as those
we are not aware of or cannot measure, thus minimizing the impact of potential confounders. (A
confounder is a factor that is associated with both the exposure and the outcome but does not lie
in the causative pathway.) However, randomization does not ensure the study groups are indeed
similar for all known confounders, and investigators should always assess comparability of the
study groups at baseline for known relevant clinical and demographic characteristics (risk factors
that are likely related to the exposure and outcome of interest). If the study groups are not
comparable for all important risk factors that could affect the relationship of the exposure and the
outcome, any observed association or difference could be due to a third factor, a confounder,
that is linked to the exposure and the outcome. Another important design element of RCTs is
blinding or masking. In this situation, study participants, clinicians, researchers, outcome
adjudicators, and analysts can be unaware of which treatment group a particular patient has
been assigned to. When both the investigators and the study participants are unaware of the
group assignment, this is sometimes referred to as double blinding. Blinding is an important
design strategy to reduce participant and investigator bias. A well-designed and implemented
RCT can therefore minimize selection bias, information bias, and confounding (see Chapter 13).
One of the advantages of an RCT is the certainty of the temporal relationship (which one comes
first) between an exposure (for example, a treatment or an intervention) and an outcome. A
potential concern with RCTs is the often-restrictive inclusion criteria for participant selection. RCT
participant selection usually targets one specific condition among a select demographic who,
other than the condition of interest, are considered healthy. Therefore, results from RCTs may sometimes be difficult to generalize or apply (external validity) to the total population from which the study participants were selected. The total population is likely to include many characteristics, risk factors, or other conditions that were excluded from the study population, so the study results may or may not apply to it. For example, applicability is threatened when a study of the success rate of immediately versus nonimmediately placed dental implants excludes participants based on factors that are common in the general population, such as smoking, systemic diseases, medications, or periodontal disease, or excludes certain genders or age groups.

Observational Studies
Analytical observational studies include cohort studies where a group of individuals with and
without the exposure of interest are followed prospectively (forward in time), case-control studies
where individuals with or without the outcome of interest (cases and controls, respectively) are
traced backward in time to determine possible exposure, and cross-sectional studies where
exposure and outcome are measured at the same time (Figure 12.1). Unlike RCTs, the exposure
in observational studies is not assigned but is observed in groups of interest as it happens
naturally.

Cohort studies are prospective in nature and compare outcomes in groups (cohorts) of
participants with an exposure to a similar group of participants without an exposure (but having
the same risk for developing the outcome). It is important to note that participants in the
“exposed” group and the “unexposed” group need to be similar in all aspects except for their
exposure (that is, they have the same characteristics). In this study design, it is also important to
establish that the study participants are free of the disease or outcome of interest at the start of
the study and to have a clear, measurable definition of the outcome. For example, two groups
(cohorts) of children, one group that drinks sugar-sweetened beverages (SSBs, the exposure)
and one group that does not drink SSBs, are followed forward in time. The two groups should be
similar except for the fact that participants in the “exposure” group consume SSBs and
participants in the control group do not consume SSBs. The outcome of interest is the
development of caries, which will be assessed after a specific period (for example, two years).
Because all participants are free of the outcome (caries) at the onset of the study, this type of
study design can determine if the outcome was associated with the exposure based on the
difference in incidence (the development of the disease [caries] over a specific time period)
between the two groups. Cohort studies can determine incidence rates in the exposed and
nonexposed groups.

In case-control studies, researchers will observe an outcome and then retrospectively try to
determine the presence of past exposure. In this study design, the cases are those with the
outcome of interest and the controls are a comparable group without the outcome of
interest—but, it is important to note, with the same characteristics as the cases. Although the
selection and source of cases and appropriate controls are critical elements in case-control
studies, it is beyond the scope of this chapter to discuss this concern. Using a similar example to
the one above, cases of children with caries (those with the outcome) are compared with children
without caries (those without the outcome) to determine if an exposure, such as consumption of
SSBs, is associated with the presence of caries. Information about prevalence rates or incidence
rates cannot be determined by a case-control study design as the cases and the controls are not
measured from a population-based sample and there is no information on the temporal
relationship between exposure and outcome.

A cross-sectional study will assess the presence or absence of an exposure and the presence or
absence of an outcome at a particular time (that is, the prevalence of the exposure and the
prevalence of the outcome at the same point in time). Researchers may determine, at one
particular time, the presence of children with or without caries who drink or do not drink SSBs. As
this is a snapshot in time, it is not possible to know if the consumption of the SSBs occurred prior
to the development of caries (a temporal relationship), and accordingly, it is not possible to
determine whether drinking SSBs is associated with the development of caries. Cross-sectional
studies cannot be used to claim any causative relationships and are generally used to help guide
development of research questions.

Case reports and case series are purely descriptive and may, in a similar manner to other
observational studies, generate hypotheses about exposure and outcomes that need to be tested
with more complex study designs of greater rigor. Descriptive studies can be used to monitor the
health of populations but cannot be used to assess associations.

Measures of Association
Measures of association quantify the relationship (an analysis of comparison) between
exposure(s) and outcome(s) among groups. There are several different measures of association,
such as mean difference (MD), standardized mean difference (SMD), absolute risk (AR), relative
risk (RR), odds ratio (OR), and hazard ratio (HR). Effect size quantifies a measure of association
as the size of the difference between groups (for example, the MD in number of teeth between
two groups) or an estimate of a treatment’s efficacy as a proportion of the reduction or increase
in the outcome of interest in the intervention and control group (for example, the relative increase
or decrease in developing caries after consuming or not consuming SSBs). An effect size can be
standardized by dividing the measure of effect by the standard deviation (SD) of their difference
(see below for a description of SD).

Mean Difference

The MD, or the “difference in means,” measures the absolute difference between the mean values in two study groups. It quantifies the average amount by which the study intervention changes the outcome in the study/treatment/intervention group compared with the control group. Because this estimate is created by subtracting the mean of one group from the mean of the other group, an MD of 0 indicates no difference between the experimental and control groups.

Standardized Mean Difference


The SMD is a summary statistic often used in meta-analyses when the studies all assess the
same outcome but measure it on different scales (for example, measuring pain with two different
types of visual analog scales). In this situation, the results of the different studies must be
standardized to a uniform scale before they can be combined and compared and the results
summarized. The SMD quantifies the intervention effect in each study relative to the variability
observed in the particular study. In meta-analyses, the SMD is calculated for each study in the
meta-analysis and then pooled to get an overall SMD. An SMD of 0 indicates there is no
difference among groups.
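As a minimal illustration (not from the chapter), one common form of the SMD is Cohen's d, the difference in means divided by a pooled SD; the sketch below uses invented pain scores for two groups:

    import math

    # Standardized mean difference (SMD), Cohen's d form; data invented.
    mean_a, sd_a, n_a = 4.2, 1.1, 30   # e.g., pain on scale A
    mean_b, sd_b, n_b = 3.5, 1.4, 30
    pooled_sd = math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2)
                          / (n_a + n_b - 2))
    smd = (mean_a - mean_b) / pooled_sd
    print(round(smd, 2))  # ~0.56: the groups differ by about half an SD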

Absolute Risk
Understanding the difference between probability and odds is essential in order to be able to
interpret AR, RR, and OR. A probability is the chance of an event occurring as a ratio of all
events. (For example, the probability of getting a 4 when tossing a six-sided die is the ratio of the
event occurring [tossing a 4] to all possible events [tossing a 1, 2, 3, 4, 5, or 6], which equals
1/6). A probability can be any number between 0 and 1.

The odds is the chance that a particular event occurs versus the chance that it does not occur, or
the ratio of the number with the event to the number without the event. For example, the odds of
tossing a 4 is the ratio of the chance (probability) of getting a 4 (1/6) to the probability of not
getting a 4 (5/6) [ (1/6)/(5/6) ], which equals 1/5. In other words, it is the probability of an event
occurring to the probability of that event not occurring. Odds can be any number between 0 and
infinity.
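As a minimal sketch, the die example can be expressed in a few lines of Python (the helper names here are ours, not standard functions):

    # Converting between probability and odds (six-sided die example).
    def probability_to_odds(p):
        # Odds = probability of the event / probability of no event.
        return p / (1 - p)

    def odds_to_probability(odds):
        # Probability = odds / (1 + odds).
        return odds / (1 + odds)

    p_four = 1 / 6                      # probability of tossing a 4
    print(probability_to_odds(p_four))  # 0.2 = 1/5, the odds
    print(odds_to_probability(1 / 5))   # 0.1666... = 1/6, back to the probability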

Risk in statistical terms suggests the probability, or the chance, that an event will occur, without
any inference to whether it has a good or bad outcome. A measure of risk, or probability, is
expressed as a number between 0 and 1, or as a percentage. Several different connotations of
risk are used in biostatistics to describe different associations, specifically relationships between
an exposure and an outcome.

As an example, we want to know the relationship between consuming SSBs and the
development of caries. In a hypothetical study, one group of 1,000 children who are not
consuming SSBs is followed for two years, and another group of 1,000 children, with the same
risk factors for developing caries as the first group but who are consuming SSBs, is also followed
for two years (Table 12.1). The AR is the number of children who develop caries in each group
divided by the total number of children in the group during the designated study period (Table
12.1a). Using the data from Table 12.1, we can state that “not consuming SSBs is associated
with an AR of developing caries of 15% (150 out of 1,000) at some point during two years” and
“the AR of developing caries when consuming SSBs is 65% (650 out of 1,000) over a time span
of two years.”
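In a short Python sketch using the hypothetical Table 12.1 counts, the two ARs are simple proportions:

    # Absolute risk (AR) from the hypothetical Table 12.1 counts.
    caries_ssb, n_ssb = 650, 1000        # children who drink SSBs
    caries_no_ssb, n_no_ssb = 150, 1000  # children who do not drink SSBs

    ar_exposed = caries_ssb / n_ssb          # 0.65, or 65%
    ar_unexposed = caries_no_ssb / n_no_ssb  # 0.15, or 15%
    print(f"AR with exposure: {ar_exposed:.0%}; AR without exposure: {ar_unexposed:.0%}")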

Table 12.1. A Hypothetical Study Involving 2,000 Children—1,000 Who Drink Sugar-Sweetened Beverages (SSBs) and 1,000 Who Do Not Drink SSBs—Over a Time Period of Two Years

                                  Developed caries   Did not develop caries   Total
Drink SSBs (exposed)                         650                      350     1,000
Do not drink SSBs (unexposed)                150                      850     1,000

Table 12.1a. Absolute Risk and Absolute Risk Reduction Based on the Data Presented in the Hypothetical Study in Table 12.1

Absolute risk (AR) of developing caries when drinking SSBs (risk with exposure) = 650/1,000 = 0.65 (65%)

AR of developing caries when not drinking SSBs (risk without exposure) = 150/1,000 = 0.15 (15%)

Absolute risk reduction (ARR) (the risk reduction of developing caries when switching from drinking to not drinking SSBs) = 0.65 − 0.15 = 0.50 (50%)

Relative Risk
The relative risk (RR), also known as the risk ratio, is the proportion of participants who
developed the outcome in the cohort with the exposure as a ratio of the proportion of participants
who developed the outcome in the cohort without the exposure (Table 12.1b). It can also be
defined as the probability of an outcome occurring in a treatment, or intervention, group divided
by the probability of an outcome occurring in a comparison, or control, group, or vice versa. In
other words, the RR is the incidence of the outcome in the exposed group relative to the
incidence of the outcome in the nonexposed group and provides a measure of the risk of
developing disease if exposed. The RR is the measure of association for cohort studies and
clinical trials (Table 12.2). Using the data and the formula in Table 12.1b, we can state, “There is
an RR of developing caries of 4.33, over a period of two years, if consuming SSBs compared
with not consuming SSBs,” or, “People consuming SSBs have 4.33 times the risk of developing
caries compared with those not consuming SSBs, over a period of two years,” or conversely,
“People not consuming SSBs have 0.23 times the risk of developing caries compared with those
consuming SSBs, over a period of two years.” An RR of 1 suggests no difference in risks, an RR
of more than 1 indicates an increased risk, and an RR of less than 1 indicates reduced risk.
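Continuing with the same hypothetical numbers, a minimal sketch of the RR calculation:

    # Relative risk (RR) = AR in the exposed / AR in the unexposed.
    ar_exposed, ar_unexposed = 650 / 1000, 150 / 1000
    rr = ar_exposed / ar_unexposed          # 4.33: SSB drinkers have 4.33x the risk
    rr_reverse = ar_unexposed / ar_exposed  # 0.23: non-drinkers have 0.23x the risk
    print(round(rr, 2), round(rr_reverse, 2))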

Table 12.1b. Relative Risk and Relative Risk Reduction Based on the Data Presented in the Hypothetical Study in Table 12.1

Relative risk (RR), or risk ratio, of developing caries when drinking SSBs compared with developing caries when not drinking SSBs = 0.65/0.15 = 4.33

Relative risk, or risk ratio, of developing caries when not drinking SSBs compared with developing caries when drinking SSBs = 0.15/0.65 = 0.23

Relative risk reduction (RRR) if not drinking SSBs = 1 − 0.23 = 0.77 (77%)

Odds Ratio
Because case-control studies do not have a true denominator of “at risk” individuals and the
temporal relationship of exposure to an outcome is not clearly established, case-control studies
cannot use the RR as a measure of association and will instead use the odds ratio (OR) as a
measure of association (Table 12.2).

The OR in a case-control study is the ratio of the odds of individuals in the disease group having
the exposure divided by the odds of individuals in the comparison group having the exposure (
Table 12.1c); in other words, it is the odds of having the exposure in the cases compared with
the odds of having the exposure in the controls.

When warranted, odds can be converted to risks and subsequently to RR (Table 12.3). An OR
approximates an RR when the prevalence of disease is low, typically below 10%.5 (This is
illustrated in Table 12.3, where it is apparent that when the risk is low, the odds closely approximate the risk.) An RR is an inappropriate measure in a case-control study.

Cohort studies can also use an OR as the measure of association; in this case, the OR is the odds of experiencing the outcome or disease in the group exposed to a risk factor compared with the odds of experiencing the outcome or disease in the group not exposed to the same risk factor (Table 12.1c). Results from RCTs are usually reported as an RR or as an OR. In RCTs, ORs are interpreted similarly to ORs in cohort studies (Table 12.2).
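The OR calculations, and the odds-to-risk conversion behind Table 12.3, can be sketched as follows, again with the hypothetical Table 12.1 counts; note that because caries is common in this example, the OR of about 10.5 is far larger than the RR of 4.33:

    # Odds ratio (OR) from the hypothetical 2x2 table:
    #                    caries   no caries
    # Drink SSBs            650         350
    # Do not drink SSBs     150         850

    # Cohort framing: odds of the outcome in exposed vs. unexposed.
    or_cohort = (650 / 350) / (150 / 850)        # ~10.52

    # Case-control framing: odds of exposure in cases vs. controls.
    or_case_control = (650 / 150) / (350 / 850)  # the same ~10.52

    # Converting odds back to risk: risk = odds / (1 + odds).
    odds_exposed = 650 / 350
    risk_exposed = odds_exposed / (1 + odds_exposed)  # 0.65, matching the AR
    print(round(or_cohort, 2), round(or_case_control, 2), risk_exposed)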

Table 12.1c. Odds and Odds Ratios Based on the Data Presented in the Hypothetical Study in Table 12.1

Case-Control Study
Odds of exposure (drinking SSBs) among the cases (children with caries) = 650/150 = 4.33
Odds of exposure among the controls (children without caries) = 350/850 = 0.41
Odds ratio (OR) = (650/150)/(350/850) = 10.5

Cohort Study
Odds of developing caries when drinking SSBs = 650/350 = 1.86
Odds of developing caries when not drinking SSBs = 150/850 = 0.18
Odds ratio (OR) = (650/350)/(150/850) = 10.5
Table 12.2. Interpretation of Relative Risk and Odds Ratio in Different Study Designs

Experimental Studies

Randomized Trial (for comparing treatments or interventions)
• Relative risk: The risk of developing disease among those who are exposed relative to the risk of developing disease among those who are not exposed; the ratio of the incidence of new disease among the exposed relative to the non-exposed. An RR > 1 suggests the exposure is a risk factor for developing the disease, an RR < 1 suggests the exposure is protective against developing the disease, and an RR = 1 suggests there is no association between the exposure and disease.
• Odds ratio: The odds that an exposed person develops disease relative to the odds that a non-exposed person develops disease. An OR > 1 suggests the exposure is positively associated with the disease, an OR < 1 suggests the exposure is negatively associated with the disease, and an OR = 1 suggests no association between exposure and disease.

Observational Studies

Cohort Study
• Relative risk: As for a randomized trial: the risk of developing disease among those who are exposed relative to the risk of developing disease among those who are not exposed; the ratio of the incidence of new disease among the exposed relative to the non-exposed. An RR > 1 suggests the exposure is a risk factor for developing the disease, an RR < 1 suggests the exposure is protective against developing the disease, and an RR = 1 suggests there is no association between the exposure and disease.
• Odds ratio: The odds that an exposed person develops disease relative to the odds that a non-exposed person develops disease, interpreted as for a randomized trial.

Case-Control Study
• Relative risk: Cannot be calculated directly.
• Odds ratio: The odds of those with disease having been exposed relative to the odds of those without disease having been exposed. An OR > 1 suggests the exposure is positively associated with the disease, an OR < 1 suggests the exposure is negatively associated with the disease, and an OR = 1 suggests no association.

Cross-Sectional Study
• Relative risk: Cannot be calculated.
• Odds ratio: Cannot be calculated unless there is a comparison group; in that case, the interpretation is similar to that for a case-control study.

Case Report/Case Series
• Relative risk: Cannot be calculated.
• Odds ratio: Cannot be calculated.

Table 12.3. Converting Odds to Risk and Risk to Odds

Risk = odds/(1 + odds)
Odds = risk/(1 − risk)

When the risk is low (typically below 10%), the odds approximate the risk, and the OR approximates the RR.

Absolute Risk Reduction and Relative Risk Reduction
Understanding the difference between AR and RR, and absolute risk reduction (ARR) and
relative risk reduction (RRR), is important in order to make appropriate clinical decisions. In the
example in Table 12.1, there is a different AR for having caries among children who do not
consume SSBs compared with the AR for having caries among children who consume SSBs.
The ARR is the difference between the AR in the test group and the AR in the control group. As seen from the
data in Table 12.1a, not consuming SSBs is associated with an ARR of having caries of 0.50
(0.65 minus 0.15), or stated differently, “not consuming SSBs will reduce the AR for having caries
from 650 in 1,000 (65%) to 150 in 1,000 (15%)” or “500 fewer cases of caries can be expected
among 1,000 patients who do not consume SSBs compared with 1,000 patients who consume
SSBs over a period of two years.”

Looking again at Table 12.1, there is a relationship between the two ARs that can be quantified
with RR: the proportion, or relative change, between the AR for caries among children who do
not consume SSBs and the AR among children who consume SSBs (Table 12.1b).

In the hypothetical study depicted in Table 12.1, the RRR is the risk reduction for developing
caries associated with not consuming SSBs (Table 12.1b). As the RR of not consuming SSBs
compared with consuming SSBs is 0.23, the RRR associated with not consuming SSBs is 77% (1 minus 0.23),
or a 77% reduction in the risk of developing caries in the group that is not consuming SSBs
compared with the group that is consuming SSBs.

Although the RRR was 77%, the ARR was 50% (65% minus 15%). The difference between the
RRR and the ARR is more dramatic when the prevalence of a disease is low. For example, if the

AR for developing caries is 1.6% and a particular diet would decrease this risk to 1%, there is an
RRR of 37.5% but an ARR of only 0.6%. Both of these concepts will inform practice, but in
different ways.
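The low-prevalence example works out as follows in a minimal sketch:

    # ARR vs. RRR when the baseline risk is low (example from the text).
    ar_baseline, ar_with_diet = 0.016, 0.010
    arr = ar_baseline - ar_with_diet                  # 0.006 -> 0.6 percentage points
    rrr = (ar_baseline - ar_with_diet) / ar_baseline  # 0.375 -> 37.5%
    print(f"ARR = {arr:.1%}, RRR = {rrr:.1%}")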

Hazard Ratio
The HR is another measure of association that deals with time-to-event data, also known as
survival data. Hazard is the instantaneous event rate, which is expressed as the probability for an
individual to have an event of interest at a particular time (assuming they are event-free up to
that time). The HR quantifies risk as the ratio of hazards in the treatment group and the control
group at a particular point in time. It is the hazard of developing an event in the intervention
group relative to the hazard of developing the event in the control group at any particular time
along the follow-up period. An HR of 1.0 means the event rates are the same in both groups; an HR of 2.0 means that, at any particular time during the study follow-up, patients in the treatment group are experiencing the event at twice the rate of the control group. An HR of 0.5 means that, at any particular time, patients in the treatment group are experiencing the event at half the rate of the controls. In a hypothetical clinical study, a reported HR of 0.45 means that patients in the treatment group, at any point in time along the follow-up, are 55% less likely to experience the event. Whereas the HR takes into account not only the total number of events but also the timing of each event (that is, the event rate), the RR measures the cumulative risk over the total time period of interest.
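A rough sketch of the idea, under the simplifying assumption of constant (exponential) hazards, where each group's hazard is estimated as events per unit of person-time; the counts here are invented for illustration:

    # Hazard ratio (HR) under an assumed constant-hazard model.
    events_treatment, person_years_treatment = 18, 400.0
    events_control, person_years_control = 40, 400.0

    hazard_treatment = events_treatment / person_years_treatment  # events/person-year
    hazard_control = events_control / person_years_control
    hr = hazard_treatment / hazard_control                        # 0.45
    print(f"HR = {hr:.2f} ({1 - hr:.0%} lower event rate at any given time)")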

Hypothesis and Significance Testing


When researchers are trying to determine whether an association exists between two factors (for
example, consuming or not consuming SSBs), or between patients’ characteristics (for example,
patients with low education levels or high education levels), and the presence of an outcome, the ideal situation would be to recruit into the study the whole population to which the results would be applied. It is not difficult to understand that such an approach would pose serious implementation issues and would require a massive amount of resources and time.
As a way to solve this conundrum, researchers take a sample, a portion of the whole population,
expecting that this sample will provide good representation of the individuals, factors, or
characteristics under study. Extrapolating the study sample findings to the whole population is
called inferential statistics. (“Population” is a term used in statistics to describe the entire
“universe” of individuals from which researchers draw their study sample.) Inferential statistics
differ from descriptive statistics, where collected data are only used to describe the study sample
without making inferences to a population.

Users of the dental literature will find that there are two types of hypotheses: the null hypothesis
(H0) and the research (or alternative) hypothesis (Ha). The null hypothesis states that there is no (that is, null) association between the predictor, or exposure, and the outcome variable, and therefore no difference in the outcome between the study groups; any observed difference is due to chance alone (Box 12.1 and Figure 12.2).

Box 12.1. Examples of Hypotheses


Null hypothesis (H0)
The incidence of caries in the group of children consuming sugar-sweetened beverages (SSBs)
compared with those children not consuming SSBs is the same.

Research (or alternative) hypothesis (Ha)
1. The incidence of caries in the group of children consuming SSBs compared with those children not
consuming SSBs is different.
2. The incidence of caries in the group of children consuming SSBs is higher compared with those
children not consuming SSBs.
3. The incidence of caries in the group of children consuming SSBs is lower compared with those
children not consuming SSBs.

Figure 12.2. Hypothesis Testing

The null hypothesis is the basis for statistical testing because researchers only have to test against one value, a 0% difference; the question is whether the observed study data are consistent with the null hypothesis, which states that there is no difference. The alternative hypothesis can be nondirectional, stating only that there is an association (or a difference), or directional (better or worse, higher or lower) (Box 12.1 and Figure 12.2).

Statistical significance testing is the method used to support or reject inferences based on
observed sample data. In other words, from the observed study sample, can researchers infer
what is “true” for the population? In terms of hypothesis testing, the purpose of statistical testing
is to determine whether the observed study data support (warrant) rejecting or not rejecting the
null hypothesis. Failure to reject the null hypothesis is not exactly the same as accepting the null hypothesis; the correct conclusion is that there is not enough evidence to reject the null hypothesis of no association/difference. Nonsignificant results mean that there is not enough evidence to reject the null hypothesis and to suggest that there is an association/difference.
(“Enough” is quantified by a predetermined value or level called the alpha level [see more about
the alpha level below]). If the results of a study to assess if there is an association/difference are
not significant, concluding that the two groups under comparison are the same or equivalent is
incorrect and a typical misunderstanding. If the statistical testing suggests the study observations
are consistent with the null hypothesis of no association/difference, the null hypothesis is not
rejected and one cannot state that there is a difference between the groups. If the statistical
significance testing indicates that it is unlikely that the study results are consistent with the null
hypothesis (that is, it is unlikely that there is no difference), the null hypothesis is rejected and we
proceed with the alternative hypothesis, or we reject the null hypothesis in favor of the alternative
hypothesis.

Significance testing is based on a designated, or predetermined, alpha level and calculated probability (P) values. The alpha level is the probability chosen by the investigator to be the
threshold of statistical significance, or stated differently, the predetermined threshold of
confidence to reject or not reject the null hypothesis. It is an arbitrary value, but the accepted
convention for most studies is to set the alpha at 0.05. The alpha level is also the probability of
committing a type I error given the null hypothesis is true—concluding that there is a difference when in fact there is no true difference, also known as a false-positive result (Figure 12.2). An alpha of 0.05 suggests that 5% of the time the
investigator will conclude that there is an association or difference, when in fact there is not a real
association/difference and the observed results are merely due to chance. Another way of
interpreting an alpha of 0.05 is to say one is willing to accept a 5% risk of committing a type I
error (getting a false-positive result), or that 5% of the time you will falsely conclude there is a
difference when in fact there really is no true difference (Figure 12.2).

A P value less than alpha is considered statistically significant, and the null hypothesis is rejected
in favor of the alternative hypothesis; a P value equal to or greater than alpha is not considered
statistically significant, and based on the study data, the null hypothesis is not rejected. The P
value is the probability that an outcome (result) as extreme (or as unusual), or even more
extreme (or more unusual), as that obtained from the study could have occurred by chance alone
assuming the null hypothesis is true. The P value is the measure of evidence for or against the
null hypothesis (no association/difference) and indicates the observed results are either
statistically significant or not statistically significant. It is a binary, yes or no, outcome; for this
reason, it is incorrect to say that there is a trend toward statistical significance or results are
marginally significant. For example, P values of 0.03 and 0.04 are both statistically significant,
but one is not more significant than the other. Also, a P value above but close to 0.05 should not
be interpreted as “marginally statistically significant” or a “trend toward significance.”

Although the P value is frequently used, it is often misinterpreted.6 Therefore, it is important to be aware of what it does not tell you. The P value does not provide any indication of the direction
(decreased or increased) or magnitude of the association/difference; it also does not indicate the
potential clinical or practical significance of the results (see Chapter 11). Furthermore, the P
value does not indicate the probability that the null hypothesis is true, how true the study results
are, or how likely the results are to be true. There is a consensus in the scientific community that
the P value is often misinterpreted and that when using a P value, it should always be
accompanied by confidence intervals. CIs provide information about the statistical significance
along with information about the direction and strength of the association or magnitude of effect
(see the following section on CIs).

Confidence Intervals (CIs)
When faced with clinical choices, patients and clinicians would like as much information as
possible in order to make an informed decision. Consider an example of a clinician talking to a
patient about placing an implant, in which the patient wants to know if this procedure is
associated with any marginal bone loss (MBL). In a hypothetical example, a study shows that the
mean MBL, using the particular implant offered to the patient, was 5 mm. This is important
information, but even more informative would be to know the range of MBL reported in the study.
The 95% CI in the study ranged from 1 to 9 mm, or the mean ±4 mm. Thus, although the mean
MBL was 5 mm, the expected MBL could be as much as 9 mm and as little as 1 mm. Knowing
this information, the patient may decide that the risk of 9 mm MBL is too much, or decide to
accept the risk and agree to have the implant placed as there is also a chance that the MBL will
be much less (as little as 1 mm). Thus, a CI will provide additional information to evaluate the
clinical relevance of the study findings (see Boxes 12.2 and 12.3).

The level of confidence is the certainty that a range, or interval of values, contains the “true” or
accurate value of a population that would be obtained if the experiment were repeated multiple
times. In other words, a 95% CI purports that if a CI is calculated for numerous samples
(experiments), we would expect that the true population value (mean, proportion, etc.) would be
included in 95% of the sampled CIs. A 95% CI is commonly used in clinical and applied research,
but if more confidence is desired, researchers can use a CI larger than 95%, such as 99%. A CI
can only be correctly interpreted if we use a random sample, if the samples are selected
independently of each other, if the data are accurately measured, and if the variable being
measured is the right one to make an inference about the population. For example, we cannot
use a measure of bone loss from numerous samples to make an inference about implant failure
in the population. As an aside, it is not correct to state that “there is a 95% probability that the
population mean is contained within a specific 95% CI.” Such a statement would imply that the
population mean could be different depending on which CI we examine. However, we have only
one “true” population mean, which does not change, but we have several different samples, with
CIs, that may or may not contain the population mean. Thus, the 95% probability is about the CIs
and not about the fixed, or “true,” population mean, or stated correctly, “There is a 95%
probability that this sample’s specific 95% CI contains the population mean.”
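Using the marginal bone loss example discussed below in Figure 12.4 and Box 12.2 (mean 0.88 mm, SEM 0.12 mm), a minimal sketch of constructing a 95% CI around a mean:

    # 95% CI around a sample mean: mean +/- 1.96 x SEM.
    mean_mbl, sem = 0.88, 0.12   # mm, from the MBL example
    z = 1.96                     # confidence coefficient for 95%
    margin_of_error = z * sem    # ~0.24 mm
    ci_lower = mean_mbl - margin_of_error
    ci_upper = mean_mbl + margin_of_error
    print(f"95% CI: {ci_lower:.2f} to {ci_upper:.2f} mm")  # 0.64 to 1.12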

How to Interpret a CI
A CI provides information about both the direction and magnitude of a treatment effect, which
provides more information on the clinical importance than the P value, as a P value can only
provide a “yes” or “no” answer on whether to reject or not reject the null hypothesis (see the
section on P value). All CIs are reported around a point estimate (a specific sample mean, mean
difference, etc., which is computed from the data of the experiment and is the best estimate for
the observed data) and lower and upper limits (confidence limits or boundaries).

The width of a CI may be determined by several different factors. The obvious one is the sample
size (Table 12.4a). Increasing the number of participants would increase our confidence that the
“true” effect is closer to our measured (observed) effect. The outcome of an experiment with only
four participants would provide us with a measure that may be close to the “true” effect, but
enrolling 100 participants would enhance our confidence that our observed result is even closer
to the “true” effect. However, there may be other factors that could affect the width of a CI.

If researchers conducted experiments to compare implant failures using an experimental technique with failures using a conventional technique, they could calculate the ARR and RRR to
determine which technique is associated with a better or worse outcome. If the ARR (the
difference between the AR in the experimental group and the AR in the control group) increased,
yet the sample size in all experiments remained the same, the RRR might not change but the
width of the CI will narrow (Table 12.4b). The difference between the AR in the experimental and
the control group will determine the width of the CI (that is, the precision). In this case, it is the
ARR and not the sample size that will determine the width of the CI, where a higher ARR will
provide a more precise CI. Thus, both sample size and the number of events observed are key
determinants of the width of a CI.
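One common way to construct a CI for an RR is the log method, in which the standard error of ln(RR) is sqrt(1/a − 1/n1 + 1/c − 1/n2) for a events among n1 exposed and c events among n2 unexposed. The sketch below (our illustration, with invented failure counts) shows the interval narrowing as the sample size grows while the RR stays at 0.8:

    import math

    def rr_ci(a, n1, c, n2, z=1.96):
        # 95% CI for RR = (a/n1)/(c/n2) via the log method.
        rr = (a / n1) / (c / n2)
        se_log_rr = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
        lower = math.exp(math.log(rr) - z * se_log_rr)
        upper = math.exp(math.log(rr) + z * se_log_rr)
        return rr, lower, upper

    for a, n1, c, n2 in [(40, 100, 50, 100), (400, 1000, 500, 1000)]:
        rr, lo, hi = rr_ci(a, n1, c, n2)
        print(f"n = {n1} per group: RR = {rr:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
    # n = 100:  RR = 0.80, 95% CI 0.59 to 1.09 (crosses 1: not significant)
    # n = 1000: RR = 0.80, 95% CI 0.73 to 0.88 (excludes 1: significant)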

Table 12.4a. Sample Size and Width of a Confidence Interval (CI)

The five different studies above illustrate that an increase in sample size will result in a narrower confidence interval constructed around the relative risk reduction (RRR) (that is, provide more confidence that the “true” RRR for the implant failure rate is close to our observed 10% absolute risk reduction and 20% RRR).

Table 12.4b. Absolute Risk Reduction and Width of a Confidence Interval (CI)

The five different studies above show that although the sample size and the relative risk reduction (RRR)
remained the same, the width of the confidence interval (CI) constructed around the RRR changed. This is due to
an increase in the absolute risk reduction (the difference between the absolute risk in the control group and the absolute risk in the experimental group), which will result in a narrower, and thus more precise, CI. Accordingly, in
this scenario, the number of events, and not the sample size, affected the width of the CI.

There are several other factors that impact the width of the CI. The width of the CI decreases as
the sample size increases; an increase in the SD will increase the width of the CI; and, all
variables remaining equal, increasing the level of confidence desired from 95% to 99% will
increase the width of the CI. Another factor that can impact the width of the CI is the level of
significance; as the significance level decreases (for example, 0.05 to 0.01) with all other
variables being equal, the width of the CI increases.

Treatment effect can be measured by an RR or OR among other measures of association (see the section on measures of association). If the CI constructed around an RR or OR includes 1,
the result is not statistically significant as 1 is the null value for a ratio (Table 12.5), indicating that
the “true” difference may be 1, which for a ratio indicates no difference between the two groups
compared. This could also raise questions about the potential clinical importance of the finding. If
we are comparing means, an MD (the mean of one study group minus the mean of the other
study group) of 0 is the null value and would suggest no difference. A CI surrounding the MD that
crosses the null value of 0 is considered not statistically significant. This concept applies to
differences of SMD, ARR, HR, and other measures of association. Statistical significance can
thus be assessed using both P values and CIs.

For more information on the calculation of CIs, see Box 12.2, and for more discussion on CIs,
significance, and clinical implications, see Box 12.3.

Table 12.5. Assessing Statistical Significance Using Confidence Intervals and P Values

Probability and the Normal Curve

The use of statistical testing is meant to quantify the probability of getting the observed results if
they were due solely to random variation, also referred to as chance. It does so by comparing
observed outcomes with theoretical outcomes that would be expected because of random
variation. An illustrative way to think about this is a coin toss experiment. What is the probability
of obtaining nine heads and one tails after tossing a coin 10 times? We can calculate this
probability with available formulas, and we can conduct an experiment. After comparing the
observed result in the experiment to the calculated probability (the expected result), we can
make a statement about the chance of the coin being a fair or biased coin.

The experiment consists of tossing a coin 10 times and observing the results (the proportion of
heads compared with tails). Every 10 coin tosses is a trial, and every trial will not result in the
same proportion of heads to tails. But the more trials we conduct, the more certain we become of what the proportion of heads to tails truly is. After plotting the outcome of each trial, for
example as a histogram, a certain pattern will emerge. The pattern of values of a measured
quantity, in this case the outcome of each 10 coin tosses, is in statistics called a distribution. The
normal distribution, or normal sample distribution curve, is the shape most commonly used to
model expected probabilities under the null hypothesis (Figure 12.3). (The word “normal,” when used to describe this curve, does not have the same meaning as in everyday language; in statistics it simply denotes this particular distribution.) The normal curve is used both in
inferential and descriptive statistics and is also referred to as the bell-shaped or Gaussian curve.
The area under the normal curve between different data points corresponds to a probability that
can be calculated or obtained from tables found in most textbooks on statistics (Figure 12.3).
Thus, by plotting our observed results from our trials of 10 coin tosses, we can calculate the
probability of the result we observed and then compare our observed probability with our
expected probability. For example, if our expected probability of obtaining nine heads and one
tails after 10 coin tosses is 1%, and our observed probability of getting nine heads and one tails
after having conducted our trials of 10 coin tosses was 4.5%, we need to determine if the
difference between our expected result and our observed result could inform whether we tossed
a fair or biased coin.
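The expected probability of nine heads in 10 tosses of a fair coin can be computed directly from the binomial distribution (SciPy assumed available):

    from scipy.stats import binom

    # Probability of exactly 9 heads in 10 tosses of a fair coin.
    p_nine_heads = binom.pmf(9, n=10, p=0.5)
    print(f"{p_nine_heads:.3%}")  # ~0.977%, roughly the 1% cited above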

Figure 12.3. Normal Sample Distribution Curve with Standard Deviations and Areas Under
the Curve

The mean in a normal sample distribution is usually the midpoint of the X-axis. In the examples above, the mean is 0. The standard deviation (SD) is evenly distributed around the mean.
The area under the curve between –1 SD and +1 SD is always 68.26%, the area between –2 SDs and +2 SDs is
always 95.44%, and the area between –3 SDs and +3 SDs is always 99.72%.
Conversely, an area of 99% under the curve corresponds to 2.58 SDs; an area of 95% under the curve
corresponds to 1.96 SDs; and an area of 90% under the curve corresponds to 1.65 SDs.
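These areas can be verified from the standard normal cumulative distribution function; the small sketch below reproduces the figures above up to rounding (SciPy assumed available):

    from scipy.stats import norm

    # Area under the standard normal curve between -k and +k SDs.
    for k in (1, 1.96, 2, 2.58, 3):
        area = norm.cdf(k) - norm.cdf(-k)
        print(f"+/-{k} SD: {area:.2%}")
    # +/-1 SD: 68.27%, +/-1.96 SD: 95.00%, +/-2 SD: 95.45%,
    # +/-2.58 SD: 99.01%, +/-3 SD: 99.73%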

Figure 12.4. The Observed Result of a Study of Marginal Bone Loss after Placement of
Dental Implants

See Box 12.3 for more explanations.


The mean marginal bone loss (MBL) in this study was 0.88 mm, with a standard error of the mean (SEM) of 0.12
mm. The margin of error (ME) is ±0.24 mm associated with a 95% confidence interval (CI). Thus, the 95% CI
constructed around the observed mean is 0.88 ± 0.24 mm, written as mean 0.88; 95% CI, 0.64 to 1.12.

One of the important attributes of the normal curve is that it can be described by two parameters,
the mean and standard deviation (SD) (Figures 12.3 and 12.4). The SD is a measure of the
variability of the observations or a measure of the spread of the data around the mean of a
sample. Although there are an infinite number of “normal curves” depending on the mean and
SD, all have the same property: The calculated area under the curve is proportional to a
probability. Any observation that follows a normal distribution can be located on the normal curve
based on the number of SDs it is from the mean, or center, of the curve. All normal curves have
another shared characteristic: 68% of the area under the curve lies between the mean ± 1 SD,
95% of the area under the curve lies between the mean ± 2 SD, and 2.5% of the area under the
curve lies in each respective tail of the curve beyond the mean ± 2 SDs (Figure 12.3). The area
under the curve in the tails is the probability of an outcome as extreme as or more extreme than
that observed if the null hypothesis is true. For example, if an alpha level of 5%, or 0.05, is
chosen, any observation that lies beyond 95% of the curve (that is, 2.5% under each tail of the curve, which is interpreted as less than a 5% chance of obtaining our result if the null hypothesis is “true”) will be considered statistically significant, and the null hypothesis will be rejected.

Standard Deviation and Standard Error


If we could repeat an experiment many, many times on different samples with the same number
of subjects, the resultant sample statistic would not always be the same (because of variability in
the samples, or dispersion, or chance). Two measures of sampling variability are the standard
deviation (SD) and the standard error (SE). “Deviation” refers to the difference between the
observed values and the estimated values, while the connotation of “error” is variability. It is
important to understand the distinction between these two measures.

The SD is a measure of variation used with interval or ratio data. Interval data have no true zero,
such as in the case of measures of temperature in a Fahrenheit or Celsius scale. As there is no
true zero, it is not possible to say that 25°F is half as hot as 50°F. Ratio data have a true zero,
such as in the case of weight, where 50 lb is twice as heavy as 25 lb. The SD is a measure of the variability of the observations in a given study sample, that is, the spread of the observations around the study sample mean. In a given study, the SD indicates, on average,
how much the individual observations differ from the study mean. The SD also defines the shape
of the normal curve; a larger SD indicates more scatter about the mean and indicates worse
precision, while a smaller SD indicates less scatter about the mean and better precision. This is
reflected in the width of the normal curve, which increases with a larger SD and decreases with a
smaller SD.

The SE refers to the mean of a study sample and quantifies the variation of the sample mean
compared with the population mean. It indicates how reliable the study sample mean is as an
estimate of the true population mean. Because the SE is calculated by taking the SD and dividing
it by the square root of the study sample size (SE = SD/√(sample size)), the SE is always smaller
than the SD. Thus, the SE depends on the study sample size (n) and the variability in the
population from which the study sample was drawn (SD). A larger sample size will decrease the
SE, while a population with a lot of individual variability for the given measurement of interest will
increase the SE. The size of the SE gives an indication of the precision of the parameter
estimate (an estimate of a population variable) and is the primary basis for calculating the CI
used in statistical inference. To summarize, the SD quantifies the variability of the observations in
a study sample, while the SE quantifies the variability of the sample mean from the true
population mean. It is not correct to report the SE to indicate the variability of observations in a
given study, although some investigators find it tempting as the SE is always smaller than the
SD.
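A minimal sketch of the distinction, with invented observations (NumPy assumed available):

    import numpy as np

    # SD describes the spread of the observations; SE describes the
    # precision of the sample mean as an estimate of the population mean.
    sample = np.array([1.2, 0.9, 1.5, 0.7, 1.1, 1.3, 0.8, 1.0])
    sd = sample.std(ddof=1)          # sample standard deviation
    se = sd / np.sqrt(sample.size)   # standard error of the mean
    print(f"mean = {sample.mean():.2f}, SD = {sd:.2f}, SE = {se:.2f}")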

Sample Size Considerations


When a clinician is reading a research study, the level of power for the specific study is critical to
interpreting the results; the investigator should always indicate the power of the study in the
methods section. The level of power is the probability of finding a difference when one truly exists
(that is, the probability of a true-positive finding). The desired level of power is, by convention, at
least 80%. Increasing the level of power (for example, from 80% to 90%) for a treatment effect of
interest when mortality is the outcome (a quite infrequent event) will require a markedly larger
sample size. As the concepts of study power and sample size are directly related, sample size

EBSCOhost - printed on 11/15/2023 2:31 PM via UNIVERSITY OF ALBERTA LIBRARIES. All use subject to https://www.ebsco.com/terms-of-use 202
calculations are also referred to as power analysis. With a large enough sample, even a very
small effect can be found to be statistically significant, although it would not necessarily be
clinically relevant.

The calculated sample size is a function of the desired power of the study, which is directly
related to the designated beta level (Figure 12.2). The beta level is the probability of committing a
type II error (concluding that there is no association/difference when in fact there really is one, or
a false-negative result) (Figure 12.2). The chosen beta level is based on the investigator’s
degree of willingness to accept making a type II error. This is analogous to choosing an alpha
level based on the investigator’s degree of willingness to accept a type I error, or false-positive
result. Power is equal to 1-beta and is the probability of observing an effect in the study sample
that reflects a “true” effect (the effect found in the population, the true-positive finding) as large as
or larger than the observed effect. The convention is to set beta at 20%, which means the
investigator is accepting that there is a one in five chance of missing a true difference (that is, of
getting a false-negative result). A beta of 20% represents a power of 80% (an 80% probability of
finding a “true” effect), which is the minimum acceptable power level to consider that a study was
adequately powered. If it is particularly important to avoid a type I error (for example, making a
false-positive diagnosis of cancer that requires therapeutic choices with potentially dire side
effects), alpha can be set at a lower value such as 0.01; if, on the other hand, it is particularly
important to avoid a type II error (for example, missing a diagnosis of a disease that is highly
contagious and can affect a large number of people), beta can be set lower to, for example, 10%.

Determining the expected treatment effect a priori can be the most difficult aspect of sample size
planning. The desired treatment effect can be determined using data from other studies in the
literature that report a statistically significant and clinically meaningful effect or using pilot data to
make an informed estimate. In some instances, there are no data available to use, so the
investigator will use their best judgment and experience to identify a clinically meaningful effect
for a given outcome of interest. For example, if a study is designed to examine changes in the
caries rate after consuming SSBs, the researcher has to predetermine what level of change in
the caries rate would be relevant. The relevance may be decided upon by the cost of a potential
intervention. Although a change of 1% might not be considered important, any change of 10% or
more may be considered enough to decide to implement a specific caries-preventive
intervention. The smaller the magnitude of the expected treatment effect, the larger the sample
that is needed to show a statistically significant difference at alpha = 0.05 and power of 80%.
Thus, all other factors being the same, to find a 1% difference requires a larger sample size than
what would be required to find a 10% difference. When considering the expected magnitude of a
treatment effect, it is critical to pay attention to clinical importance.
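A standard closed-form approximation for comparing two proportions (a textbook formula, not one given in this chapter) illustrates how quickly the required sample grows as the expected difference shrinks; SciPy is assumed available for the normal quantiles:

    from scipy.stats import norm

    def n_per_group(p1, p2, alpha=0.05, power=0.80):
        # Approximate n per group for detecting a difference between
        # two proportions at the given alpha and power.
        z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
        z_beta = norm.ppf(power)           # 0.84 for 80% power
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

    print(round(n_per_group(0.20, 0.10)))  # ~196 per group for a 10-point difference
    print(round(n_per_group(0.20, 0.19)))  # ~24,640 per group for a 1-point difference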

It is critical that, for each study, the elements used to conduct the sample size calculation are
clearly specified and the power of the study is at least 80%. A study that has less than 80%
power to detect the desired or observed effect is underpowered, and the results are not
considered reliable. From a statistical standpoint, a study that does not find statistical
significance is referred to as a “negative study.” It is unclear if the nonstatistically significant
result is valid or just the result of a lack of power due to an inadequate sample size. In this case,
money and time (the investigator’s and participants’) have been wasted.

Why Is Sample Size Important?


An adequate sample size is essential to producing informative study results. The larger the
sample size, the smaller the magnitude of a treatment effect that can be detected as statistically
significant. For a given study, there should be sufficient power (at least 80% power) to detect a
difference if there really is a difference. Studies that do not have adequate samples to attain at
least 80% power are considered underpowered and may produce results that lack validity from a
study design perspective and precision from a statistical perspective. For example, if there is an
underpowered study of the efficacy of drug A compared with drug B and the results indicate that
the effect of drug A is not statistically significantly different from the effect of drug B, the question
is whether that is due merely to the fact that the study sample was not large enough to detect a
difference. Did the investigator commit a type II error? Did the investigator conclude there was no
difference and fail to reject the null hypothesis when in fact there was a real difference (Figure
12.2)? Increasing the sample size makes the hypothesis testing more sensitive by making it
more likely that the null hypothesis will be rejected when, in fact, it is false. The sample size also
impacts the magnitude of the SE and the width of the CI, two essential estimates in research for
making statistical inferences (the larger the sample size, the smaller the SE and the greater
precision, which in turn produces a narrower CI).
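A small numerical sketch in Python makes the last point concrete; the SD of 1.5 is a hypothetical value chosen only for illustration.

```python
# A minimal sketch of how sample size drives the standard error (SE) and
# the width of a 95% CI; the SD of 1.5 is a hypothetical value.
import math

sd = 1.5
for n in (25, 100, 400):
    se = sd / math.sqrt(n)     # the SE shrinks with the square root of n
    width = 2 * 1.96 * se      # full width of the 95% CI around the mean
    print(f"n = {n:>3}: SE = {se:.3f}, 95% CI width = {width:.3f}")

# Output:
# n =  25: SE = 0.300, 95% CI width = 1.176
# n = 100: SE = 0.150, 95% CI width = 0.588
# n = 400: SE = 0.075, 95% CI width = 0.294
```

Because the SE shrinks with the square root of the sample size, each quadrupling of n halves the SE and, with it, the width of the CI.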

Box 12.2. Confidence Interval Constructed around a Mean


Knowing how a confidence interval (CI) is constructed is helpful for the interpretation. Let’s suppose we
want to know the average marginal bone loss (MBL) around single implants placed by every dentist in
New York City. There is no way we can measure the bone loss of all the placed implants. Instead, we
measure bone loss around a random sample of implants (for example, a sample size of 100). We find
that the mean bone loss is 0.88 mm and that the standard error of the mean (SEM) is 0.12 mm.
The empirical rule states that for a normal sampling distribution, 95% of the possible sample means will
fall within two (or more accurately 1.96) SEMs of the population mean. In other words, the length of the
interval that contains this 95% is 1.96 SEMs above and 1.96 SEMs below the population mean (Figure
12.4); ±1.96 SEM is also known as the margin of error (ME). The value 1.96 is sometimes called the
“confidence coefficient” for the 95% CI; the confidence coefficient for the 90% CI is 1.65; the
confidence coefficient for the 99% CI is 2.58. We can now use the “length” of this interval to construct a
CI. If we construct an interval around any given sample mean, will the upper and lower bound include
the population mean? It will do so 95% of the time. We can “move” this interval along the number line
(the X axis), and wherever we place it, we are confident that 95% of the time this interval will contain
the population mean; 2.5% of the time, this interval will have a sample mean that is more than 1.96
SEMs above the population mean (and thus not contain the population mean), and 2.5% of the time,
this interval will have a sample mean that is more than 1.96 SEMs below the population mean (and
thus not contain the population mean). We can therefore state that we are 95% confident that the
population mean is between the lower and upper bound of the interval.
The ME is the amount of error we expect between the population mean or parameter of interest that
we want to infer and the study sample estimate of that mean or parameter of interest. If we are
calculating a 95% CI, we know from the discussion above that we have to multiply our SEM by 1.96
to get the ME for a 95% CI. In our example above, we will get 1.96 x SEM = 1.96 x 0.12 = 0.24 (the
ME) (Figure 12.4). As this is the distance above and below the mean, the upper boundary of the CI will
be 0.88 + 0.24 = 1.12, and the lower boundary will be 0.88 – 0.24 = 0.64. This can be expressed as
"mean 0.88; 95% CI, 0.64 to 1.12." We can now state that "we are 95% confident that the mean MBL
for all single implants placed by dentists in NYC is between 0.64 and 1.12 mm.” It is up to the clinician
to decide if a bone loss between 0.64 and 1.12 mm is clinically relevant.
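For readers who wish to verify the arithmetic, a minimal sketch in Python reproduces the Box 12.2 calculation; scipy is used only to obtain the 1.96 confidence coefficient.

```python
# A minimal sketch reproducing the Box 12.2 arithmetic.
from scipy.stats import norm

mean, sem = 0.88, 0.12   # sample mean MBL (mm) and its standard error
z = norm.ppf(0.975)      # ≈ 1.96, the confidence coefficient for a 95% CI
me = z * sem             # margin of error ≈ 0.24 mm
lower, upper = mean - me, mean + me
print(f"mean {mean:.2f} mm; 95% CI, {lower:.2f} to {upper:.2f} mm")

# Output: mean 0.88 mm; 95% CI, 0.64 to 1.12 mm
```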

Box 12.3. Significance and Clinical Implication


Does providing fluoride to children over a period of two years reduce the decayed, missing, and filled
teeth (DMFT) index?
We performed a study in which 100 children received fluoride (the treatment group, or group A) and
100 children did not receive fluoride (the comparison group, or group B). All children were randomly
assigned (allocated) to group A or group B, had the same initial DMFT score, and had the same risk for
developing caries at the start of the study.
Our study results showed that, after two years, children in the treatment group had an added mean
DMFT score of 3.50, with a standard deviation (SD) of 1.40, while the control group had an added
mean DMFT score of 5.10, with an SD of 1.60. Given these results, could we infer what the “true”
mean change of the DMFT score would be if we provided the entire population of children with fluoride
for two years? This is not possible, but we could calculate a 95% confidence interval (CI) around the
mean difference (MD) between the DMFT scores, which would allow us to state that "we are 95%
confident that this interval contains the true population mean difference in DMFT."
Our study showed an MD of 1.60 (5.10 − 3.50) in the added DMFT score between the groups.
As we are assuming a normal sampling distribution, we can calculate, or look up in a table, the number
of SDs that corresponds to 95%. As alluded to earlier, 95% of the values lie within 1.96 SDs of the mean. Our margin of
error (ME) can now be calculated, ±1.96 x 0.21 (where 0.21 is the “combined” standard error of the
mean calculated from both samples), which equals ±0.41. Thus, our 95% CI constructed around the
MD in our example is 1.60 ± 0.41. This can be expressed as mean difference 1.60; 95% CI, 1.19 to
2.01.
We can now state that we are 95% confident that if we provided fluoride to the entire population of
children for two years, the mean added DMFT score would be reduced by between 1.19 and 2.01. As
this interval does not include 0, our result is statistically significant, but a
clinician must determine whether this result is clinically important. For example, our result would not be
clinically important (or clinically significant) if a clinician determined that only an MD of at least 2.50
would be clinically important.
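The same calculation can be sketched in a few lines of Python. Note that the box rounds the combined standard error to 0.21 before multiplying, so the unrounded computation below differs in the last decimal.

```python
# A minimal sketch reproducing the Box 12.3 arithmetic.
import math

mean_a, sd_a, n_a = 3.50, 1.40, 100  # treatment group (fluoride)
mean_b, sd_b, n_b = 5.10, 1.60, 100  # comparison group
md = mean_b - mean_a                             # mean difference = 1.60
se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)    # combined SE ≈ 0.213
me = 1.96 * se                                   # margin of error ≈ 0.42
print(f"MD {md:.2f}; 95% CI, {md - me:.2f} to {md + me:.2f}")

# Output: MD 1.60; 95% CI, 1.18 to 2.02
# (the box rounds the SE to 0.21 before multiplying, giving 1.19 to 2.01)
```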

References
1. Glick M, Carrasco-Labra A. Misinterpretations, mistakes, or just misbehaving. J Am Dent Assoc 2019;150(4):237-239.
2. Glick M, Greenberg BL. The need for scientific literacy. J Am Dent Assoc 2017;148(8):543-545.
3. Greenberg BL, Kantor ML. The clinician's guide to the literature. J Am Dent Assoc 2009;140(1):48-54.
4. Grimes DA, Schulz KF. An overview of clinical research: the lay of the land. Lancet 2002;359(9300):57-61.
5. Sedgwick P. What are odds? BMJ 2012;344:e2853.
6. Best AM, Greenberg BL, Glick M. From tea tasting to t test: a P value ain't what you think it is. J Am Dent Assoc 2016;147(7):527-529.

Chapter 13. Issues of Bias and Confounding in
Clinical Studies
Elliot Abt, D.D.S., M.S., M.Sc.; Jaana Gold, D.D.S., M.P.H., Ph.D., C.P.H.; and Julie Frantsve-Hawley,
Ph.D., C.A.E.

In This Chapter:
Confounding
• Control of Confounding
Bias
• Bias in Therapy Studies
• Bias and Prognostic Studies
• Bias in Diagnostic Test Studies
Conclusion

Introduction
Bias and confounding are two phenomena that can distort the results of a study, thus lowering
validity (internal validity) and applicability (external validity). Bias is a systematic error in the
design, conduct, or data analysis that leads to an incorrect assessment of the true effect of an
exposure (or intervention) on an outcome.1 Confounding, on the other hand, is the presence of a
third factor that can alter the association between an exposure and an outcome.

Investigators may draw wrong conclusions about the beneficial or harmful effects of a tested
treatment, and it is important for clinicians to understand how bias can impact study results.2
Bias can be intentional, which is considered unethical and should never occur, or unintentional, as a result of poor methodology.

It is important to note that one cannot assess the absolute impact that bias has on a study.
However, one can and should assess the potential or the risk that bias could have impacted
results and conclusions. Bias can also cause associations to be either larger (overestimation) or
smaller (underestimation) than the true associations.3 Often, little can be done once bias has
occurred, as there are no statistical tests that can control for bias; it can, however, be minimized
when a study is carefully designed and conducted.4 Potential sources of bias differ among study
designs. Confounding, in contrast, can be minimized in the design and/or analysis phase of a study.

Specific concerns about confounding and the different types of biases found in clinical trials,

