
MODULE 1

Lesson 1: THE FIELD OF STATISTICS

Statistics

 Set of mathematical procedures for organizing, summarizing, and interpreting information


 Serves 2 purposes:
 Used to organize and summarize the information so that the researcher can see what happened in the
research study and can communicate the results to others.
 Help the researcher to answer the questions that initiated the research by determining exactly what
general conclusions are justified based on the specific results that were obtained.
 Ensure that the information or observations are presented and interpreted in an accurate and informative way –
bring order out of chaos
 Provide researchers with a standardized set of techniques that are recognized and understood throughout the
scientific community

The field of statistics: the study and use of theory and methods for the analysis of data arising from random processes
or phenomena. The study of how we make sense of data.

 The field of statistics provides some of the most fundamental tools and techniques of the scientific method:
 Forming hypotheses
 Designing experiments and observational studies
 Gathering data
 Summarizing data
 Drawing inferences from data (testing hypotheses)

A statistic also refers to a numerical quantity computed from sample data (mean, median, maximum).

The field of statistics can be divided into:

1. Mathematical statistics: the study and development of statistical theory and methods in the abstract.
2. Applied statistics: the application of statistical methods to solve real problems involving randomly gathered
data, and the development of new statistical methodology motivated by real problems.

Biostatistics

 The branch of applied statistics directed toward applications in the health sciences and biology
 Biostatistics is sometimes distinguished from the field of biometry based upon whether applications are in the
health sciences (biostatistics) or in broader biology (biometry; agriculture, ecology, wildlife biology)
 Focus is on human life and health; thus, the areas of application relate to pharmacology, medicine,
epidemiology, public health, anatomy & physiology, and genetics.
 Other branches of applied statistics include psychometrics, econometrics, chemometrics, astrostatistics, and
environmetrics.

Population

 The set of all measurements of interest to a researcher.


 The population itself is not observed; it can be thought of as existing or conceptual.
 Existing populations are well-defined sets of data containing elements that could be identified explicitly.
 Conceptual populations are non-existing, yet visualized or imaginable, sets of measurements. These could
be thought of as the characteristics of all people with a disease. They could also be thought of as the
outcomes if some treatment were given to a large group of subjects. In this last setting, we do not give the
treatment to all subjects, but we are interested in the outcomes if it had been given to all of them.

Sample

 Set of individuals selected from a population, intended to represent the population in a research study.
 Observed set of measurements that are subsets of a corresponding population.
 Used to describe and make inferences concerning the populations from which they arise.
 Ideally, the sample is representative of the population of interest.

Parameter

 A value that describes a population


 Derived from measurements of the individuals in the population

Statistic

 A value that describes a sample


 Derived from measurements of the individuals in the sample

Lesson 2: TYPES OF VARIABLES

Variable(s)

 Characteristic or condition that changes or has different values for different individuals
 Characteristics that differ from one individual to another (height, weight, gender, or personality)
 Can be environmental conditions that change such as temperature, time of day, or the size of the room in which
the research is being conducted.

Types of Variable:

1. Qualitative variable – non-numeric in nature


2. Quantitative variable – can assume values and can be classified into two groups: discrete variable and
continuous variable
 Discrete variable – variables having only integer values. For example, the number of trials needed by a
student to learn a memorization task.
 Continuous variable – a variable that is not restricted to particular values (e.g., reaction time, IQ). Equal-
size intervals on different parts of the scale are assumed. Synonym for interval variable.

Examples of Types of Data

Quantitative
 Continuous: blood pressure; height, weight, age
 Discrete: no. of children; no. of attacks of asthma per week

Categorical
 Ordinal (ordered categories): grade of breast cancer; better, same, worse; disagree, neutral, agree
 Nominal (unordered categories): sex (male or female); alive or dead; blood group (O, A, B, AB)
Characteristic questions

Scale – Characteristic question – Examples
Nominal – Is A different from B? – Marital status, eye color, gender/sex, religious affiliation, race
Ordinal – Is A bigger than B? – Stage of disease, severity of pain, level of satisfaction
Interval – By how many units do A and B differ? – Temperature, SAT score
Ratio – How many times bigger than B is A? – Distance, length, time until death, weight

Other types of variables:

Binary variable – observations (i.e., dependent variables) that occur in one of two possible states, often labelled zero and
one, e.g., “improved/not improved” and “completed task/failed to complete task”. Synonym for dichotomous variable.

Categorical variable – usually an independent or predictor variable that contains values indicating membership in one of
several possible categories, e.g., gender and marital status. The categories are often assigned numerical values used as
labels, e.g., 0 = male; 1 = female. Synonym for nominal variable.

Confounding variable - a variable that obscures the effects of another variable.

Control variable – an extraneous variable that an investigator does not wish to examine in a study. Also called a
covariate.

Criterion variable – the presumed effect in a nonexperimental study. AKA outcome variable

Predictor variable – the presumed cause in a nonexperimental study. Often used in correlational studies.

Dependent variable – the presumed effect in an experimental study. The values of the dependent variable depend upon
another variable, the independent variable. Should not be used when writing about nonexperimental designs.

Dummy variable – created by recoding categorical variables that have more than two categories into a series of binary
variables. Used in regression analysis to avoid the unreasonable assumption that the original numerical codes for the
categories correspond to an interval scale.

Endogenous variable – an inherent part of the system being studied, whose value is determined within the system; by
contrast, the system says nothing about the values of its exogenous variables, which are determined outside it.

Independent variable – the presumed cause in an experimental study. All other variables that may impact the
dependent variable are controlled. The values of the independent variable are under experimenter control. Strictly
speaking, “independent variable” should not be used when writing about nonexperimental designs. Synonym for
treatment variable and manipulated variable.

Intervening variable – variable that explains a relation or provides a causal link between other variables. AKA mediating
variable or intermediary variable.

Latent variable – underlying variable that cannot be observed. It is hypothesized to exist in order to explain other
variables, such as specific behavior, that can be observed.

Manifest variable – observed variable assumed to indicate the presence of a latent variable. AKA indicator variable.
The latent variable itself cannot be observed directly, but one can look at indicators such as vocabulary size, success in
one’s occupation, IQ test score, or ability to play complicated games.

Moderating variable – influences or moderates the relation between two other variables and thus produces an
interaction effect.

Ordinal variable – used to rank a sample of individuals with respect to some characteristic, but differences between
different points of the scale are not necessarily equivalent.

Polychotomous variable – can have two or more possible values. This includes all binary variables. The usual reference
is to categorical variables with more than two categories.

Lesson 3: SOURCES OF DATA

Data – measurements or observations

Data set – a collection of measurements or observations

Datum – a single measurement or observation, commonly called a score or raw score

Sources of data:
1. Primary data
2. Secondary data

Primary data

 Firsthand data or raw data


 Data originated for the first time by the researcher through direct efforts and experience, specifically for the
purpose of addressing the research problem
 Quite expensive
 Data collection is under control and supervision of the investigator
 Can be collected through surveys, observations, physical testing, mailed questionnaires, interviews, case
studies, etc.

Advantages:

 Data are original and relevant to the topic


 Can be collected in a number of ways, like through surveys, observations, physical testing, mailed
questionnaires, interviews, case studies, emails, posts, etc.
 Can include large populations and wide geographical coverage.
 Primary data are current and can give a realistic view to the researcher
 Reliability and accuracy of the data are very high

Disadvantages:

 In interviews, coverage is limited, and for wider coverage a greater number of researchers is required.
 A lot of time and effort is required.
 Some respondents do not give timely responses, and responses may be faked to cover up realities.
 There is no control over the data collection.
 Incomplete questionnaires always give a negative impact on research.
 Trained persons are required for data collection.

Secondary data

 Second-hand information which is already collected and recorded by any person other than the user, for a
purpose not related to the current research problem.
 Readily available; collected from various sources like censuses, government publications, internal records of
the organization, reports, books, journal articles, and websites.

Advantages:

 It is cheaper and faster to access


 Provides a way to access the work of the best scholars
 Gives a frame of mind to the researcher
 Saves time, effort, and money

Disadvantages:

 Data collected by a third party may not be reliable, so the reliability and accuracy of the data go down
 Data collected in one location may not be suitable for another due to variable environmental factors
 With the passage of time, the data become obsolete
 It can distort the results of the research
 Raises issues of authenticity and copyright

Key Difference Between Primary and Secondary Data

Basis for comparison – primary data vs. secondary data:

Meaning – Primary: firsthand data gathered by the researcher himself. Secondary: already existing data, collected by an investigator earlier.
Data – Primary: real-time data. Secondary: past data.
Process – Primary: very involved. Secondary: rapid and easy.
Source – Primary: surveys, observations, experiments, questionnaires, personal interviews, etc. Secondary: government publications, websites, books, journal articles, internal records, etc.
Cost effectiveness – Primary: expensive; requires a large amount of time, cost, and manpower. Secondary: economical; relatively inexpensive and quickly available.
Collection time – Primary: long. Secondary: short.
Specificity – Primary: always specific to the researcher’s needs. Secondary: may or may not be specific to the researcher’s needs.
Available in – Primary: crude or raw form. Secondary: refined form.
Accuracy and reliability – Primary: more reliable and accurate. Secondary: relatively less.
Purpose – Primary: collected for addressing the problem at hand. Secondary: collected for purposes other than the problem at hand.

Lesson 4: OVERVIEW OF METHODS

Two types of Statistical Method:


1. Descriptive statistics
2. Inferential statistics

Descriptive statistics vs. inferential statistics:

1. Descriptive: concerned with describing the target population. Inferential: makes inferences from the sample and generalizes them to the population.
2. Descriptive: organizes, analyzes, and presents the data in a meaningful manner. Inferential: compares, tests, and predicts future outcomes.
3. Descriptive: final results are shown in the form of charts, tables, and graphs. Inferential: final result is a probability score.
4. Descriptive: describes data that are already known. Inferential: tries to draw conclusions about the population beyond the available data.
5. Descriptive tools: measures of central tendency (mean, median, mode) and spread of data (range, standard deviation, etc.). Inferential tools: hypothesis tests, analysis of variance, etc.
MODULE 2

Lesson 1: INTRODUCTION TO DESCRIPTIVE STATISTICS

Descriptive Statistics
 Statistical procedures used to summarize, organize, and simplify data
 Do not allow us to draw conclusions beyond the data we have analyzed or reach conclusions regarding any
hypotheses we might have made
 Enable us to represent the data in a more meaningful way, which allows simpler interpretation of the data

4 Major Types of Descriptive Statistics:

1. Measures of Frequency
 Count, frequency, percent
 Shows how often something occurs
 Use this when you want to show how often a response is given
2. Measures of Central Tendency
 Mean, median, mode
 Locates the distribution by various points
 Use this when you want to show an average or the most commonly indicated response
3. Measures of Dispersion or Variation
 Range, variance, standard deviation
 Identifies the spread of scores by stating intervals
 Variance or Standard Deviation = difference between observed score and mean
 Use this when you want to show how spread out the data are. It is helpful when your data are so spread
out that it affects the mean
4. Measures of Position
 Percentile ranks, quartile ranks
 Describes how scores fall in relation to one another. Relies on standardized scores
 Use this when you need to compare scores to a normalized score (e.g. a national norm)

Lesson 2: MEASURES OF CENTRAL TENDENCY

Central Tendency
 Statistical measure to determine a single score that defines the center of the distribution.
 The goal is to find the single score that is most typical or most representative of the entire group
 Concept of an average or representative score
 Usually attempts to identify the “average” or “typical” individual
 Average value can be used to provide a simple description of an entire population
 Useful for making comparisons between groups of individuals or between sets of data
 There is no single, standard procedure for determining central tendency

Mean

 Arithmetic average
 Computed by adding all the scores in the distribution and dividing by the number of scores
 Mean of the population is identified by the Greek letter mu (µ)
 Mean for a sample is identified by M (standard notation in manuscripts and published research reports)
 Amount that each individual gets when the total is distributed equally
 Balance point of a distribution

Advantages:

 It is simple to understand and easy to calculate


 It is affected by the value of every item in the series
 It is rigidly defined
 It is capable of further algebraic treatment
 It is calculated value and not based on the position in the series

Disadvantages:

 It is affected by extreme items i.e. very small and very large items
 It can hardly be located by inspection
 In some cases, the arithmetic mean does not represent an actual item. For example, the average number of
patients admitted to a hospital may be 10.7 per day.
 The arithmetic mean is not suitable in extremely asymmetrical distributions

Median

 Goal: to locate the midpoint of the distribution
 Identified by the word “median”
 Definitions and computations for a sample and for a population are identical
Mode

 “customary fashion” or a ‘popular style”


 In a frequency distribution, the score or category that has the greatest frequency (f)
 No symbols
 Used to determine the typical or most frequent value or any scale of measurement, including the nominal scale

Example: Color of evacuated tubes used in the clinical laboratory for the month of January, sample of n = 100 tests.

Color of Evacuated Tubes    f
Yellow                       5
Blue                        16
Red                         42
Lavender                    18
Green                        7
Grey                        12

The mode is red (f = 42), the most frequently used tube color.
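To make the three measures concrete, here is a minimal sketch in Python (the asthma-attack scores are hypothetical; the tube colors reproduce the table above):

```python
import statistics

# Hypothetical quantitative sample: attacks of asthma per week for 7 patients
attacks = [2, 3, 3, 4, 5, 7, 11]
print(statistics.mean(attacks))    # arithmetic average: 35/7 = 5
print(statistics.median(attacks))  # midpoint of the ordered scores: 4
print(statistics.mode(attacks))    # most frequent score: 3

# Nominal data: only the mode is meaningful (tube colors from the table above)
tubes = ["yellow"] * 5 + ["blue"] * 16 + ["red"] * 42 + \
        ["lavender"] * 18 + ["green"] * 7 + ["grey"] * 12
print(statistics.mode(tubes))      # 'red'
```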
Selecting a Measure of Central Tendency

 When to use Median


- Extreme scores or skewed distributions
- Undetermined values
- Open-ended distributions
- Ordinal scales
 When to use Mode
- Nominal scales
- Discrete variables
- Describing shape

Lesson 3: MEASURES OF DISPERSION

Variability
 Provides a quantitative measure of the differences between scores in a distribution and describes the degree to
which the scores are spread out or clustered together

Purpose of Measuring Variability


 To obtain an objective measure of how the scores are spread out in a distribution

A good measure of variability serves two purposes:


 Variability describes the distribution. Specifically, it tells whether the scores are clustered close together or are
spread out over a large distance. Usually, variability is defined in terms of distance.
 Variability measures how well an individual score represents the entire distribution.

Range
 Distance covered by the scores in a distribution, from the smallest score to the largest score
 Probably the most obvious way to describe how spread out the scores are
 Disadvantage: as a measure of variability, it is determined by the two extreme values and ignores the
other scores in the distribution

Standard Deviation
 Most commonly used
 Most important measure of variability
 Uses the mean of the distribution as a reference point and measures variability by considering the distance
between each score and the mean
 Provides a measure of the standard, or average, distance from the mean, and describes whether the scores
are clustered closely around the mean or are widely scattered

Sample Variability and Degrees of Freedom

For a sample of n scores, the degrees of freedom, or df, for the sample variance are defined as df = n – 1. The degrees
of freedom determine the number of scores in the sample that are independent and free to vary.
To calculate sample variance (mean squared deviation), find the sum of the squared deviations (SS) and divide by the
number of scores that are free to vary. This number is n – 1 = df. Thus, the formula for sample variance is:

s² = SS / (n – 1), where SS = Σ(X – M)²
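As a check on the formula, a short sketch (hypothetical scores; Python’s statistics module is assumed, which already uses the n – 1 denominator):

```python
import statistics

scores = [4, 6, 7, 9, 9]                 # hypothetical sample, n = 5
M = statistics.mean(scores)              # 7.0
SS = sum((x - M) ** 2 for x in scores)   # sum of squared deviations: 18.0
print(SS / (len(scores) - 1))            # sample variance = SS/df = 18/4 = 4.5
print(statistics.variance(scores))       # same result from the library: 4.5
print(statistics.stdev(scores))          # sample standard deviation ~ 2.12
```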

Coefficient of Variation
In some cases, the variance of a variable changes with its mean.
For example, suppose we are measuring the weights of children of various ages.
 5 yr. old children (relatively light, on average)
 15 yr. old children (much heavier, on average)
Clearly, there is much more variability in the weights of 15 yr. olds, but a valid question to ask is “Do 15 yr. old
children’s weights have more variability relative to their average?”

To answer this, express the standard deviation relative to the mean using the coefficient of variation: CV = (standard deviation / mean) × 100%.
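A sketch of the comparison (all weights below are made-up values for illustration only):

```python
# Hypothetical summary statistics (kg) for the two age groups
mean_5, sd_5 = 18.0, 2.0      # 5-year-olds: lighter, smaller spread
mean_15, sd_15 = 55.0, 8.0    # 15-year-olds: heavier, larger spread

cv_5 = sd_5 / mean_5 * 100    # ~ 11.1 %
cv_15 = sd_15 / mean_15 * 100 # ~ 14.5 %
# With these numbers the 15-year-olds are more variable even relative to their mean
print(cv_5, cv_15)
```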

Interquartile Range (IQR)


 Measure that indicates the extent to which the central 50% of values within the dataset are dispersed. It is
based upon, and related to the median.
 It can be divided into quarters by identifying the upper and lower quartiles
 The lower quartile is found one quarter of the way along a dataset when the values have been arranged in order
of magnitude; the upper quartile is found three quarters of the way along the dataset
 The upper quartile lies halfway between the median and the highest value in the dataset whilst the lower
quartile lies halfway between the median and the lowest value in the dataset. The interquartile range is found
by subtracting the lower quartile from the upper quartile.
 Formula: IQR = Q3 – Q1
 By definition, half of the values fall within an interval whose width equal the IQR. If the data are more spread
out, then the IQR tends to increase, and vice versa.
 The IQR is a more robust measure of spread than the variance or standard deviation. Any number of values in
the top or bottom quarters of the data can be moved any distance from the median without affecting the IQR at
all. More practically, a few outliers have little or no effect on the IQR.
 For Gaussian distributions, the standard deviation approximately equals ¾ of the IQR. This means you can
approximate the SD from the IQR by calculating ¾ of the IQR.
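A minimal sketch using NumPy (assumed here; note that quartile conventions differ slightly between packages, so other software may return slightly different quartiles):

```python
import numpy as np

data = np.array([1, 3, 4, 6, 7, 8, 9, 12, 15, 40])  # hypothetical data, one outlier
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(q1, q3, iqr)       # 4.5, 11.25, 6.75 - the outlier (40) barely affects the IQR
print(0.75 * iqr)        # rough SD estimate, valid only for Gaussian-like data
print(data.std(ddof=1))  # actual sample SD (~11.2), inflated by the outlier
```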

Lesson 4: NORMAL DISTRIBUTION

The Normal Distribution aka Gaussian Distribution


 The most important distribution
 Describes well the distribution of random variables that arise in practice, such as the heights or weights of
people, the total annual sales of a firm, exam scores, etc. Also, it is important for the central limit theorem
and the approximation of other distributions such as the binomial.
 We say that a random variable X follows the normal distribution if the probability density function of X is
given by:

f(x) = (1 / (σ√(2π))) e^(–(x – µ)² / (2σ²)), for –∞ < x < ∞
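A sketch of this density in code (SciPy is assumed; scipy.stats.norm takes the mean as loc and the standard deviation as scale):

```python
import math
from scipy.stats import norm

mu, sigma = 100, 15   # hypothetical IQ-like scale
x = 115

# Density evaluated from the formula above...
pdf_manual = math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
# ...and by the library; the two agree
print(pdf_manual, norm.pdf(x, loc=mu, scale=sigma))

# Area under the curve within one SD of the mean: about 0.683
print(norm.cdf(mu + sigma, mu, sigma) - norm.cdf(mu - sigma, mu, sigma))
```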
Frequency Distributions
 An organized tabulation of the number of individuals located in each category on the scale of measurement
 Can be structured either as a table or a graph, but in either case, the distribution presents the same two
elements:
1. The set of categories that make up the original measurement scale
2. A record of the frequency, or number of individuals in each category
 Presents a picture of how the individual scores are distributed on the measurement scale

MODULE 3
Lesson 1: INTRODUCTION TO INFERENTIAL STATISTICS

Inferential Statistics
 Techniques that allow us to make inferences about a population based on data that we gather from a sample
 Study results will vary from sample to sample strictly due to random chance
 Inferential statistics allow us to determine how likely it is to obtain a set of results from a single sample
 Known as testing for statistical significance

1. p value and Power


 p value
- the product of hypothesis testing via various statistical tests and is claimed to be significant most
commonly when the value is 0.05 or less
- the value 0.05 is arbitrary; it is simply a convention amongst statisticians that this value is deemed the
cut-off level of significance
- demonstrates the probability of making a type I error, i.e., an error created by rejecting the null
hypothesis when it is actually true
- says nothing about the size of the effect or what the effect size is likely to be in the total population
 Power
- Can be anything from 0.80 (80%) to 0.99 (99%) depending on requirements
- Helps to avoid too many subjects being recruited for a study
- A drawback is that a-priori power calculations do not take variation of the data into account

2. Type I and Type II Errors


 Type I error
- AKA false positive
- The error of rejecting the null hypothesis when it is actually true
- The error of accepting an alternative hypothesis (the real hypothesis of interest) when the results can be
attributed to chance
- Occurs when we are observing a difference when in truth there is none. So the probability of making a
type I error in a test with rejection region R is P (R | H0 is true)
 Type II error
- AKA false negative
- The error of not rejecting the null hypothesis when the alternative hypothesis is the true state of nature
- The error of failing to accept an alternative hypothesis when you don’t have adequate power
- Occurs when we fail to observe a difference when in truth there is one. So the probability of
making a type II error in a test with rejection region R is 1 – P (R | Ha is true)
- The power of the test is then P (R | Ha is true)

3. Confidence interval and standard error


 Confidence interval
- A measure of the researchers’ uncertainty in the sample statistic as an estimate of the population
parameter, if less than the whole population is studied
- Set at 95% by convention
- A 95% confidence interval is the estimated range of values within which it is 95% likely that the
precise or true population effect lies
- A pivotal tool in evidence-based practice, because it allows study results to be extrapolated to the
relevant population
- In the calculation of this, three elements are considered:
 The standard error – whilst standard errors shrink with increasing sample size, the researcher
should be seeking to reach an optimal sample size, rather than the maximal size. Testing more
subjects than required in a clinical trial may not be ethical and would be a waste of money and
resources.
 The mean and the variability (i.e., standard deviation) of the effect being studied – the less
variability in the sample, the more precise the estimate in the population and therefore the
narrower the range.
 The degree of confidence required – if a 99% confidence interval is desired, then the range will
have to be wider.
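A minimal sketch of a 95% confidence interval for a mean (normal approximation with a known σ; the numbers are the same hypothetical ones used in the hypothesis-testing example of Module 4):

```python
import math

M, sigma, n = 16.7, 2.4, 36       # hypothetical sample mean, population SD, sample size
se = sigma / math.sqrt(n)         # standard error = 0.4
z = 1.96                          # multiplier for 95% confidence (use 2.58 for 99%)
print((M - z * se, M + z * se))   # (15.916, 17.484)
```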
Lesson 2: CONCEPT OF PROBABILITY
Concept of Probability
 Chance behavior is unpredictable in the short run but has a regular and predictable pattern in the long run.
 The probability of any outcome of a random phenomenon is the proportion of time the outcome would occur in
a very long series of repetitions
Terminologies:
 Probability – the probability of an event is the relative frequency of this set of outcomes over an indefinitely large
number of trials
 Event – any set of outcomes of interest
 Sample space – set of all possible outcomes of a random phenomenon
 P (A) is the probability of event A
PROBABILITY RULES

KEY POINTS:
 Probability is a number that can be assigned to outcomes and events. It is always greater than or equal to
zero, and less than or equal to one
 The sum of the probabilities of all outcomes must equal 1
 If two events have no outcomes in common, the probability that one or the other occurs is the sum of their
individual probabilities
 The probability that an event does not occur is 1 minus the probability that the event does occur
 Two events A and B are independent if knowing that one occurs does not change the probability that the
other occurs
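These rules can be checked numerically; a sketch with a fair six-sided die (an assumed example, using exact fractions):

```python
from fractions import Fraction

P = {face: Fraction(1, 6) for face in range(1, 7)}   # fair die: each outcome 1/6
assert sum(P.values()) == 1                          # probabilities sum to 1

low = {1, 2}
five_or_six = {5, 6}                                 # no outcomes in common with 'low'
# Disjoint events: P(one or the other) is the sum of the individual probabilities
assert sum(P[f] for f in low | five_or_six) == Fraction(2, 6) + Fraction(2, 6)

even = {2, 4, 6}
# Complement rule: P(not even) = 1 - P(even)
assert 1 - sum(P[f] for f in even) == Fraction(1, 2)
```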

Lesson 3: PROBABILITY DISTRIBUTIONS

 Discrete Probability Distribution


AKA probability-mass function (pmf): a mathematical relationship or rule that assigns to any possible value r of a
discrete random variable X the probability P (X = r). This assignment is made for all values r that have positive probability.
Binomial Distribution
 Two possible outcomes: Success (S) and Failure (F)
 Repeat the situation n times
 The probability of success, p is constant on each trial
 The probability of failure at each trial is 1 – p = q
 The trials are independent

The Poisson Distribution


 Second most frequently used discrete distribution after the binomial distribution
 Usually associated with rare events
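A sketch of both pmfs (SciPy assumed; the parameters below are illustrative):

```python
from scipy.stats import binom, poisson

# Binomial: n = 10 independent trials, success probability p = 0.3 on each
print(binom.pmf(3, n=10, p=0.3))   # P(X = 3) ~ 0.267
print(binom.cdf(3, n=10, p=0.3))   # P(X <= 3) ~ 0.650

# Poisson: rare events occurring at an average rate of 2 per interval
print(poisson.pmf(0, mu=2))        # P(X = 0) = e^-2 ~ 0.135
```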

 Continuous Probability Distribution

Normal distribution
 Most widely used continuous distribution
 AKA Gaussian distribution (after Carl Friedrich Gauss)

The probability-density function (pdf) of a random variable X is a function such that the area under the density-function
curve between any two points a and b is equal to the probability that the random variable X falls between a and b. Thus,
the total area under the density-function curve over the entire range of possible values for the random variable is 1.
MODULE 4
Lesson 1: INTRODUCTION TO HYPOTHESIS TESTING

Statistical inference
 the act of generalizing from the data (“sample”) to a larger phenomenon (“population”) with a calculated degree
of certainty
 the act of generalizing and deriving statistical judgments is the process of inference

[Note: There is a distinction between causal inference and statistical inference. Here we consider only statistical inference.]

Two common forms of statistical inference are:
1. Estimation
2. Null hypothesis tests of significance (NHTS)

Hypothesis testing
 One of the most commonly used inferential procedures
 Statistical method that uses sample data to evaluate a hypothesis about a population

“The null hypothesis and the alternative hypothesis are mutually exclusive and exhaustive. They cannot both be true. The data will determine which one should be rejected.” (Gravetter and Wallnau, 2019)

In very simple terms, the logic underlying the hypothesis-testing procedure is as follows:
1. First, state a hypothesis about a population. Usually the hypothesis concerns the value of a population parameter.
2. Before selecting a sample, use the hypothesis to predict the characteristics that the sample should have.
3. Next, obtain a random sample from the population.
4. Finally, compare the obtained sample data with the prediction that was made from the hypothesis. If the sample
is consistent with the prediction, we conclude that the hypothesis is reasonable. But if there is a big
discrepancy between the data and the prediction, decide that the hypothesis is wrong.

The Four Steps of a Hypothesis Testing


1. State the hypothesis.
 State hypotheses about the unknown population, stated in terms of population parameters
 Null hypothesis: the first and most important of the two hypotheses
o States that the treatment has no effect
o Generally, it states that there is no change, no effect, no difference, no relationship – nothing
happened
o H0 (H stands for hypothesis, and the zero subscript indicates that this is the zero-effect
hypothesis)
o In the context of an experiment, it predicts that the independent variable (treatment) has no
effect on the dependent variable (scores) for the population
 Alternative hypothesis or scientific hypothesis: the second hypothesis and simply the opposite of the null
hypothesis
o States that the treatment has an effect on the dependent variable
o There is a change, a difference, or a relationship for the general population
o In the context of an experiment, it predicts that the independent variable (treatment) does have
an effect on the dependent variable
o Simply states that there will be some type of change. It does not specify whether the effect
will be an increase or a decrease. In some circumstances, it is appropriate for the alternative
hypothesis to specify the direction of the effect, resulting in a directional hypothesis.
2. Set the criteria for a decision
 We will use the data from the sample to evaluate the credibility of the null hypothesis. The data will
provide support for the null hypothesis or tend to refute the null hypothesis.
 We use the null hypothesis to predict the kind of sample mean that ought to be obtained. Specifically,
we determine exactly which sample means are consistent with the null hypothesis and which sample
means are at odds with the null hypothesis
 The distribution of sample means is then divided into sections:
1. Sample means that are likely to be obtained if H0 is true; that is, sample means that are close to
the null hypothesis
2. Sample means that are very unlikely to be obtained if Ho is true; that is, sample means that are
very different from the null hypothesis
 Alpha level: level of significance
o Alpha (α) value is a small probability that is used to identify the low-probability samples
o By convention, commonly used alpha levels are α = .05 (5%), α = .01 (1%), and α = .001 (0.1%)
o Critical region: extreme sample values that are very unlikely (as defined by the alpha level) to
be obtained if the null hypothesis is true. The boundaries for the critical region are determined
by the alpha level. If sample data fall in the critical region, the null hypothesis is rejected.
Reversing the point of view, it can also be defined as the region of sample values that provide
convincing evidence that the treatment really does have an effect.

 The boundaries for the Critical Region


o Use alpha-level probability and the unit normal table to determine the exact location for the boundaries
that define the critical region
o In most cases, the distribution of sample means is normal, and the unit normal table provides the precise z-
score location for the critical region boundaries

3. Collect data and compute sample statistics


o Raw data from the sample are summarized with the appropriate statistics: for example, the researcher
would compute the sample mean. Now it is possible for the researcher to compare the sample mean (the
data) with the null hypothesis. This is the heart of the hypothesis test: comparing the data with the
hypothesis.
o Comparison is accomplished by computing a z-score that describes exactly where the sample mean is
located relative to the hypothesized population mean from H0.
o z-score formula for a sample mean:

z = (M – µ) / σM, where σM = σ / √n is the standard error of the mean

In the formula, the value of the sample mean (M) is obtained from the sample data, and the value of µ is
obtained from the null hypothesis. Thus, the z-score formula can be expressed in words as:
z = (sample mean – hypothesized population mean) / standard error.
4. Make a decision
o Uses the z-score obtained to make a decision about the null hypothesis according to the criteria
established
o There are two possible outcomes:
1. The sample data are located in the critical region. By definition, a sample value in the critical region
is very unlikely to occur if the null hypothesis is true. Therefore, we conclude that the sample is not
consistent with H0 and our decision is to reject the null hypothesis.
2. The sample data are not located in the critical region. In this case, the data are reasonably consistent
with the null hypothesis, and our decision is to fail to reject H0.
For example: Suppose the sample produced a mean tip of M = 16.7 percent. The null hypothesis
states that the population mean is µ = 15.8 percent and, with n = 36 and σ = 2.4, the standard error
for the sample mean is σM = σ/√n = 2.4/6 = 0.4.

Thus, a sample mean of M = 16.7 produces a z-score of z = (M – µ)/σM = (16.7 – 15.8)/0.4 = 2.25.

With an alpha level of α = .05, this z-score is beyond the boundary of 1.96. Because the sample z-
score is in the critical region, we reject the null hypothesis and conclude that there is a treatment
effect.
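The example above can be reproduced in a few lines (a sketch; SciPy is assumed only for the critical-value lookup):

```python
import math
from scipy.stats import norm

mu0, M, sigma, n = 15.8, 16.7, 2.4, 36   # values from the example above
se = sigma / math.sqrt(n)                # standard error = 0.4
z = (M - mu0) / se                       # 2.25
critical = norm.ppf(1 - 0.05 / 2)        # two-tailed boundary ~ 1.96
print(z, critical, abs(z) > critical)    # 2.25 1.96 True -> reject H0
```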
Lesson 2: p-VALUE AND z SCORE

The p-value approach

p-value
 probability of obtaining a sample outcome, given that the value stated in the null hypothesis is true. The p value
for obtaining a sample outcome is compared to the level of significance.
 When the p value is less than 5% (p < .05), we reject the null hypothesis. We will refer to p < .05 as the criterion
for deciding to reject the null hypothesis, although note that when p = .05, the decision is also to reject the null
hypothesis. When the p value is greater than 5% (p > .05), we retain the null hypothesis.
 When the value is less than .05, we reach significance; the decision is to reject the null hypothesis.
 When the value is greater than .05, we fail to reach significance; the decision is to retain the null hypothesis.
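Continuing the sketch from Module 4, the two-tailed p value for z = 2.25 (SciPy assumed):

```python
from scipy.stats import norm

z = 2.25
p = 2 * norm.sf(z)   # two tails: area beyond +2.25 plus area beyond -2.25
print(p)             # ~ 0.024 < .05 -> reject the null hypothesis
```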

Common misinterpretations of single P values

1. The P value is the probability that the test hypothesis is true; for example, if a test of the null hypothesis gave
P = 0.01, the null hypothesis has only a 1 % chance of being true; if instead it gave P = 0.40, the null hypothesis
has a 40 % chance of being true.
No! The P value assumes the test hypothesis is true—it is not a hypothesis probability and may be far from any
reasonable probability for the test hypothesis. The P value simply indicates the degree to which the data
conform to the pattern predicted by the test hypothesis and all the other assumptions used in the test (the
underlying statistical model). Thus P = 0.01 would indicate that the data are not very close to what the statistical
model (including the test hypothesis) predicted they should be, while P = 0.40 would indicate that the data are
much closer to the model prediction, allowing for chance variation.

2. The P value for the null hypothesis is the probability that chance alone produced the observed association; for
example, if the P value for the null hypothesis is 0.08, there is an 8 % probability that chance alone produced
the association.
No! This is a common variation of the first fallacy, and it is just as false. To say that chance alone produced the
observed association is logically equivalent to asserting that every assumption used to compute the P value is
correct, including the null hypothesis. Thus, to claim that the null P value is the probability that chance alone
produced the observed association is completely backwards: The P value is a probability computed assuming
chance was operating alone. The absurdity of the common backwards interpretation might be appreciated by
pondering how the P value, which is a probability deduced from a set of assumptions (the statistical model), can
possibly refer to the probability of those assumptions.
Note: One often sees ‘‘alone’’ dropped from this description (becoming ‘‘the P value for the null
hypothesis is the probability that chance produced the observed association’’), so that the statement is more
ambiguous, but just as wrong.

3. A significant test result (P ≤ 0.05) means that the test hypothesis is false or should be rejected.
No! A small P value simply flags the data as being unusual if all the assumptions used to compute it (including
the test hypothesis) were correct; it may be small because there was a large random error or because some
assumption other than the test hypothesis was violated (for example, the assumption that this P value was not
selected for presentation because it was below 0.05). P ≤ 0.05 only means that a discrepancy from the
hypothesis prediction (e.g., no difference between treatment groups) would be as large or larger than that
observed no more than 5 % of the time if only chance were creating the discrepancy (as opposed to a violation
of the test hypothesis or a mistaken assumption).

4. A nonsignificant test result (P > 0.05) means that the test hypothesis is true or should be accepted.
No! A large P value only suggests that the data are not unusual if all the assumptions used to compute the P
value (including the test hypothesis) were correct. The same data would also not be unusual under many other
hypotheses. Furthermore, even if the test hypothesis is wrong, the P value may be large because it was inflated
by a large random error or because of some other erroneous assumption (for example, the assumption that this
P value was not selected for presentation because it was above 0.05). P > 0.05 only means that a discrepancy
from the hypothesis prediction (e.g., no difference between treatment groups) would be as large or larger than
that observed more than 5 % of the time if only chance were creating the discrepancy.

5. A large P value is evidence in favor of the test hypothesis.


No! In fact, any P value less than 1 implies that the test hypothesis is not the hypothesis most compatible with
the data, because any other hypothesis with a larger P value would be even more compatible with the data. A P
value cannot be said to favor the test hypothesis except in relation to those hypotheses with smaller P values.
Furthermore, a large P value often indicates only that the data are incapable of discriminating among many
competing hypotheses (as would be seen immediately by examining the range of the confidence interval). For
example, many authors will misinterpret P = 0.70 from a test of the null hypothesis as evidence for no effect,
when in fact it indicates that, even though the null hypothesis is compatible with the data under the
assumptions used to compute the P value, it is not the hypothesis most compatible with the data—that honor
would belong to a hypothesis with P = 1. But even if P = 1, there will be many other hypotheses that are highly
consistent with the data, so that a definitive conclusion of ‘‘no association’’ cannot be deduced from a P value,
no matter how large.

6. A null-hypothesis P value greater than 0.05 means that no effect was observed, or that absence of an effect
was shown or demonstrated.
No! Observing P > 0.05 for the null hypothesis only means that the null is one among the many hypotheses that
have P > 0.05. Thus, unless the point estimate (observed association) equals the null value exactly, it is a mistake
to conclude from P > 0.05 that a study found ‘‘no association’’ or ‘‘no evidence’’ of an effect. If the null P value is
less than 1 some association must be present in the data, and one must look at the point estimate to determine
the effect size most compatible with the data under the assumed model.

7. Statistical significance indicates a scientifically or substantively important relation has been detected.
No! Especially when a study is large, very minor effects or small assumption violations can lead to statistically
significant tests of the null hypothesis. Again, a small null P value simply flags the data as being unusual if all the
assumptions used to compute it (including the null hypothesis) were correct; but the way the data are unusual
might be of no clinical interest. One must look at the confidence interval to determine which effect sizes of
scientific or other substantive (e.g., clinical) importance are relatively compatible with the data, given the model.

8. Lack of statistical significance indicates that the effect size is small.


No! Especially when a study is small, even large effects may be ‘‘drowned in noise’’ and thus fail to be detected
as statistically significant by a statistical test. A large null P value simply flags the data as not being unusual if all
the assumptions used to compute it (including the test hypothesis) were correct; but the same data will also not
be unusual under many other models and hypotheses besides the null. Again, one must look at the confidence
interval to determine whether it includes effect sizes of importance.

9. The P value is the chance of our data occurring if the test hypothesis is true; for example, P = 0.05 means that
the observed association would occur only 5 % of the time under the test hypothesis.
No! The P value refers not only to what we observed, but also observations more extreme than what we
observed (where ‘‘extremity’’ is measured in a particular way). And again, the P value refers to a data frequency
when all the assumptions used to compute it are correct.

10. If you reject the test hypothesis because P ≤ 0.05, the chance you are in error (the chance your ‘‘significant
finding’’ is a false positive) is 5 %.
No! To see why this description is false, suppose the test hypothesis is in fact true. Then, if you reject it, the
chance you are in error is 100 %, not 5 %. The 5 % refers only to how often you would reject it, and therefore be
in error, over very many uses of the test across different studies when the test hypothesis and all other
assumptions used for the test are true. It does not refer to your single use of the test, which may have been
thrown off by assumption violations as well as random errors. This is yet another version of misinterpretation
#1.
11. P = 0.05 and P ≤ 0.05 mean the same thing.
No! This is like saying reported height = 2 m and reported height ≤ 2 m are the same thing: ‘‘height = 2 m’’
would include few people and those people would be considered tall, whereas ‘‘height ≤ 2 m’’ would include
most people including small children. Similarly, P = 0.05 would be considered a borderline result in terms of
statistical significance, whereas P ≤ 0.05 lumps borderline results together with results very incompatible with
the model (e.g., P = 0.0001), thus rendering its meaning vague, for no good purpose.

12. P values are properly reported as inequalities (e.g., report ‘‘P < 0.02’’ when P = 0.015 or report ‘‘P > 0.05’’
when P = 0.06 or P = 0.70).
No! This is bad practice because it makes it difficult or impossible for the reader to accurately interpret the
statistical result. Only when the P value is very small (e.g., under 0.001) does an inequality become justifiable:
There is little practical difference among very small P values when the assumptions used to compute P values
are not known with enough certainty to justify such precision, and most methods for computing P values are not
numerically accurate below a certain point.

13. Statistical significance is a property of the phenomenon being studied, and thus statistical tests detect
significance.
No! This misinterpretation is promoted when researchers state that they have or have not found ‘‘evidence of’’
a statistically significant effect. The effect being tested either exists or does not exist. ‘‘Statistical significance’’ is
a dichotomous description of a P value (that it is below the chosen cut-off) and thus is a property of a result of a
statistical test; it is not a property of the effect or population being studied.

14. One should always use two-sided P values.


No! Two-sided P values are designed to test hypotheses that the targeted effect measure equals a specific value
(e.g., zero), and is neither above nor below this value. When, however, the test hypothesis of scientific or
practical interest is a one-sided (dividing) hypothesis, a one-sided P value is appropriate. For example, consider
the practical question of whether a new drug is at least as good as the standard drug for increasing survival time.
This question is one-sided, so testing this hypothesis calls for a one-sided P value. Nonetheless, because two-
sided P values are the usual default, it will be important to note when and why a one-sided P value is being used
instead.

Common misinterpretations of P value comparisons and predictions

Some of the most severe distortions of the scientific literature produced by statistical testing involve erroneous
comparison and synthesis of results from different studies or study subgroups. Among the worst are:

15. When the same hypothesis is tested in different studies and none or a minority of the tests are statistically
significant (all P > 0.05), the overall evidence supports the hypothesis.
No! This belief is often used to claim that a literature supports no effect when the opposite is the case. It reflects a
tendency of researchers to ‘‘overestimate the power of most research’’. In reality, every study could fail to reach
statistical significance and yet when combined show a statistically significant association and persuasive
evidence of an effect. For example, if there were five studies each with P = 0.10, none would be significant at
0.05 level; but when these P values are combined using the Fisher formula, the overall P value would be 0.01.
There are many real examples of persuasive evidence for important effects when few studies or even no study
reported ‘‘statistically significant’’ associations. Thus, lack of statistical significance of individual studies should
not be taken as implying that the totality of evidence supports no effect.
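The five-study example can be checked directly (a sketch; the Fisher formula combines k independent P values as -2 times the sum of their logs, referred to a chi-square distribution with 2k degrees of freedom; SciPy assumed):

```python
import math
from scipy.stats import chi2

p_values = [0.10] * 5                             # five studies, none significant alone
stat = -2 * sum(math.log(p) for p in p_values)    # ~ 23.03
combined_p = chi2.sf(stat, df=2 * len(p_values))  # chi-square with 10 df
print(combined_p)                                 # ~ 0.01 -> jointly significant
```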

16. When the same hypothesis is tested in two different populations and the resulting P values are on opposite
sides of 0.05, the results are conflicting.
No! Statistical tests are sensitive to many differences between study populations that are irrelevant to whether
their results are in agreement, such as the sizes of compared groups in each population. Consequently, two
studies may provide very different P values for the same test hypothesis and yet be in perfect agreement (e.g.,
may show identical observed associations). For example, suppose we had two randomized trials A and B of a
treatment, identical except that trial A had a known standard error of 2 for the mean difference between
treatment groups whereas trial B had a known standard error of 1 for the difference. If both trials observed a
difference between treatment groups of exactly 3, the usual normal test would produce P = 0.13 in A but P =
0.003 in B. Despite their difference in P values, the test of the hypothesis of no difference in effect across studies
would have P = 1, reflecting the perfect agreement of the observed mean differences from the studies.
Differences between results must be evaluated directly, for example by estimating and testing those
differences to produce a confidence interval and a P value comparing the results (often called analysis of
heterogeneity, interaction, or modification).

17. When the same hypothesis is tested in two different populations and the same P values are obtained, the
results are in agreement.
No! Again, tests are sensitive to many differences between populations that are irrelevant to whether their
results are in agreement. Two different studies may even exhibit identical P values for testing the same
hypothesis yet also exhibit clearly different observed associations. For example, suppose randomized
experiment A observed a mean difference between treatment groups of 3.00 with standard error 1.00, while B
observed a mean difference of 12.00 with standard error 4.00. Then the standard normal test would produce P =
0.003 in both; yet the test of the hypothesis of no difference in effect across studies gives P = 0.03, reflecting the
large difference (12.00 - 3.00 = 9.00) between the mean differences.
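Both numerical examples (misinterpretations 16 and 17) can be verified in a few lines (a sketch; two-sided normal tests, SciPy assumed):

```python
import math
from scipy.stats import norm

def two_sided_p(estimate, se):
    """Two-sided P value for a normal test of 'estimate = 0' with known SE."""
    return 2 * norm.sf(abs(estimate) / se)

# #16: same observed effect (3), different standard errors -> very different P values
print(two_sided_p(3, 2), two_sided_p(3, 1))        # ~ 0.13 and ~ 0.003
print(two_sided_p(3 - 3, math.sqrt(2**2 + 1**2)))  # heterogeneity test: P = 1.0

# #17: same P value, clearly different effects (3 vs 12)
print(two_sided_p(3, 1), two_sided_p(12, 4))       # both ~ 0.003
print(two_sided_p(12 - 3, math.sqrt(1**2 + 4**2))) # ~ 0.03: the effects differ
```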

18. If one observes a small P value, there is a good chance that the next study will produce a P value at least as
small for the same hypothesis.
No! This is false even under the ideal condition that both studies are independent and all assumptions including
the test hypothesis are correct in both studies. In that case, if (say) one observes P = 0.03, the chance that the
new study will show P ≤ 0.03 is only 3 %; thus the chance the new study will show a P value as small or smaller
(the ‘‘replication probability’’) is exactly the observed P value! If on the other hand the small P value arose solely
because the true effect exactly equaled its observed estimate, there would be a 50 % chance that a repeat
experiment of identical design would have a larger P value. In general, the size of the new P value will be
extremely sensitive to the study size and the extent to which the test hypothesis or other
assumptions are violated in the new study; in particular, P may be very small or very large depending on
whether the study and the violations are large or small.

Common misinterpretations of confidence intervals

Most of the above misinterpretations translate into an analogous misinterpretation for confidence intervals. For
example, another misinterpretation of P > 0.05 is that it means the test hypothesis has only a 5 % chance of being false,
which in terms of a confidence interval becomes the common fallacy:

19. The specific 95 % confidence interval presented by a study has a 95 % chance of containing the true effect size.
No! A reported confidence interval is a range between two numbers. The frequency with which an observed
interval (e.g., 0.72–2.88) contains the true effect is either 100 % if the true effect is within the interval or 0 % if
not; the 95 % refers only to how often 95 % confidence intervals computed from very many studies would
contain the true size if all the assumptions used to compute the intervals were correct. It is possible to compute
an interval that can be interpreted as having 95 % probability of containing the true value; nonetheless, such
computations require not only the assumptions used to compute the confidence interval, but also further
assumptions about the size of effects in the model. These further assumptions are summarized in what is called
a prior distribution, and the resulting intervals are usually called Bayesian posterior (or credible) intervals to
distinguish them from confidence intervals.
Symmetrically, the misinterpretation of a small P value as disproving the test hypothesis could be translated into:

20. An effect size outside the 95 % confidence interval has been refuted (or excluded) by the data.
No! As with the P value, the confidence interval is computed from many assumptions, the violation of which
may have led to the results. Thus, it is the combination of the data with the assumptions, along with the
arbitrary 95 % criterion, that are needed to declare an effect size outside the interval is in some way
incompatible with the observations. Even then, judgements as extreme as saying the effect size has been
refuted or excluded will require even stronger conditions.

As with P values, naive comparison of confidence intervals can be highly misleading:

21. If two confidence intervals overlap, the difference between two estimates or studies is not significant.
No! The 95 % confidence intervals from two subgroups or studies may overlap substantially and yet the test for
difference between them may still produce P < 0.05. Suppose for example, two 95 % confidence intervals for
means from normal populations with known variances are (1.04, 4.96) and (4.16, 19.84); these intervals overlap,
yet the test of the hypothesis of no difference in effect across studies gives P = 0.03. As with P values,
comparison between groups requires statistics that directly test and estimate the differences across groups. It
can, however, be noted that if the two 95 % confidence intervals fail to overlap, then when using the same
assumptions used to compute the confidence intervals, we will find P < 0.05 for the difference; and if one of the
95 % intervals contains the point estimate from the other group or study, we will find P > 0.05 for the difference.

Finally, as with P values, the replication properties of confidence intervals are usually misunderstood:

22. An observed 95 % confidence interval predicts that 95 % of the estimates from future studies will fall inside
the observed interval.
No! This statement is wrong in several ways. Most importantly, under the model, 95 % is the frequency with
which other unobserved intervals will contain the true effect, not how frequently the one interval being
presented will contain future estimates. In fact, even under ideal conditions the chance that a future estimate
will fall within the current interval will usually be much less than 95 %. For example, if two independent studies
of the same quantity provide unbiased normal point estimates with the same standard errors, the chance that
the 95 % confidence interval for the first study contains the point estimate from the second is 83 % (which is the
chance that the difference between the two estimates is less than 1.96 standard errors). Again, an observed
interval either does or does not contain the true effect; the 95 % refers only to how often 95 % confidence
intervals computed from very many studies would contain the true effect if all the assumptions used to compute
the intervals were correct.

23. If one 95 % confidence interval includes the null value and another excludes that value, the interval excluding
the null is the more precise one.
No! When the model is correct, precision of statistical estimation is measured directly by confidence interval
width (measured on the appropriate scale). It is not a matter of inclusion or exclusion of the null or any other
value. Consider two 95 % confidence intervals for a difference in means, one with limits of 5 and 40, the other
with limits of -5 and 10. The first interval excludes the null value of 0 but is 30 units wide. The second includes
the null value but is half as wide and therefore much more precise.

Common misinterpretations of power

The power of a test to detect a correct alternative hypothesis is the pre-study probability that the test will
reject the test hypothesis (e.g., the probability that P will not exceed a pre-specified cut-off such as 0.05). (The
corresponding pre-study probability of failing to reject the test hypothesis when the alternative is correct is one minus
the power, also known as the Type-II or beta error rate). As with P values and confidence intervals, this probability is
defined over repetitions of the same study design and so is a frequency probability. One source of reasonable
alternative hypotheses is the effect sizes that were used to compute power in the study proposal. Pre-study power
calculations do not, however, measure the compatibility of these alternatives with the data actually observed, while
power calculated from the observed data is a direct (if obscure) transformation of the null P value and so provides no
test of the alternatives. Thus, presentation of power does not obviate the need to provide interval estimates and
direct tests of the alternatives.
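
For concreteness, here is a minimal Python sketch of a pre-study power calculation for a simple two-sided z-test; the true effect size (delta) and standard error (se) are hypothetical design inputs, not values from any study described here:

```python
# Pre-study power: probability that a two-sided z-test at level alpha
# rejects H0 when the true effect is `delta` and the standard error is `se`.
from scipy.stats import norm

def power_two_sided_z(delta, se, alpha=0.05):
    z_crit = norm.ppf(1 - alpha / 2)        # critical value, 1.96 for alpha=.05
    shift = delta / se                      # true effect in standard-error units
    # test statistic ~ Normal(shift, 1); add the two rejection-region tails
    return norm.cdf(-z_crit - shift) + (1 - norm.cdf(z_crit - shift))

print(power_two_sided_z(delta=5, se=2))     # ~0.70 for these made-up inputs
```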

For these reasons, many authors have condemned use of power to interpret estimates and statistical tests, arguing that
(in contrast to confidence intervals) it distracts attention from direct comparisons of hypotheses and introduces new
misinterpretations, such as:
24. If you accept the null hypothesis because the null P value exceeds 0.05 and the power of your test is 90 %, the
chance you are in error (the chance that your finding is a false negative) is 10%.
No! If the null hypothesis is false and you accept it, the chance you are in error is 100 %, not 10%. Conversely, if
the null hypothesis is true and you accept it, the chance you are in error is 0 %. The 10 % refers only to how
often you would be in error over very many uses of the test across different studies when the particular
alternative used to compute power is correct and all other assumptions used for the test are correct in all the
studies. It does not refer to your single use of the test or your error rate under any alternative effect size other
than the one used to compute power.

25. If the null P value exceeds 0.05 and the power of this test is 90% at an alternative, the results support the null
over the alternative.
This claim seems intuitive to many, but counterexamples are easy to construct in which the null P value is
between 0.05 and 0.10, and yet there are alternatives whose own P value exceeds 0.10 and for which the power
is 0.90. Parallel results ensue for other accepted measures of compatibility, evidence, and support, indicating
that the data show lower compatibility with and more evidence against the null than the alternative, despite the
fact that the null P value is ‘‘not significant’’ at the 0.05 alpha level and the power against the alternative is ‘‘very
high’’.

Z-score statistics

 First specific example of a test statistic – the term indicates that the sample data are converted into a single, specific statistic that is used to test the hypotheses
 Formal method for comparing the sample data and the population hypothesis
 z-score formula as a recipe
o In a hypothesis test with z-scores, we have a formula (recipe) for z-scores but one ingredient is missing –
specifically, population mean, µ, is unknown. Therefore, we will follow these steps:
1. Make a hypothesis about the value of µ. This is the null hypothesis.
2. Plug the hypothesized value in the formula along with the other values (ingredients).
3. If the formula produces a z-score near zero (which is where z-scores are supposed to be), we
conclude that the hypothesis was correct. On the other hand, if the formula produces an
extreme value (a very unlikely result), we conclude that the hypothesis was wrong.
 Z-score formula as a ratio
z = (M − µ) / σM = (obtained difference between data and hypothesis) / (standard distance expected by chance)

Thus, for example, a z-score of z = 3.00 means that the obtained difference between the sample and the
hypothesis is 3 times bigger than would be expected if the treatment had no effect.

In general, a large value for a test statistic like the z-score indicates a large discrepancy between the sample data
and the null hypothesis. Specifically, a large value indicates that the sample data are very unlikely to have
occurred by chance alone. Therefore, when we obtain a large value (in the critical region), we conclude that it must have been caused by a treatment effect.
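
A minimal Python sketch of the recipe above, with hypothetical values for the hypothesized mean, sample size, and (assumed known) population standard deviation:

```python
# One-sample z test: does the treated sample mean differ from H0's mu?
import math
from scipy.stats import norm

mu_H0, sigma, n = 80, 15, 25       # hypothesized mean, known SD, sample size
M = 87                             # observed sample mean after treatment

se = sigma / math.sqrt(n)          # standard error of the mean
z = (M - mu_H0) / se               # obtained difference / difference expected by chance
p = 2 * (1 - norm.cdf(abs(z)))     # two-tailed P value

print(z, p)                        # z = 2.33, p = 0.02 -> in the .05 critical region
```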
Things to Remember When Interpreting P Values

1. P-values summarize statistical significance and do not address clinical significance. There are instances
where results are both clinically and statistically significant - and others where they are one or the
other but not both. This is because p-values depend upon both the magnitude of association and the
precision of the estimate (the sample size). When the sample size is large, results can reach statistical
significance (i.e., small p-value) even when the effect is small and clinically unimportant. Conversely,
with small sample sizes, results can fail to reach statistical significance even when the effect is large and potentially clinically important. It is extremely important to assess both the statistical and clinical significance of results.
2. Statistical tests allow us to draw conclusions of significance or not based on a comparison of the p-
value to our selected level of significance. Remember that this conclusion is based on the selected level
of significance ( α ) and could change with a different level of significance. While α =0.05 is standard, a
p-value of 0.06 should be examined for clinical importance.
3. When conducting any statistical analysis, there is always a possibility of an incorrect conclusion. With
many statistical analyses, this possibility is increased. Investigators should only conduct the statistical
analyses (e.g., tests) of interest and not all possible tests.
4. Many investigators inappropriately believe that the p-value represents the probability that the null
hypothesis is true. P-values are computed based on the assumption that the null hypothesis is true. The
p-value is the probability that the data could deviate from the null hypothesis as much as they did or
more. Consequently, the p-value measures the compatibility of the data with the null hypothesis, not
the probability that the null hypothesis is correct.
5. Statistical significance does not take into account the possibility of bias or confounding - these issues
must always be investigated.
6. Evidence-based decision making is important in public health and in medicine, but decisions are rarely
made based on the finding of a single study. Replication is always important to build a body of evidence
to support findings.

Lesson 3: UNCERTAINTY AND ERRORS IN HYPOTHESIS TESTING

Type I Error
 Occurs when a researcher rejects the null hypothesis that is actually true. In a typical research situation, a Type I
error means the researcher concludes that a treatment does have an effect when in fact it has no effect.
 It is not a stupid mistake in the sense that a researcher is overlooking something that should be perfectly
obvious
 Will create a false reporting on research results – thus researchers may try to build theories or develop other
experiments based on the false result
 Occurs when a researcher unknowingly obtains an extreme, nonrepresentative sample – but hypothesis test is
structured to minimize the risk that this will occur
 Probability of a Type I error is equal to the alpha level
 NOTE: the alpha level for a hypothesis test is the probability that the test will lead to a Type I error. That is, the
alpha level determines the probability of obtaining sample data in the critical region even though the null
hypothesis is correct.

Type II Error
 Occurs when a researcher fails to reject a null hypothesis that is actually false
 Failure to reject a false null hypothesis – means that a treatment effect really exists, but the hypothesis test fails
to detect it
 In a typical research situation, it means that the hypothesis test has failed to detect a real treatment effect
 Occurs when a sample mean is not in the critical region even though the treatment has an effect on the sample
– often this happens when the effect of the treatment is relatively small
 Consequences are usually not as serious as those of a Type I error
 In general terms, it means that the research data do not show the results that the researcher had hoped to
obtain

Alpha Level

 Determines the probability of a Type I error, thus, tend to be very small probability values
 By convention, the largest permissible value is α = .05
 When there is no treatment effect, an alpha level of .05 means that there is still a 5% risk, or 1-in-20 probability,
of rejecting the null hypothesis and committing a Type I error
 In general, researchers try to maintain a balance between the risk of a Type I error and the demands of the
hypothesis test
 Levels .05, .01, and .001 are considered reasonably good values (provide a low risk of error without placing
excessive demands on the research results)

Lesson 4: ONE-TAILED AND TWO-TAILED HYPOTHESIS TESTING

Directional (One-Tailed) Hypothesis Test- Statistical hypotheses (H0 and H1) specify either an increase or a
decrease in the population mean. That is, they make a statement about the direction of the effect.

The hypotheses for a Directional Test


 First step (the most critical step) is to state the statistical hypotheses. Example: suppose that without medication the mean improvement in patient condition is µ = 20%.
H0: µ ≤ 20%. (With medication improvement of patient condition is not greater than 20% on average)
H1: µ > 20%. (With medication improvement of patient condition is greater than 20% on average.)

 Critical region: in a one-tailed test, the entire critical region (e.g., all 5% when α = .05) is located in the one tail of the distribution predicted by H1.

Comparison of One-Tailed vs. Two-Tailed


 Major distinction between one-tailed and two-tailed tests is in the criteria they use for rejecting H0
o One-tailed test allows you to reject the null hypothesis when the difference between the sample and
the population is relatively small, provided the difference is in the specified direction
o Two-tailed requires a relatively large difference independent of direction
Takeaways
Two-Tailed Tests

 More rigorous and, therefore, more convincing than one-tailed test.


 Remember: two-tailed test demands more evidence to reject H0 and thus provides a
stronger demonstration that a treatment effect has occurred.

One-Tailed Tests

 More sensitive, that is, a relatively small treatment effect may be significant with a one-tailed test but fail to reach significance with a two-tailed test
 More precise because they test hypotheses about a specific directional effect instead of
an indefinite hypothesis about a general effect

Generalization:

 Two-tailed tests should be used in research situations when there is no strong directional expectation or when there are two competing predictions. For example: a two-tailed test would be appropriate for a study in which one theory predicts an increase in scores, but another theory predicts a decrease.
 One-tailed tests should be used only in situations when the directional prediction is
made before the research is conducted and there is a strong justification for making
the directional prediction.
 In particular, if a two-tailed test fails to reach significance, you should never follow
up with a one-tailed test as a second attempt to salvage a significant result for the
same data

Lesson 5: NORMALITY TESTS

Normality Tests
 Used as a basis for deciding whether parametric tests may be performed
 If a variable fails a normality test, it is critical to look at the histogram and the normal probability plot to see if an
outlier or a small subset of outliers has caused the non-normality
 If there are no outliers, you might try a transformation (such as, the log or square root) to make the data normal
 If a transformation is not a viable alternative, nonparametric methods may be used
 common misconception that a histogram is always a valid graphical tool for assessing normality (histogram
needs large sample size to display an accurate picture of normality)
 Other graphical displays may be used when a histogram is not suitable, e.g., the box plot, the density trace, and the normal probability plot
 Generally, have small statistical power (probability of detecting non-normal data) unless the sample sizes are at
least over 100
1. Shapiro-Wilk W Test
 developed by Shapiro and Wilk (1965) for sample sizes up to 20
 most powerful test in most situations
 the ratio of two estimates of the variance of a normal distribution based on a random sample of n observations
 The numerator is proportional to the square of the best linear estimator of the standard deviation. The
denominator is the sum of squares of the observations about the sample mean.

W = (Σ ai Y(i))² / Σ (Yi − Ȳ)²

where Y(i) is the i-th order statistic and ai is the i-th expected value of the normalized order statistics. If W is significantly less than 1, the hypothesis of normality will be rejected.
 W may be written as the square of the Pearson correlation coefficient between the ordered observations and a set of weights which are used to calculate the numerator. Since these weights are asymptotically proportional to the corresponding expected normal order statistics, W is roughly a measure of the straightness of the normal quantile-quantile plot. Hence, the closer W is to one, the more normal the sample is.
 probability values for W are valid for sample sizes greater than 3
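
For illustration, the Shapiro-Wilk test is exposed in Python's scipy as shapiro; a minimal sketch on simulated data:

```python
# Shapiro-Wilk W test on simulated normal data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=50, scale=10, size=35)

w, p = stats.shapiro(x)
print(f"W = {w:.3f}, p = {p:.3f}")   # W close to 1 and a large p give no
                                     # evidence against normality
```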

2. Anderson-Darling Test
 Developed by Anderson and Darling (1954)
 has been found to be as powerful as the Shapiro-Wilk test

3. Martinez-Iglewics Test
 Developed by Iglewicz and Martinez (1981)
 based on the median and a robust estimator of dispersion
 this test is very powerful for heavy-tailed symmetric distributions as well as a variety of other situations
 test is recommended for exploratory data analysis by Hoaglin (1983)
 Formula:

I = Σ(xi − M)² / ((n − 1) Sbi²)

where M is the sample median and Sbi² is a biweight estimator of scale; values of I close to 1 indicate normality

4. Kolmogorov-Smirnov Test
 first derived by Kolmogorov (1933) and later modified and proposed as a test by Smirnov (1948)
 based on the maximum difference between the observed distribution and expected cumulative-normal
distribution
 Since it uses the sample mean and standard deviation to calculate the expected normal distribution, the
Lilliefors’ adjustment is used
 the smaller the maximum difference the more likely that the distribution is normal
 shown to be less powerful than the other tests in most situations.
 it is included in the test of normality because of its historical popularity
 Test statistic:

D = max |F(X, µ, σ) − Fn(X)|

where F(X, µ, σ) is the theoretical cumulative distribution function of the normal distribution and Fn(X) is the empirical distribution function of the data. Large values of D indicate that the data are not normal.
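
A minimal sketch using scipy's kstest; note that because the mean and standard deviation are estimated from the same data here, a strictly valid P value would need the Lilliefors adjustment mentioned above (available, for example, as lilliefors in statsmodels):

```python
# Kolmogorov-Smirnov test against a fitted normal distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(loc=0, scale=1, size=100)

d, p = stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1)))
print(f"D = {d:.3f}, p = {p:.3f}")   # a small maximum difference D is
                                     # compatible with normality
```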

5. D’Agostino-Pearson Omnibus Test


 Skewness is generally measured to assess the symmetry or asymmetry of a distribution, while kurtosis is used to evaluate its shape (peakedness and tail weight)
 An omnibus test based on the skewness and kurtosis statistics, which are in turn computed from sample moments
 Test statistic is:

K2 = Z(√b1)² + Z(b2)²

where Z(√b1) and Z(b2) are normal approximations of the sample skewness √b1 and sample kurtosis b2, respectively. This statistic follows a chi-squared distribution with two degrees of freedom if the population is normally distributed; a significantly large value of K2 leads to rejection of the normality assumption.
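
scipy exposes this omnibus K2 test as normaltest; a minimal sketch on deliberately skewed simulated data:

```python
# D'Agostino-Pearson omnibus test (combines skewness and kurtosis z-scores).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.exponential(scale=2.0, size=200)   # strongly right-skewed data

k2, p = stats.normaltest(x)
print(f"K2 = {k2:.2f}, p = {p:.4f}")       # large K2, tiny p -> reject normality
```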

6. Jarque-Bera Test
 Originally proposed by Bowman and Shenton (1975)
 Combines the squares of the normalized skewness and kurtosis in a single statistic:

JB = (n/6)[S² + (K − 3)²/4]

This normalization is based on normality, since S = 0 and K = 3 for a normal distribution and their asymptotic variances are 6/n and 24/n respectively. Hence, under normality the JB test statistic also follows a chi-squared distribution with two degrees of freedom. A significantly large value of JB leads to the rejection of the normality assumption.
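
A minimal sketch of the Jarque-Bera test in scipy, on simulated heavy-tailed data:

```python
# Jarque-Bera test: sensitive to skewness (S != 0) and excess kurtosis (K != 3).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.standard_t(df=3, size=500)       # heavy tails, so kurtosis K > 3

jb, p = stats.jarque_bera(x)
print(f"JB = {jb:.2f}, p = {p:.4f}")     # large JB -> reject normality
```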

Checking normality for parametric tests in SPSS

One of the assumptions for most parametric tests to be reliable is that the data is approximately normally distributed.
The normal distribution peaks in the middle and is symmetrical about the mean. Data does not need to be perfectly
normally distributed for the tests to be reliable.

Graphical methods for assessing if data is normally distributed


Plotting a histogram of the variable of interest will give an indication of the shape of the distribution. A normal
approximation curve can also be added by editing the graph. Below are examples of histograms of approximately
normally distributed data and heavily skewed data with equal sample sizes.

It is very unlikely that a histogram of sample data will produce a perfectly smooth normal curve like the one displayed
over the histogram, especially if the sample size is small. As long as the data is approximately normally distributed, with
a peak in the middle and fairly symmetrical, the assumption of normality has been met.
The normal Q-Q plot is an alternative graphical method of assessing normality to the histogram and is easier to use
when there are small sample sizes. The scatter should lie as close to the line as possible with no obvious pattern coming
away from the line for the data to be considered normally distributed. Below are the same examples of normally
distributed and skewed data.

Tests for assessing if data is normally distributed


There are also specific methods for testing normality, but these should be used in conjunction with either a histogram or
a Q-Q plot. The Kolmogorov-Smirnov test and the Shapiro-Wilk’s W test determine whether the underlying distribution
is normal. Both tests are sensitive to outliers and are influenced by sample size:
 For smaller samples, non-normality is less likely to be detected but the Shapiro-Wilk test should be preferred as
it is generally more sensitive.
 For larger samples (i.e. more than one hundred), the normality tests are overly sensitive, and the assumption of normality might be rejected too easily (see robust exceptions below). Any assessment should also
include an evaluation of the normality of histograms or Q-Q plots as these are more appropriate for assessing
normality in larger samples.

Hypothesis test for a test of normality

Null hypothesis: The data is normally distributed

For both of these examples, the sample size is 35 so the Shapiro-Wilk test should be used. For the skewed data, p =
0.002 suggesting strong evidence of non-normality. For the approximately normally distributed data, p = 0.582, so the
null hypothesis is retained at the 0.05 level of significance. Therefore, normality can be assumed for this data set and,
provided any other test assumptions are satisfied, an appropriate parametric test can be used.

What if the data is not normally distributed?

If the checks suggest that the data is not normally distributed, there are three options:
1. Transform the dependent variable (repeating the normality checks on the transformed data): Common transformations include taking the log or square root of the dependent variable (see the sketch after this list).
2. Use a non-parametric test: Non-parametric tests are often called distribution free tests and can be used instead
of their parametric equivalent.
3. Use a parametric test under robust exceptions: These are conditions when the parametric test can still be used for data which is not normally distributed; these exceptions are specific to each individual parametric test.
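
A minimal Python sketch of option 1, re-checking normality after a log transformation of hypothetical right-skewed data:

```python
# Transform a skewed dependent variable and repeat the normality check.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
y = rng.lognormal(mean=1.0, sigma=0.6, size=60)   # right-skewed outcome

w_raw, p_raw = stats.shapiro(y)          # small p expected: normality rejected
w_log, p_log = stats.shapiro(np.log(y))  # log scale is normal here: large p
print(f"raw p = {p_raw:.4f}, log-transformed p = {p_log:.4f}")
```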
Key non-parametric tests

Although non-parametric tests require fewer assumptions and can be used on a wider range of data types, parametric
tests are preferred because they are more sensitive at detecting differences between samples or an effect of the
independent variable on the dependent variable. This means that to detect any given effect at a specified significance
level, a larger sample size is required for the non-parametric test than the equivalent parametric test when the data is
normally distributed. However, some statisticians argue that non-parametric methods are more appropriate with small
sample sizes.

Where to find non-parametric tests in SPSS

Parametric vs nonparametric tests


 Parametric tests
 assume underlying statistical distributions in the data – several conditions of validity must be met so that the
result is reliable
 For example, Student’s t-test for two independent samples is reliable only if each sample follows a normal
distribution and if sample variances are homogeneous.
 Nonparametric tests
 do not rely on any distribution
 applied even if parametric conditions of validity are not met
 often used to analyze ordinal or nominal data with small sample sizes
 sometimes referred to as a distribution-free method
 When to use:
o when the outcome is an ordinal variable or a rank,
o when there are definite outliers or
o when the outcome has clear limits of detection.
 Advantages:
o Good to use when observations are nominal, ordinal (ranked), subject to outliers or measured
imprecisely
o Relatively simple to conduct
o Assumptions are not necessary (often dubious or even surely untrue)
o Do not rely on large samples
 Disadvantages:
o Lack of statistical power if the assumptions of a roughly equivalent parametric test are valid
o Unfamiliarity
o Computing time (many non-parametric methods are computer intensive)
o Geared toward hypothesis testing rather than estimation
o Tied values can be problematic when these are common, and adjustments to the test statistic may be
necessary.

Hypothesis Testing with Nonparametric Tests


In nonparametric tests, the hypotheses are not about population parameters (e.g., μ=50 or μ1=μ2). Instead,
the null hypothesis is more general. For example, when comparing two independent groups in terms of a
continuous outcome, the null hypothesis in a parametric test is H0: μ1 =μ2. In a nonparametric test the null
hypothesis is that the two populations are equal, often this is interpreted as the two populations are equal in
terms of their central tendency.
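
As an illustration, here is a minimal Python sketch contrasting the parametric and nonparametric versions of this two-group comparison on hypothetical skewed data:

```python
# Two independent groups: parametric t-test (H0: mu1 = mu2) versus the
# nonparametric Mann-Whitney U test (H0: the two populations are equal).
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
group1 = rng.exponential(scale=2.0, size=20)   # skewed outcome, group 1
group2 = rng.exponential(scale=3.5, size=20)   # skewed outcome, group 2

t, p_t = stats.ttest_ind(group1, group2)
u, p_u = stats.mannwhitneyu(group1, group2, alternative='two-sided')
print(f"t-test p = {p_t:.3f}; Mann-Whitney p = {p_u:.3f}")
```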
MODULE 5
Lesson 1: INTRODUCTION TO EPIDEMIOLOGY

Definition of Epidemiology

 Epidemiology is the (i)study of the (ii)distribution and (iii)determinants of (iv)health-related states or events in
(v)specified populations, and the (vi)application of this study to the control of health problems.
 Come from the Greek word epi, meaning on or upon, demos, meaning people, and logos, meaning the study of
 It is the basic science of public health for two good reasons:
1. Epidemiology is a quantitative discipline that relies on a working knowledge of probability, statistics, and
sound research methods.
2. Epidemiology is a method of causal reasoning based on developing and testing hypotheses grounded in
such scientific fields as biology, behavioral sciences, physics, and ergonomics to explain health-related
behaviors, states, and events.

I. Study
 scientific discipline with sound methods of scientific inquiry at its foundation
 data-driven and relies on a systematic and unbiased approach to the collection, analysis, and interpretation of
data
 basic epidemiologic methods tend to rely on careful observation and use of valid comparison groups to assess
whether what was observed, such as the number of cases of disease in a particular area during a particular time
period or the frequency of an exposure among persons with disease, differs from what might be expected
 also draws on methods from other scientific fields, including biostatistics and informatics, with biologic,
economic, social, and behavioral sciences

II. Distribution
 concerned with the frequency and pattern of health events in a population
o Frequency: refers not only to the number of health events such as the number of cases of meningitis or
diabetes in a population, but also to the relationship of that number to the size of the population

o Pattern: refers to the occurrence of health-related events by time, place, and person. Time patterns may
be annual, seasonal, weekly, daily, hourly, weekday versus weekend, or any other breakdown of time
that may influence disease or injury occurrence

III. Determinants
 the causes and other factors that influence the occurrence of disease and other health-related events
 illness does not occur randomly in a population, but happens only when the right accumulation of risk factors or
determinants exists in an individual

IV. Health-related states or events


 non-communicable and communicable diseases
 chronic diseases, injuries, birth defects, maternal-child health, occupational health, and environmental health
 behaviors related to health and well-being
 examining genetic markers of disease risk

V. Specified populations
 concerned about the collective health of the people in a community or population
 the clinician’s “patient” is the individual; the epidemiologist’s “patient” is the community

VI. Application
 also involves applying the knowledge gained by the studies to community-based practice
 both a science and art
 epidemiologist uses the scientific methods of descriptive and analytic epidemiology as well as experience,
epidemiologic judgment, and understanding of local conditions in “diagnosing” the health of a community and
proposing appropriate, practical, and acceptable public health interventions to control and prevent disease in
the community

Uses
 Assessing the community’s health
 Making individual decisions
 Completing the clinical picture
 Searching for causes

Six major tasks of epidemiology in public health practice:


1. public health surveillance
2. field investigation
3. analytic studies
4. evaluation, and
5. linkages
6. policy development (recently added)

1. Public health surveillance


 ongoing, systematic collection, analysis, interpretation, and dissemination of health data to help guide public
health decision making and action
 monitoring the pulse of the community
 sometimes called “information for action”
 Purpose: portray the ongoing patterns of disease occurrence and disease potential so that investigation, control, and prevention measures can be applied efficiently and effectively – this is accomplished through the systematic collection and evaluation of morbidity and mortality reports and other relevant health information, and the dissemination of these data and their interpretation to those involved in disease control and public health decision making

2. Field investigation
 objective of an investigation may simply be to learn more about the natural history, clinical spectrum,
descriptive epidemiology, and risk factors of the disease before determining what disease intervention methods
might be appropriate
 “shoe leather epidemiology” – conjuring up images of dedicated epidemiologists pounding the pavement in search of additional cases and clues regarding source and mode of transmission

3. Analytical studies
 Use of rigorous methods: surveillance and field investigations usually provide clues or hypotheses about causes and modes of transmission, and analytic studies evaluate the credibility of those hypotheses
 Use of a valid comparison group: hallmark of an analytic epidemiology study
 Epidemiologist must be skilled in all aspects of such studies, including design, conduct, analysis, interpretation,
and communication of findings
o Design includes determining the appropriate research strategy and study design, writing justifications
and protocols, calculating sample sizes, deciding on criteria for subject selection (e.g., developing case
definitions), choosing an appropriate comparison group, and designing questionnaires
o Conduct involves securing appropriate clearances and approvals, adhering to appropriate ethical
principles, abstracting records, tracking down and interviewing subjects, collecting and handling
specimens, and managing the data.
o Analysis begins with describing the characteristics of the subjects. It progresses to calculation of rates,
creation of comparative tables (e.g., two-by-two tables), and computation of measures of association
(e.g., risk ratios or odds ratios), tests of significance (e.g., chi-square test), confidence intervals, and the
like. Many epidemiologic studies require more advanced analytic techniques such as stratified analysis,
regression, and modeling.
o Interpretation involves putting the study findings into perspective, identifying the key take-home
messages, and making sound recommendations. Doing so requires that the epidemiologist be
knowledgeable about the subject matter and the strengths and weaknesses of the study

4. Evaluation
 process of determining, as systematically and objectively as possible, the relevance, effectiveness, efficiency,
and impact of activities with respect to established goals
 Effectiveness refers to the ability of a program to produce the intended or expected results in the field;
effectiveness differs from efficacy, which is the ability to produce results under ideal conditions.
 Efficiency refers to the ability of the program to produce the intended results with a minimum expenditure of
time and resources
 May focus on plans (formative evaluation), operations (process evaluation), impact (summative evaluation), or
outcomes — or any combination of these

5. Linkages
 Working in a team especially during outbreak investigations, epidemiologist may act as a member or the leader
of a multidisciplinary team
 Maintaining linkages through official memoranda of understanding, sharing of published or on-line information for public health audiences and outside partners, and informal networking that takes place at professional meetings

6. Policy development
 Epidemiologists working in public health regularly provide input, testimony, and recommendations regarding
disease control strategies, reportable disease regulations, and health-care policy.

Lesson 2: DESCRIPTIVE AND ANALYTICAL EPIDEMIOLOGY

An epidemiologist:
 Counts
 Divides
 Compares

The 5W’s of epidemiology:


o What = health issue of concern
o Who = person
o Where = place
o When = time
o Why/how = causes, risk factors, modes of transmission

Descriptive Epidemiology
 time, place, and person

Time
 Occurrence of disease changes over time
 Some changes occur regularly, while others are unpredictable
 Uses time data: displayed with a two-dimensional graph (vertical axis or y-axis = number or rate of cases; x-axis =
time periods, i.e. years, months, or days)
Place
 Describing the occurrence of disease by place provides insight into the geographic extent of the problem and its
geographic variation
 Not only to place of residence but to any geographic location relevant to disease occurrence

Person
 may use inherent characteristics of people (for example, age, sex, race), biologic characteristics (immune status), acquired characteristics (marital status), activities (occupation, leisure activities, use of medications/tobacco/drugs), or the conditions under which they live (socioeconomic status, access to medical care)
 “Person” attributes include age, sex, ethnicity/race, and socioeconomic status

Analytical Epidemiology
 Use descriptive epidemiology to generate hypotheses
 Use to test generated hypotheses
 Key feature of analytic epidemiology = comparison group
 Concerned with the search for causes and effects, or the why and the how
 Used to quantify the association between exposures and outcomes and to test hypotheses about causal
relationships
 Epidemiologic studies fall into two categories: experimental and observational

Experimental studies
 the investigator determines through a controlled process the exposure for each individual (clinical trial) or
community (community trial), and then tracks the individuals or communities over time to detect the effects of
the exposure

Observational studies
 the epidemiologist simply observes the exposure and disease status of each study participant
 Example: John Snow’s studies of cholera in London
 The two most common types of observational studies:
o cohort studies and
o case-control studies;
o a third type is cross-sectional studies
Spot map of deaths from cholera in Golden Square area, London, 1854

Cohort study
 similar in concept to the experimental study
 epidemiologist records whether each study participant is exposed or not, and then tracks the participants to
see if they develop the disease of interest
 differs from an experimental study because the investigator observes rather than determines the participant’s
exposure status
 After a period of time, the investigator compares the disease rate in the exposed group with the disease rate in
the unexposed group. The unexposed group serves as the comparison group, providing an estimate of the
baseline or expected amount of disease occurrence in the community. If the disease rate is substantively
different in the exposed group compared to the unexposed group, the exposure is said to be associated with
illness.
 The Framingham study is a well-known cohort study that has followed over 5,000 residents of Framingham,
Massachusetts, since the early 1950s to establish the rates and risk factors for heart disease
 Example: Nurses Health Study and the Nurses Health Study II are cohort studies established in 1976 and 1989,
respectively, that have followed over 100,000 nurses each and have provided useful information on oral
contraceptives, diet, and lifestyle risk factors
 Prospective or follow-up cohort studies – participants are enrolled as the study begins and are then followed
prospectively over time to identify occurrence of the outcomes of interest
 Retrospective cohort study – both the exposure and the outcomes have already occurred

Advantages

1. Clarity of Temporal Sequence (Did the exposure precede the outcome?): Cohort studies more clearly indicate the
temporal sequence between exposure and outcome, because in a cohort study, subjects are known to be disease-
free at the beginning of the observation period when their exposure status is established. In case-control studies,
one begins with diseased and non-diseased people and then ascertains their prior exposures. This is a reasonable
approach to establishing past exposures, but subjects may have difficulty remembering past exposures, and their
recollection may be biased by having the outcome (recall bias).
2. Allow Calculation of Incidence: Cohort studies allow you to calculate the incidence of disease in exposure groups (a worked sketch follows this list), so you can calculate:
 Absolute risk (incidence)
 Relative risk (risk ratio or rate ratio)
 Risk difference
 Attributable proportion (attributable risk %)

3. Facilitate Study of Rare Exposures: While a cohort design can be used to investigate common exposures (e.g., risk
factors for cardiovascular disease and cancer in the Nurses' Health Study), they are particularly useful for evaluating
the effects of rare or unusual exposures, because the investigators can make it a point to identify an adequate
number of subjects who have an unusual exposure, e.g.,

 Exposure to toxic chemicals (Agent Orange)


 Adverse effects of drugs (e.g., thalidomide) or treatments (e.g., radiation treatments for ankylosing spondylitis)
 Unusual occupational exposures (e.g., asbestos, or solvents in tire manufacturing)

4. Allow Examination of Multiple Effects of a Single Exposure

5. Avoid Selection Bias at Enrollment: Cohort studies, especially prospective cohort studies, reduce the possibility that
the results will be biased by selecting subjects for the comparison group who may be more or less likely to have the
outcome of interest, because in a cohort study the outcome is not known at baseline when exposure status is
established. Nevertheless, selection bias can occur in retrospective cohort studies (since the outcomes have already
occurred at the time of selection), and it can occur in prospective cohort studies as a result of differential loss to
follow up.
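
A worked sketch of the incidence-based measures listed in item 2, computed in Python from a hypothetical cohort 2×2 table (all counts are made up for illustration):

```python
# Hypothetical cohort: 200 exposed and 400 unexposed subjects followed
# for the same period; counts of incident cases are illustrative only.
exposed_cases, exposed_total = 40, 200
unexposed_cases, unexposed_total = 20, 400

risk_exposed = exposed_cases / exposed_total        # absolute risk (incidence) = 0.20
risk_unexposed = unexposed_cases / unexposed_total  # = 0.05

relative_risk = risk_exposed / risk_unexposed             # risk ratio = 4.0
risk_difference = risk_exposed - risk_unexposed           # = 0.15
attributable_proportion = risk_difference / risk_exposed  # attributable risk % = 75%

print(relative_risk, risk_difference, attributable_proportion)
```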

Disadvantages of Prospective Cohort Studies


1. You may have to follow large numbers of subjects for a long time.
2. They can be very expensive and time consuming.
3. They are not good for rare diseases.
4. They are not good for diseases with a long latency.
5. Differential loss to follow up can introduce bias.

Disadvantages of Retrospective Cohort Studies


1. As with prospective cohort studies, they are not good for very rare diseases.
2. If one uses records that were not designed for the study, the available data may be of poor quality.
3. There is frequently an absence of data on potential confounding factors if the data was recorded in the past.
4. It may be difficult to identify an appropriate exposed cohort and an appropriate comparison group.
5. Differential losses to follow up can also bias retrospective cohort studies.

Case-control study
 investigators start by enrolling a group of people with disease
 at CDC such persons are called case-patients rather than cases, because case refers to occurrence of disease, not
a person
 As a comparison group, the investigator then enrolls a group of people without disease (controls)
 Investigators then compare previous exposures between the two groups. The control group provides an
estimate of the baseline or expected amount of exposure in that population. If the amount of exposure among
the case group is substantially higher than the amount you would expect based on the control group, then
illness is said to be associated with that exposure
 Key in a case-control study is to identify an appropriate control group, comparable to the case group in most
respects, in order to provide a reasonable estimate of the baseline or expected exposure
Advantages:
 They are efficient for rare diseases or diseases with a long latency period between exposure and disease manifestation.
 They are less costly and less time-consuming; they are advantageous when exposure data is expensive or hard to obtain.
 They are advantageous when studying dynamic populations in which follow-up is difficult.

Disadvantages:
 They are subject to selection bias.
 They are inefficient for rare exposures.
 Information on exposure is subject to observation bias.
 They generally do not allow calculation of incidence (absolute risk).
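
Because cases and controls are sampled separately, incidence cannot be computed; instead, the exposure odds ratio serves as the measure of association. A minimal Python sketch with hypothetical counts:

```python
# Hypothetical case-control data: exposure counts among cases and controls.
cases_exposed, cases_unexposed = 60, 40
controls_exposed, controls_unexposed = 30, 70

# Odds ratio via the cross-product of the 2x2 table
odds_ratio = (cases_exposed * controls_unexposed) / (cases_unexposed * controls_exposed)
print(f"OR = {odds_ratio:.2f}")  # OR = 3.50: exposure more common among cases
```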

Given the greater efficiency of case-control studies, they are particularly advantageous in the following situations:
1. When the disease or outcome being studied is rare.
2. When the disease or outcome has a long induction and latent period (i.e., a long time between exposure and the
eventual causal manifestation of disease).
3. When exposure data is difficult or expensive to obtain.
4. When the study population is dynamic.
5. When little is known about the risk factors for the disease, case-control studies provide a way of testing
associations with multiple potential risk factors. (This isn't really a unique advantage to case-control studies,
however, since cohort studies can also assess multiple exposures.)

Another advantage of their greater efficiency, of course, is that they are less time-consuming and much less costly than
prospective cohort studies.
Cross-sectional study
 sample of persons from a population is enrolled and their exposures and health outcomes are measured
simultaneously
 tends to assess the presence (prevalence) of the health outcome at that point of time without regard to
duration
 weaker than either a cohort or a case-control study because a cross-sectional study usually cannot disentangle
risk factors for occurrence of disease (incidence) from risk factors for survival with the disease

Intervention studies
 Goal: test the efficacy of specific treatments or preventive measures by assigning individual subjects to one of
two or more treatment or prevention options. Intervention studies often test the efficacy of drugs, but one
might also use this design to test the efficacy of differing management strategies or regimens
 Two major types of intervention studies:
o Controlled clinical trials in which individual subjects are assigned to one or another of the competing
interventions, or
o Community interventions, in which an intervention is assigned to an entire group.
 Analogous to a prospective cohort study, except that the investigators assign or allocate the exposure
(treatment) under study
 Randomized clinical trials (RCTs) provide the best opportunity to control for confounding and avoid certain
biases
 Provide the most effective way to detect small to moderate benefits of one treatment over another. However,
in order to provide definitive answers, clinical trials must enroll a sufficient number of appropriate subjects and
follow them for an adequate period of time
 Long and expensive

Prevention trials (or prophylactic trials) versus Therapeutic Trials

Clinical trials might also be distinguished based on whether they are aimed at assessing preventive interventions or
evaluating new treatments for existing disease. The Physicians Health Study established that low-dose aspirin reduced
the risk of myocardial infarctions (heart attacks) in males. Other trials have assessed whether exercise or low-fat diet can
reduce the risk of heart disease or cancer. A study currently underway at BUSPH is testing whether peer counseling is
effective in helping smokers who live in public housing quit smoking. All of these are prevention trials. In contrast, there
have been many trials that have contributed to our knowledge about optimum treatment of many diseases through
medication, surgery, or other medical interventions.

Phases of Trials Evaluating New Drugs


Clinical trials for new drugs are conducted in phases with different purposes that depend on the stage of development.

Phase I trials: ClinicalTrials.gov describes phase I trials as "Initial studies to determine the metabolism and
pharmacologic actions of drugs in humans, the side effects associated with increasing doses, and to gain early evidence
of effectiveness; may include healthy participants and/or patients." Frequently, an experimental drug or treatment
initially is tested in a small group of people (8-80) to evaluate its safety and to explore possible side effects and the
doses at which they occur.

Phase II trials: ClinicalTrials.gov describes these as "Controlled clinical studies conducted to evaluate the effectiveness of
the drug for a particular indication or indications in patients with the disease or condition under study and to determine
the common short-term side effects and risks." The new treatment might be tested in a somewhat larger group (80-200)
to get more information about effectiveness and potential side effects at different dosages.

Phase III trials: ClinicalTrials.gov defines these as "Expanded controlled and uncontrolled trials after preliminary
evidence suggesting effectiveness of the drug has been obtained, and are intended to gather additional information to
evaluate the overall benefit-risk relationship of the drug and provide an adequate basis for physician labeling." These are
typically conducted in larger groups (200-40,000) to formally test effectiveness and establish the frequency and severity
of side effects compared to no treatment, or, compared to currently used treatments ("usual care")

Phase IV refers to post-marketing "surveillance" to collect information regarding risks, benefits, and optimal use. This
phase can be particularly important for identifying rare, but potentially devastating side effects
Individual versus Group (Community) Trials
 Most trials are conducted by allocating treatments or interventions to individual subjects, i.e., the treatment or
intervention is allocated to individuals
 In contrast, group trials allocate the intervention to groups of subjects. These types of trials are generally
conducted when the intervention is inherently operating at a group-level (e.g., changing a law or policy) or
because it would be difficult to give the intervention to some people in the group while withholding it from
others. Group units might be families, schools, or medical practices. A well-known type of group trial is a community trial, in which the intervention is allocated to entire communities or neighborhoods.

Lesson 3: BIAS AND CONFOUNDERS

Selection Bias

Selection bias can result when the selection of subjects into a study or their likelihood of being retained in the study
leads to a result that is different from what you would have gotten if you had enrolled the entire target population. If
one enrolled the entire population and collected accurate data on exposure and outcome, then one could compute the
true measure of association. We generally don't enroll the entire population; instead we take samples. However, if one
sampled the population in a fair way, such that the sampling from all four cells was fair and representative of the distribution
of exposure and outcome in the overall population, then one can obtain an accurate estimate of the true association
(assuming a large enough sample, so that random error is minimal and assuming there are no other biases or
confounding). Conceptually, this might be visualized by equal sized ladles (sampling) for each of the four cells.

The contingency table has columns (diseased and non-diseased) and rows (exposed and non-exposed). In this illustration the 4 exposure/disease categories have equal-sized ladles in them to convey the idea of unbiased sampling.
Fair Sampling Diseased Non-diseased

Exposed

Non-exposed

However, if sampling is not representative of the exposure-outcome distributions in the overall population, then the
measures of association will be biased, and this is referred to as selection bias. Consequently, selection bias can result
when the selection of subjects into a study or their likelihood of being retained in a cohort study leads to a result that
is different from what you would have gotten if you had enrolled the entire target population. One example of this
might be represented by the table below, in which the enrollment procedures resulted in disproportionately large
sampling of diseased subjects who had the exposure.

This contingency table has a larger ladle in the cell tabulating the number of exposed subjects with disease. This is to
indicate that there was a tendency to over-sample this category, for example, a case-control study in which cases were
more likely to be selected if they had been exposed.
Selection Bias Diseased Non-diseased

Exposed

Non-exposed
There are several mechanisms that can produce this unwanted effect:
1. Selection of a comparison group ("controls") that is not representative of the population that produced the
cases in a case-control study. (Control selection bias)
2. Differential loss to follow up in a cohort study, such that the likelihood of being lost to follow up is related to
outcome status and exposure status. (Loss to follow-up bias)
3. Refusal, non-response, or agreement to participate that is related to the exposure and disease (Self-selection
bias)
4. Using the general population as a comparison group for an occupational cohort study ("Healthy worker effect")
5. Differential referral or diagnosis of subjects

 Selection Bias in Case-Control Studies


1. Control selection bias
 occurs when subjects for the "control" group are not truly representative of the population that
produced the cases. Remember that in a case-control study the controls are used to estimate the
exposure distribution (i.e., the proportion having the exposure) in the population from which the cases
arose. The exposure distribution in cases is then compared to the exposure distribution in the controls in
order to compute the odds ratio as a measure of association.
 The “would” criterion
o Used by epidemiologist to test for the possibility of selection bias; they ask "If a control had had
the disease, would they have been likely to be enrolled as a case?" If the answer is 'yes', then
selection bias is unlikely

2. Self-selection bias
 can be introduced into case-control studies with low response or participation rates if the likelihood of
responding or participating is related to both the exposure and the outcome

3. Differential surveillance, referral, or diagnosis of subject


 Could be minimized by more restrictive case selection criteria

 Selection Bias in Cohort Studies


1. Subject selection bias
 more common in a retrospective cohort study, especially if individuals have to provide informed consent
for participation
 can occur if selection or choice of the exposed or unexposed subjects in a retrospective cohort study is
somehow related to the outcome of interest

2. Loss to follow up bias


 The only way to prevent bias from loss to follow-up is to maintain high follow up rates (>80%). This can
be achieved by:
1. Enrolling motivated subjects
2. Using subjects who are easy to track
3. Making questionnaires as easy to complete as possible
4. Maintaining the interest of participants and making them feel that the study is important
5. Providing incentives

3. The “healthy worker” effect


 special type of selection bias that occurs in cohort studies of occupational exposures when the general
population is used as the comparison group
 The general population consists of both healthy people and unhealthy people. Those who are not
healthy are less likely to be employed, while the employed work force tends to have fewer sick people.
Moreover, people with severe illnesses would be most likely to be excluded from employment, but not
from the general population. As a result, comparisons of mortality rates between an employed group
and the general population will be biased.
Information Bias (Observation Bias)
 often referred to as misclassification, and the mechanism that produces these errors can result in either non-
differential or differential misclassification
 Ken Rothman (Epidemiology: An Introduction, Oxford University Press, 2002) distinguishes these as follows:
"For exposure misclassification, the misclassification is nondifferential if it is unrelated to the occurrence
or presence of disease; if the misclassification of exposure is different for those with and without disease, it is
differential. Similarly, misclassification of disease [outcome] is nondifferential if it is unrelated to the exposure;
otherwise, it is differential."
 Nondifferential misclassification of exposure
o means that the frequency of errors is approximately the same in the groups being compared.
Misclassification of exposure status is more of a problem than misclassification of outcome, but a
study may be biased by misclassification of either exposure status, or outcome status, or both
o bias towards the null
o mechanisms
 records may be incomplete, e.g., a medical record in which none of the healthcare workers remembered to ask about tobacco use
 errors in recording or interpreting information in records,
 errors in assigning codes to disease diagnoses by clerical workers who are unfamiliar with a
patient's hospital course, diagnosis, and treatment. Subjects completing questionnaires or
being interviewed may have difficulty in remembering past exposures.
 Differential misclassification of exposure
o If errors in classification of exposure status occur more frequently in one of the groups being
compared, then differential misclassification will occur, and the estimate of association can be
overestimated or underestimated.
o can cause bias either toward or away from the null, depending on the circumstances
o Mechanisms:
 Recall bias
 occurs when there are systematic differences in the way subjects remember or
report exposures or outcomes
 can occur in either case-control studies or retrospective cohort studies
 Ways to Reduce Recall Bias
o Use a control group that has a different disease (that is unrelated to the
disease under study).
o Use questionnaires that are carefully constructed in order to maximize
accuracy and completeness. Ask specific questions.
o For socially sensitive questions, such as alcohol and drug use or sexual
behaviors, use a self-administered questionnaire instead of an interviewer.
o If possible, assess past exposures from biomarkers or from pre-existing
records.
 Interviewer bias
 A.k.a. recorder bias
 occur when data is collected by review of medical records if the reviewer
(abstractor) interprets or records information differently for one group or if the
reviewer searches for information more diligently for one group
 Ways to Reduce Interviewer Bias
o Use standardized questionnaires consisting of closed-end, easy to
understand questions with appropriate response options.
o Train all interviewers to adhere to the question and answer format strictly,
with the same degree of questioning for both cases and controls.
o Obtain data or verify data by examining pre-existing records (e.g., medical
records or employment records) or assessing biomarkers.
 Misclassification of outcome
o Differential
o Nondifferential
 will generally bias toward the null, but there are situations in which it will not bias the risk ratio.
Bias in the risk difference depends upon the sensitivity (probability that someone who truly has
the outcome will be identified as such) and specificity (probability that someone who does not
have the outcome will be identified as such)

Confounding and Effect Modification


Confounding
 a distortion (inaccuracy) in the estimated measure of association that occurs when the primary exposure of
interest is mixed up with some other factor that is associated with the outcome

In the diagram below, the primary goal is to ascertain the strength of association between physical inactivity and heart
disease. Age is a confounding factor because it is associated with the exposure (meaning that older people are more
likely to be inactive), and it is also associated with the outcome (because older people are at greater risk of developing
heart disease).

In order for confounding to occur, the extraneous factor must be associated with both the primary exposure of interest
and the disease outcome of interest. For example, subjects who are physically active may drink more fluids (e.g., water
and sports drinks) than inactive people, but drinking more fluid has no effect on the risk of heart disease, so fluid intake is not a confounding factor here.

Or, if the age distribution is similar in the exposure groups being compared, then age will not cause confounding.

There are three conditions that must be present for confounding to occur:
1. The confounding factor must be associated with both the risk factor of interest and the outcome.
2. The confounding factor must be distributed unequally among the groups being compared.
3. A confounder cannot be an intermediary step in the causal pathway from the exposure of interest to the
outcome of interest.

Identifying Confounding
1. A simple, direct way to determine whether a given risk factor caused confounding is to compare the estimated
measure of association before and after adjusting for confounding. In other words, compute the measure of
association both before and after adjusting for a potential confounding factor. If the difference between the two
measures of association is 10% or more, then confounding was present. If it is less than 10%, then there was
little, if any, confounding. How to do this will be addressed in greater detail below.
2. Other investigators will determine whether a potential confounding variable is associated with the exposure of
interest and whether it is associated with the outcome of interest. If there is a clinically meaningful relationship between the variable and the risk factor and between the variable and the outcome (regardless of whether that relationship reaches statistical significance), the variable is regarded as a confounder.
3. Still other investigators perform formal tests of hypothesis to assess whether the variable is associated with the
exposure of interest and with the outcome.
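
A minimal Python sketch of the first approach above, comparing a crude risk ratio with an age-adjusted (Mantel-Haenszel) risk ratio; all stratified counts are hypothetical:

```python
# Each age stratum: (exposed cases, exposed total, unexposed cases, unexposed total).
# Numbers are chosen so that the risk ratio is 2.0 within each stratum.
strata = [(8, 400, 1, 100),     # younger stratum
          (10, 100, 20, 400)]   # older stratum

a = sum(s[0] for s in strata); n1 = sum(s[1] for s in strata)
b = sum(s[2] for s in strata); n0 = sum(s[3] for s in strata)
crude_rr = (a / n1) / (b / n0)                # ignores age entirely

# Mantel-Haenszel summary risk ratio: sum(a_i*N0_i/T_i) / sum(b_i*N1_i/T_i)
num = sum(ai * n0i / (n1i + n0i) for ai, n1i, bi, n0i in strata)
den = sum(bi * n1i / (n1i + n0i) for ai, n1i, bi, n0i in strata)
adjusted_rr = num / den

pct_change = abs(crude_rr - adjusted_rr) / adjusted_rr
print(crude_rr, adjusted_rr, pct_change)      # ~0.86 vs 2.0: >10% change,
                                              # so age confounds the association
```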

Effects of Confounding
 May account for all or part of an apparent association.
 May cause an overestimate of the true association (positive confounding) or an underestimate of the association
(negative confounding).

Types of Confounding
 Residual confounding
o Residual confounding is the distortion that remains after controlling for confounding in the design
and/or analysis of a study.
o There are three causes of residual confounding:
1. There were additional confounding factors that were not considered, or there was no
attempt to adjust for them, because data on these factors was not collected.
2. Control of confounding was not tight enough. For example, a study of the association between physical activity and heart disease might control for confounding by age by a) restricting the study population to subjects between the ages of 30-80 or b) matching subjects by age within 20-year categories. In either event there might be persistent differences in age among the groups being compared. Residual differences in confounding might also occur in a randomized clinical trial if the sample size was small. In a stratified analysis or in a regression analysis there could be residual confounding because the data on the confounding variable was not precise enough, e.g., age was simply classified as "young" or "old".
3. There were many errors in the classification of subjects with respect to confounding
variables.

 Confounding by indication
o special type of confounding that can occur in observational (non-experimental) pharmaco-epidemiologic
studies of the effects and side effects of drugs.
o type of confounding arises from the fact that individuals who are prescribed a medication or who take a
given medication are inherently different from those who do not take the drug, because they are taking
the drug for a reason

 Reverse causality
o occurs when the outcome (or the disease process leading to it) influences the probability of the exposure being studied, rather than the exposure causing the outcome
o The case-control study by Perneger and Whelton may also have been affected by reverse causality.
Diabetes is a leading cause of renal failure in the US, and chronic diabetes is associated with a number of
other health problems such as cardiovascular diseases and infections that could result in a greater use of
analgesics. If so, the dialysis cases whose renal failure resulted from diabetes might have taken more
analgesics because of their diabetes. Nevertheless, it would appear that analgesic use was associated
with an increased risk of renal failure rather than vice versa.
Control of Confounding in Study Design

 Restriction
o This approach to controlling confounding is simple and effective, but it has several limitations:
 It reduces the number of subjects who are eligible (may cause sample size problem).
 Residual confounding can occur if you don't restrict narrowly enough. For example, in the study
on exercise and heart disease, the investigators might have restricted the study to men aged 40-
65. However, the age-related risk of heart disease still varies widely within this range as do
levels of physical activity.
 You can't evaluate the effects of factors that have been restricted for. For example, if the study
is limited to men aged 45-50, you can't use this study to examine the effects of gender or age
(because these factors don't vary within your sample).
 Restriction limits generalizability. For example, if you restrict the study to men, you may not be
able to generalize the findings to women.

 Matching
o Instead of restriction, one could also ensure that the study groups do not differ with respect to possible
confounders such as age and gender by matching the two comparison groups.
o For example, for every active male between the ages of 40-50, we could find and enroll an inactive male
between the ages of 40-50. In this way, the groups we are comparing can artificially be made similar
with respect to these factors, so they cannot confound the relationship. This method actually requires
the investigators to control confounding in both the design and analysis phases of the study, because
the analysis of matched study groups differs from that of unmatched studies.
o Like restriction, this approach is straightforward, and it can be effective.
o Following are the disadvantages:
 It can be time-consuming and expensive.
 It limits sample size.
 You can't evaluate the effect of the factors that you matched for.

Nevertheless, matching is useful in the following circumstances:


 When one needs to control for complex, multifaceted variables (e.g., heredity, environmental
factors)
 When doing a case-control study in which there are many possible controls, but a smaller
number of cases (e.g., 4:1 matching in the study examining the association between DES and
vaginal cancer)

 Randomization in Clinical Trials
o Randomly assigning subjects to treatment groups tends to distribute both known and unknown
confounders evenly across the groups; randomization is discussed further under Clinical Trial Design
below.


MODULE 6
Lesson 1: WHAT IS CAUSE?

What is Cause?

 An event, condition, or characteristic without which the disease would not have occurred (Kenneth Rothman)
 Something that makes a difference (M. Susser)
 Causality (or causation) is the relationship between an event (the cause) and a second event (the effect), where
the second event is understood as a consequence of the first (Wikipedia)

 To be a cause, the factor:


o Must precede the effect
o Can be either a host or environmental factor (e.g., characteristics, conditions, actions of individuals, events,
natural, social or economic phenomena)
o May be positive (presence of a causative exposure) or negative (lack of a preventive exposure)
 Hill's Criteria

 Strength of the association


 Before even considering causality, it is first necessary to establish that there is a valid
association. However, one should also consider the strength of the association. Strong
associations (e.g., risk ratio of 20 for heavy smoking and lung cancer) are unlikely to be entirely
explained by bias or confounding. While strength of the association should be considered, it is
not a requirement, since weak associations can also be causal.

 Consistency
 Associations are more likely to be causal if they are observed repeatedly by different
investigators, in different populations, and with different study designs. While replication
increases one's confidence that the relationship is causal, it is not a requirement for a judgment
of causality.

 Specificity
 Specificity means that a specific cause results in a specific outcome and that a specific outcome
results from a single cause. This is a throwback to Koch's postulates, and many epidemiologists
ignore this criterion because disease outcomes are generally multifactorial. For example, there
are many other factors besides smoking that influence one's risk of developing lung cancer. In
addition, smoking causes many other health problems besides lung cancer.

 Temporality
 In order for a causal factor to result in an outcome, it must precede the occurrence of the
outcome in time. This is the only criterion that is necessary for a judgment of causality.
Prospective cohort studies and prospective clinical trials provide stronger evidence of
temporality than retrospective studies or cross-sectional studies.

 Biological gradient
 Biological gradient means that there is a dose-response relationship between the cause and the
outcome, i.e. the probability of the outcome increases as the exposure level increases. For
example, the risk of lung cancer increases as the number of cigarettes smoked per day
increases. This criterion is not a necessary condition for a judgment of causality, because some
causal relationships have threshold doses, or exhibit other non-linear relationships to risk of the
outcome.
 Plausibility/Coherence
 This criterion is met if there is a known biological explanation or a plausible explanation for how
the exposure of interest might result in or contribute to the outcome of interest. For example,
we now know that there are many carcinogens in tobacco smoke, so it is certainly plausible that
inhalation of tobacco smoke might cause lung cancer. Moreover, carcinogens and free radicals
in tobacco smoke can be absorbed from the lungs and enter the blood stream, so it is plausible
that tobacco smoke might cause other adverse outcomes such as heart disease or other cancers.

Hill had a separate criterion called coherence, meaning that the causal relationship did not
conflict with other facts regarding the disease, but many feel that this is basically similar to
plausibility.

 Experiment
 This means that interventions (treatments or risk factor modifications) have predictable effects
on the occurrence of disease. For example, getting smokers to quit smoking will reduce their risk
of getting a variety of smoking-related diseases, or getting an obese person to exercise and lose
weight will reduce their risk of type II diabetes.

 Analogy
 Analogy in this setting means that there are similar cause-effect relationships. For example, if it
is widely accepted that certain drugs taken during pregnancy can cause birth defects, then it is
easier to accept the possibility that a new drug might also cause birth defects.

 Several observations that are part of the nine criteria of Hill (1965) support an observed association representing
a causal hypothesis:
1. consistency in observational and experimental studies;
2. the cause precedes the effect in time (temporality);
3. the observation of an exposure–response relationship, in which increasing levels of exposure increase or decrease
the likelihood of the outcome as expected; and
4. biological plausibility of the basic science supports the causal relationship

Lesson 2: EXPERIMENTAL RESEARCH METHODS/DESIGNS

Experimental Research Methods


 Have been developed to reduce biases of all kinds as much as possible
 Formally surfaced in educational psychology around the turn of the century, with the classic studies by
Thorndike and Woodworth on transfer (Cronbach, 1957)

What is Experimental Research?


The experimenter’s interest in the effect of environmental change, referred to as “treatments,” demanded designs using
standardized procedures to hold all conditions constant except the independent (experimental) variable. This
standardization ensured high internal validity (experimental control) in comparing the experimental group to the control
group on the dependent or “outcome” variable. That is, when internal validity was high, differences between groups
could be confidently attributed to the treatment, thus ruling out rival hypotheses attributing effects to extraneous
factors. Traditionally, experimenters have given less emphasis to external validity, which concerns the generalizability of
findings to other settings, particularly realistic ones.

Experimental design
 often touted as the most "rigorous" of all research designs or, as the standard against which all other designs
are judged
 If you can implement an experimental design well (and that is a big "if" indeed), then the experiment is probably the
strongest design with respect to internal validity (Note: internal validity is at the center of all causal or cause-
effect inferences)

Random assignment
 method for assigning cases (e.g., individuals) to groups (e.g., experimental and control) for the purpose of
making comparisons, in order to increase one's confidence that the groups do not differ in a systematic way
 A researcher begins with a collection of cases and then divides the cases into two or more groups using a
random mathematical process (a minimal sketch appears after this list)
 A useful nonrandom technique for assigning cases to groups:
o Matching
 A process whereby a researcher deliberately assigns cases to groups based upon relevant
characteristics (a characteristic is considered relevant if it could in any way affect the dependent
variable during the course of the experiment) in order to create similar groups for comparison
purposes.
 Disadvantage: Individual cases differ in thousands of ways, and the researcher cannot know
which might be relevant.
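
A minimal sketch of random assignment using Python's standard library; the subject IDs, group count, and seed are
hypothetical:

import random

def randomly_assign(cases, n_groups=2, seed=42):
    # Shuffle a copy of the case list, then deal the cases into groups
    # round-robin, so group sizes differ by at most one.
    rng = random.Random(seed)   # fixed seed makes the allocation reproducible
    shuffled = cases[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]

subjects = list(range(1, 21))   # 20 hypothetical subject IDs
experimental, control = randomly_assign(subjects)
print("experimental:", sorted(experimental))
print("control:", sorted(control))

Because every ordering of the list is equally likely, each subject has the same chance of landing in either group,
which is what removes systematic assignment differences.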

Symbols used in Experimental Research Design

O = Observation or Measurement (e.g., mathematics score, score on an attitude scale, weight of subjects, etc.)

O1, O2, O3 ... On = more than one observation or measurement

R = Random assignment: subjects are randomly assigned to the various groups

X = Treatment, which may be a teaching method, counselling technique, reading strategy, frequency of questioning, and so forth

Hierarchy of Study Design


The Experimental Logic

 The Language of Experiments


 Parts of the Experiment (Not all experiments have all these parts, and some have all seven parts plus others.)
1. Independent Variable (stimulus or treatment)
 A condition or treatment introduced into the experiment.
2. Dependent Variable
 Dependent variables or outcomes are physical conditions, social behaviors, attitudes,
feelings, or beliefs of subjects that change in response to a treatment.
3. Pretest
 The measurement of the dependent variable prior to introduction of the independent
variable.
4. Posttest
 The measurement of the dependent variable after the introduction of the independent
variable.
5. Experimental Group
 The group that receives the independent variable.
6. Control Group
 The group that does not receive the independent variable.
7. Random Assignment
 The random division of subjects into the experimental and control groups (discussed above).

Control in Experiments
 Control is crucial in experimental research. Aspects of an experimental situation that are not controlled by the
researcher offer alternative explanations (other than the treatment) for change in the dependent variable and
undermine the ability to establish causality.
 Techniques to Establish Control in Experiments
o Deception
 Occurs when the researcher intentionally misleads subjects through written or verbal
instructions, the actions of others (e.g., confederates or stooges), or aspects of the setting.

Types of Experimental Research Design

1. Classical Experimental Design


 All designs are variations of the classical experimental design, the type of design discussed so far, which
has random assignment, a pretest and a posttest, an experimental group, and a control group.

2. Pre-Experimental Designs
 Weak designs
 Some designs lack random assignment and are compromises or shortcuts. These pre-experimental
designs are used in situations where it is difficult to use the classical design. They have weaknesses that
make inferring a causal relationship more difficult.

o One-Shot Case Study Design


 Also called the one-group posttest-only design, the one-shot case study design has only
one group, a treatment, and a posttest. Because there is only one group, there is no
random assignment. A weakness of this design is that it is difficult to say for sure that
the treatment caused changes in the dependent variable. If subjects were the same before and
after the treatment, the researcher would not know it.
o One-Group Pretest-Posttest Design
 This design has one group, a pretest, a treatment, and a posttest. It lacks a control group
and random assignment. This is an improvement over the one-shot case study because
the researcher measures the dependent variable both before and after the treatment.
But it lacks a control group. The researcher cannot know whether something other than
the treatment occurred between the pretest and the posttest to cause the outcome.
o Static Group Comparison (Posttest Only)
 Also called the posttest-only nonequivalent group design, the static group comparison has
two groups, a treatment, and a posttest. It lacks random assignment and a pretest. A
weakness is that any posttest outcome difference between the groups could be due to
group differences prior to the experiment instead of to the treatment.

3. Quasi-Experimental and Special Designs


 These designs, like the classical design, make identifying a causal relationship more certain than do pre-
experimental designs. Quasi-experimental designs help researchers test for causal relationships in a
variety of situations where the classical design is difficult or inappropriate. They are called quasi because
they are variations of the classical experimental design. Some have randomization but lack a pretest,
some use more than two groups, and others substitute many observations of one group over time for a
control group. In general, the researcher has less control over the independent variable than in the
classical design.
o Types of Quasi-Experimental Designs
1. Two-Group Posttest-Only Design
 This is identical to the static group comparison, with one exception: The groups
are randomly assigned. It has all the parts of the classical design except a
pretest. The random assignment reduces the chance that the groups differed
before the treatment, but without a pretest, a researcher cannot be as certain
that the groups began the same on the dependent variable.
2. Interrupted Time Series
 In an interrupted time series design, a researcher uses one group and makes
multiple measures of the dependent variable both before and after the treatment.
3. Equivalent Time Series
 An equivalent time series is another one-group design that extends over a time
period. Instead of one treatment, it has a pretest, then a treatment and
posttest, then treatment and posttest, then treatment and posttest, and so on.

4. Latin Square Designs


1. Researchers interested in how several treatments given in different sequences or time orders affect a
dependent variable can use a Latin square design.
o Other special designs:
A. Solomon Four-Group Design
 A researcher may believe that the pretest measure has an influence on the
treatment or dependent variable. A pretest can sometimes sensitize subjects to the
treatment or improve their performance on the posttest.
 Richard L. Solomon (1949) developed the Solomon four-group design to address the
issue of pretest effects. It combines the classical experimental design with the two-
group posttest-only design and randomly assigns subjects to one of four groups.
 Because it has the same two groups as the pretest-posttest control-group design, it
has the same protection against threats to internal validity
 The main advantage gained by adding the two additional groups relates to external
validity
 One problem is that there is no statistical test that can treat all six sets of data at
the same time
The Solomon Four-Group Experimental Design

B. Factorial Designs
 Sometimes, a research question suggests looking at the simultaneous effects of
more than one independent variable. A factorial design uses two or more
independent variables in combination. Every combination of the categories in
variables (sometimes called factors) is examined. When each variable contains
several categories, the number of combinations grows very quickly. The treatment
or manipulation is not each independent variable; rather, it is each combination of
the categories.
 Most efficient in experiments involving the study of the effects of two or more factors
 All possible combinations of the levels of the factors are investigated in each
replication
 If there are a levels of factor A and b levels of factor B, then each replicate contains
all ab treatment combinations.
 Main effects:
 The main effect of a factor is defined to be the change in response produced
by a change in the level of that factor.
 The main effect of A is the difference between the average response at A1
and the average response at A2

                    FACTOR B
                    B1      B2
FACTOR A     A1     20      30
             A2     40      52

Here the main effect of A is (40 + 52)/2 - (20 + 30)/2 = 21, and the main effect of B is (30 + 52)/2 - (20 + 40)/2 = 11.

 Interaction:
 Interaction among factors occurs when the difference in response between the
levels of one factor is not the same at all levels of the other
factor
 the failure of the response to treatments of one factor to be the same for
each level of another factor
 When the simple effects of a factor differ by more than can be attributed to
chance
 In the table below, at B1 the A effect is 50 - 20 = 30, while at B2 the A effect is
12 - 40 = -28; because these simple effects differ, factors A and B interact.

                    FACTOR B
                    B1      B2
FACTOR A     A1     20      40
             A2     50      12
o Example of interactions:
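
A minimal sketch of these computations in Python, using the two hypothetical 2 x 2 tables above:

def effects(y11, y12, y21, y22):
    # yIJ = response at level I of factor A and level J of factor B
    main_A = (y21 + y22) / 2 - (y11 + y12) / 2   # average at A2 minus average at A1
    main_B = (y12 + y22) / 2 - (y11 + y21) / 2   # average at B2 minus average at B1
    # Interaction: half the difference between the simple A effects at B2 and at B1
    interaction = ((y22 - y12) - (y21 - y11)) / 2
    return main_A, main_B, interaction

print(effects(20, 30, 40, 52))   # first table:  (21.0, 11.0, 1.0)   -> negligible interaction
print(effects(20, 40, 50, 12))   # second table: (1.0, -9.0, -29.0)  -> strong interaction

Note how in the second table the averaged main effects are small even though the simple effects (30 and -28) are
large and opposite in sign; this is exactly the misleading situation that examining interactions guards against.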

 Advantages of Factorials
 They are more efficient than one-factor-at-a-time experiments.
 A factorial design is necessary when interactions may be present to avoid
misleading conclusions.
 Factorial designs allow the effects of a factor to be estimated at several
levels of the other factors, yielding conclusions that are valid over a range of
experimental conditions.

Clinical Trial Design


(Taken from https://newonlinecourses.science.psu.edu/stat509/node/19/. Accessed: 29 September 2019)

Good trial design and conduct are far more important than selecting the correct statistical analysis. When a trial is well
designed and properly conducted, statistical analyses can be performed, modified, and if necessary, corrected. On the
other hand, inaccuracy (bias) and imprecision (large variability) in estimating treatment effects, the two major
shortcomings of poorly designed and conducted trials, cannot be ameliorated after the trial. Skillful statistical analysis
cannot overcome basic design flaws.
Piantadosi (2005) lists the following advantages of proper design:

1. Allows investigators to satisfy ethical constraints
2. Permits efficient use of scarce resources
3. Isolates the treatment effect of interest from confounders
4. Controls precision
5. Reduces selection bias and observer bias.
6. Minimizes and quantifies random error or uncertainty
7. Simplifies and validates the analysis
8. Increases the external validity of the trial

The objective of most clinical trials is to estimate the magnitude of treatment effects or estimate differences in
treatment effects. Precise statements about observed treatment effects are dependent on a study design that allows
the treatment effect to be sorted out from person-to-person variability in response. An accurate estimate requires a
study design that minimizes bias.

Piantadosi (2005) states that clinical trial design should accomplish the following:
1. Quantify and reduce errors due to chance
2. Reduce or eliminate bias
3. Yield clinically relevant estimates of effects and precision
4. Be simple in design and analysis
5. Provide a high degree of credibility, reproducibility, and external validity
6. Influence future clinical practice

Controlled Clinical Trials Compared to Observational Studies

Medical research, as a scientific investigation, is based on careful observation and theory. Theory directs the observation
and provides a basis for interpreting the results. The strength of the evidence from a clinical study is proportional to
the degree of control of bias and variability when the study was conducted, as well as the magnitude of the observed
effect. Clinical studies can be characterized as uncontrolled observations, comparative observational studies, and
controlled clinical trials.

Case reports and case-series are uncontrolled observational studies.

A case report only demonstrates that a clinical event of interest is possible. In a case report, there is no control of
treatment assignment, endpoint ascertainment, or confounders. There is no control group for the sake of comparison.
The report is descriptive in nature, not a formal statistical analysis.

Case reports are useful in generating hypotheses for future testing. For example, a physician may report that a patient in
his practice, who was taking a specific anorexic drug, developed primary pulmonary hypertension (PPH), a rare condition
that occurs in 1-2 out of every million Americans. Is this convincing evidence that the anorexic drug causes PPH?

A case series carries more weight than a single case report but cannot prove efficacy of a treatment. Case series and
case reports are susceptible to large selection biases. Consider the example of laetrile, an apricot pit extract that was
reputed to cure cancer. Seven case series were reported; the strength of evidence from these studies has been
summarized by US National Cancer Institute (NCI). While a proportion of patients may have experienced spontaneous
remission of cancer, rigorous testing in controlled environments was never performed. After an estimated 70,000
patients had been treated, the NCI undertook a retrospective analysis of laetrile only to decide no definite conclusions
supporting anti-cancer activity could be made (Ellison 1978 abstract). The Cochrane review on laetrile (2015), states,
“there is no reliable evidence for the alleged effects of laetrile or amygdalin for curative effects in cancer patients.” Based
on a series of reported cases, many believed laetrile would cure their cancer, perhaps refusing other effective
treatments, and subjecting themselves to adverse effects of cyanide, for many years, this continued for many years with
anti-tumor efficacy of laetrile unsupported while associated adverse effects were coming to light.
A database analysis is similar to a case series, but may have a control group, depending on the data source. The source
and quality of the data used for this secondary analysis is key. If the analysis attempts to evaluate treatment differences
from data in which treatment assignment was based on physician and patient discretion, nonrandomized and open-
label, bias is likely.

Databases are best used to study patterns with exploratory statistical analyses. For example, the NIH sponsored a
database analysis of interstitial cystitis (IC) during the 1990’s. This consisted of data from over 400 individuals with IC
who underwent various and numerous therapies for their condition. The objective of the database analysis was to
determine if there were patterns of treatments that may be effective in treating the disease. (Rovner et al. 2000).

As another example, in the case of genomic research, specific data mining tools have been developed to search for
patterns in large databases of genetic data, leading to the discovery of particular candidate genes.

An epidemiologic study is often a case-control or a cohort design, both comparative observational studies. An
observational study lacks the key component of an experiment, namely, control over treatment assignment. Commonly
these designs are used in assessing the influence of risk factors for a disease. Subjects meeting entrance criteria may
have been identified through a database search. The choice of the control group is a crucial design component in
observational studies.

In a case-control study, the investigator identifies cases (subjects with the disease) and controls (subjects without the
disease) and retrospectively assesses some type of treatment or exposure. Because the investigator has selected the
cases and controls, relative risk cannot be calculated directly from a case-control study.
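
Instead, case-control studies report the odds ratio, which approximates the relative risk when the disease is rare. A
minimal sketch with hypothetical counts (the Woolf confidence interval shown is one common choice):

import math

def odds_ratio(a, b, c, d):
    # a = exposed cases,    b = unexposed cases,
    # c = exposed controls, d = unexposed controls
    or_hat = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)        # SE of log(OR), Woolf method
    lo = math.exp(math.log(or_hat) - 1.96 * se)
    hi = math.exp(math.log(or_hat) + 1.96 * se)
    return or_hat, (lo, hi)

# Hypothetical data: 30 of 100 cases exposed, 15 of 100 controls exposed
print(odds_ratio(30, 70, 15, 85))   # OR ~ 2.43, 95% CI ~ (1.21, 4.87)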

In addition, levels of treatment or exposure may be recorded based on a subject's recall of events that occurred many
years previously; thus recall bias (systematic differences in accuracy or completeness of recall) can affect the study
results.

In a prospective cohort study, individuals are followed forward in time with subsequent evaluations to determine which
individuals develop into cases. The relationship of specific risk factors that were measured at baseline with the
subsequent outcome is assessed. The cohort study may consist of one or more samples with particular risk factors, called
cohorts. It is possible to control some sources of bias in a prospective cohort study by following standard procedures in
collecting data and ascertaining endpoints. Since the subjects are not assigned risk factors in a randomized manner,
however, there may remain covariates that are confounded with a risk factor. Sometimes, a particular treatment group
(or groups) from a randomized trial is followed as a cohort, providing a cohort in which the treatment was assigned at
random.

Prospective studies tend to have fewer design problems and less bias than retrospective studies, but they are more
expensive with respect to time and cost.
An example of a case-control study: A cardiologist identifies 36 patients currently in his practice with a specific form of
cardiac valve disease. He identifies another group of relatively healthy patients and matches two of them to each of the
patients with cardiac valve disease according to age (± 5 years) and BMI (± 2.5). He plans to interview all 36 + 72 = 108
patients to assess their use of diet drugs during the past ten years.

A classic example of a cohort study: U.S. National Heart Lung and Blood Institute Framingham Heart Study

Piantadosi (2005) lists the following conditions for convincing non-experimental comparative studies:

1. The treatment of interest occurs naturally.
2. The study subjects provide valid observations for the biological question.
3. The natural history of the disease with standard therapy, or in the absence of therapy, is known.
4. The effect of the treatment is large enough to overshadow random error and bias.
5. Evidence of efficacy is consistent with biological knowledge.

A controlled clinical trial contains all of the key components of a true experimental design. Treatments are assigned by
design; administration of treatment and endpoint ascertainment follows a protocol. When properly designed and
conducted, especially with the use of randomization and masking, the controlled clinical trial instills confidence that bias
has been minimized. Replication of a controlled clinical trial, if congruent with the results of the first clinical trial,
provides verification.

 Experimental Design Terminology


In experimental design terminology, the "experimental unit" is randomized to the treatment regimen and receives the
treatment directly. The "observational unit" has measurements taken on it. In most clinical trials, the experimental units
and the observational units are one and the same, namely, the individual patient.

One exception to this is a community intervention trial in which communities, e.g., geographic regions, are randomized
to treatments. For example, communities (experimental units) might be randomized to receive different formulations of
a vaccine, whereas the effects are measured directly on the subjects (observational units) within the communities. The
advantages here are strictly logistical - it is simply easier to implement in this fashion. Another example occurs in
reproductive toxicology experiments in which female rodents are exposed to a treatment (experimental units) but
measurements are taken on the pups (observational units).

In experimental design terminology, factors are variables that are controlled and varied during the course of the
experiment. For example, treatment is a factor in a clinical trial with experimental units randomized to treatment.
Another example is pressure and temperature as factors in a chemical experiment.

Most clinical trials are structured as one-way designs, i.e., only one factor, treatment, with a few levels.

Temperature and pressure in the chemical experiment are two factors that comprise a two-way design in which it is of
interest to examine various combinations of temperature and pressure. Some clinical trials may have a two-way
factorial design, such as in oncology where various combinations of doses of two chemotherapeutic agents comprise
the treatments. An incomplete factorial design may be useful if it is inappropriate to assign subjects to some of the
possible treatment combinations, such as no treatment (double placebo). We will study factorial designs in a later
lesson.

A parallel design refers to a study in which patients are randomized to a treatment and remain on that treatment
throughout the course of the trial. This is a typical design. In contrast, with a crossover design patients are randomized to
a sequence of treatments and they cross over from one treatment to another during the course of the trial. Each
treatment occurs in a time period with a washout period in between. Crossover designs are of interest since, with each
patient serving as his or her own control, there is potential for reduced variability. However, there are potential problems
with this type of design. There should be investigation into possible carryover effects, i.e., the residual effects of the
previous treatment affecting the subject's response in the later treatment period. In addition, only conditions that are likely
to be similar in both treatment periods are amenable to crossover designs. Acute health problems that do not recur are
not well-suited for a crossover study.

Randomization is used to remove systematic error (bias) and to justify Type I error probabilities in experiments.
Randomization is recognized as an essential feature of clinical trials for removing selection bias.

Selection bias occurs when a physician decides treatment assignment and systematically selects a certain type of patient
for a particular treatment. Suppose the trial consists of an experimental therapy and a placebo. If the physician assigns
the healthier patients to the experimental therapy and the less healthy patients to the placebo, the study could result in
an invalid conclusion that the experimental therapy is very effective.

Blocking and stratification are used to control unwanted variation. For example, suppose a clinical trial is structured to
compare treatments A and B in patients between the ages of 18 and 65. Suppose that the younger patients tend to be
healthier. It would be prudent to account for this in the design by stratifying with respect to age. One way to achieve this
is to construct age groups of 18-30, 31-50, and 51-65 and to randomize patients to treatment within each age group.

Age        Treatment A    Treatment B
18-30           12              13
31-50           23              23
51-65            6               7

It is not necessary to have the same number of patients within each age stratum. We do, however, want to have balance
in the number on each treatment within each age group. This is accomplished by blocking, in this case, within the age
strata. Blocking is a restriction of the randomization process that results in a balance of the numbers of patients on each
treatment after a prescribed number of randomizations. For example, blocks of 4 within these age strata would mean
that after 4, 8, 12, etc. patients in a particular age group had entered the study, the numbers assigned to each treatment
within that stratum would be equal.
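
A minimal sketch of stratified randomization with permuted blocks of 4; the strata follow the example above, while
the block counts and seeds are hypothetical:

import random

def blocked_assignments(n_blocks, block_size=4, seed=0):
    # Each block holds equal numbers of A and B in random order, so the two
    # arms are balanced after every block_size randomizations.
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_blocks):
        block = ["A"] * (block_size // 2) + ["B"] * (block_size // 2)
        rng.shuffle(block)
        schedule.extend(block)
    return schedule

# One independent randomization schedule per age stratum
strata = {"18-30": 3, "31-50": 6, "51-65": 2}   # blocks per stratum (hypothetical)
for i, (stratum, n_blocks) in enumerate(strata.items()):
    print(stratum, blocked_assignments(n_blocks, seed=i))

After every complete block of 4 within a stratum, exactly half of that stratum's patients are on each treatment, which
is the balance property described above.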

If the numbers are large enough within a stratum, a planned subgroup analysis may be performed. In the example, the
smaller numbers of patients in the upper and lower age groups would require care in the analyses of these sub-groups
specifically. However, with the primary question as the effect of treatment regardless of age, the pooled data in which
each sub-group is represented in a balanced fashion would be utilized for the main analysis.

Even ineffective treatments can appear beneficial in some patients. This may be due to random fluctuations, or
variability in the disease. If, however, the improvement is due to the patient’s expectation of a positive response, this is
called a "placebo effect”. This is especially problematic when the outcome is subjective, such as pain or symptom
assessment. Placebo effect is widely recognized and must be removed in any clinical trial. For example, rather than
constructing a nonrandomized trial in which all patients receive an experimental therapy, it is better to randomize
patients to receive either the experimental therapy or a placebo. A true placebo is an inert or inactive treatment that
mimics the route of administration of the real treatment, e.g., a sugar pill.

Placebos are not acceptable ethically in many situations, e.g., in surgical trials. (Although there have been instances
where 'sham' surgical procedures took place as the 'placebo' control.) When an accepted treatment already exists for a
serious illness such as cancer, the control must be an active treatment. In other situations, a true placebo is not
physically possible to attain. For example, a few trials investigating dimethyl sulfoxide (DMSO) for providing muscle pain
relief were conducted in the 1970’s and 1980’s. DMSO is rubbed onto the area of muscle pain, but leaves a garlicky taste
in the mouth, so it was difficult to develop a placebo.
Treatment masking or blinding is an effective way to ensure objectivity of the person measuring the outcome variables.
Masking is especially important when the measurements are subjective or based on self-assessment. Double-masked
trials refer to studies in which both investigators and patients are masked to the treatment. Single-masked trials refer
to the situation when only patients are masked. In some studies, statisticians are masked to treatment assignment when
performing the initial statistical analyses, i.e., not knowing which group received the treatment and which is the control
until analyses have been completed. Even a safety-monitoring committee may be masked to the identity of treatment A
or B, until there is an observed trend or difference that should evoke a response from the monitors. In executing a
masked trial great care will be taken to keep the treatment allocation schedule securely hidden from all except those
with a need to know which medications are active and which are placebo. This could be limited to the producers of the
study medications, and possibly the safety monitoring board before study completion. There is always a caveat for
breaking the blind for a particular patient in an emergency situation.

As with placebos, masking, although highly desirable, is not always possible. For example, one could not mask a surgeon
to the procedure he is to perform. Even so, some have gone to great lengths to achieve masking. For example, a few
trials with cardiac pacemakers have consisted of every eligible patient undergoing a surgical procedure to be implanted
with the device. The device was "turned on" in patients randomized to the treatment group and "turned off" in patients
randomized to the control group. The surgeon was not aware of which devices would be activated.

Investigators often underestimate the importance of masking as a design feature. This is because they believe that
biases are small in relation to the magnitude of the treatment effects (when the converse usually is true), or that they
can compensate for their prejudice and subjectivity.

Confounding is the effect of other relevant factors on the outcome that may be incorrectly attributed to the difference
between study groups.

Here is an example: An investigator plans to assign 10 patients to treatment and 10 patients to control. There will be a
one-week follow-up on each patient. The first 10 patients will be assigned treatment on March 01 and the next 10
patients will be assigned control on March 15. The investigator may observe a significant difference between treatment
and control, but is it due to different environmental conditions between early March and mid-March? The obvious way
to correct this would be to randomize 5 patients to treatment and 5 patients to control on March 01, followed by
another 5 patients to treatment and 5 patients to control on March 15.

Validity
A trial is said to possess internal validity if the observed difference in outcome between the study groups is real and not
due to bias, chance, or confounding. Randomized, placebo-controlled, double-blinded clinical trials have high levels of
internal validity.

External validity in a human trial refers to how well study results can be generalized to a broader population. External
validity is irrelevant if internal validity is low. External validity in randomized clinical trials is enhanced by using broad
eligibility criteria when recruiting patients.

 Clinical Trial Phases


When a drug, procedure, or treatment appears safe and effective based on preclinical studies, it can be considered for
trials in humans. Clinical studies of experimental drugs, procedures, or treatments in humans have been classified into
four phases (Phase I, Phase II, Phase III, and Phase IV) based on the terminology used when pharmaceutical companies
interact with the U.S. FDA. Greater numbers of patients are assigned to treatment in each successive phase.

Phase 0 represents pre-clinical testing in animals to obtain pharmacokinetic information.

Phase I trials investigate the effects of various dose levels on humans. The studies are usually done in a small number of
volunteers (sometimes persons without the disease of interest or patients with few remaining treatment options) who
are closely monitored in a clinical setting. The purpose is to determine a safe dosage range and to identify any common
side effects or readily apparent safety concerns. Data may be collected to provide a description of the pharmacokinetics
and pharmacodynamics of the compound, estimate the maximum tolerated dose (MTD), or evaluate the effects of
multiple dose levels. Many trials in the early stage of therapy development either investigate treatment mechanism
(TM) or incorporate dose-finding (DF) strategies.

To a pharmacologist, a TM trial is a pharmacokinetics study in which an attempt is made to investigate the
bioavailability of the drug at various sites in the human system. To a surgeon, a TM study investigates the operative
procedure. A DF trial usually tries to determine the maximum tolerated dose, or the minimum effective dose, etc. Thus,
phase I (drug) trials can be considered TM and DF trials.

A Phase II trial typically investigates preliminary evidence of efficacy and continues to monitor safety. A Phase II trial
may be the first time that the agent is administered to patients with the disease of interest to answer questions such as:
What is the correct dosage for efficacy and safety in patients of this type? What is the probability a patient treated with
the compound will benefit from the therapy or experience an adverse effect? Most trials in the middle stage of therapy
development investigate safety and efficacy (SE). The experimental drug or treatment is administered to as many as
several hundred patients in Phase II trials.

At the end of Phase II, a decision will be made as to whether or not the drug is promising, and development should
continue. In the U.S. there will be an ‘End of Phase II’ meeting between the pharmaceutical company and the FDA to
discuss safety and plans for Phase III studies. Ineffective or unsafe compounds should not proceed into Phase III trials.

A Phase III trial is a rigorous clinical trial with randomization, one or more control groups and definitive clinical
endpoints. Phase III trials are often multi-center, accumulating the experience of thousands of patients. Phase III trials
address questions of comparative treatment efficacy (CTE). A CTE trial involves a placebo and/or active control group so
that precise and valid estimates of differences in clinical outcomes attributable to the investigational therapy can be
assessed.

If things go well during Phase III, the company with the license for the compound will submit an application for
approval to market the drug. U.S. FDA approval hinges on ‘adequate and well-controlled’ pivotal Phase III studies that
are convincing of safety and efficacy.

A phase IV trial or expanded safety trial occurs after regulatory approval of the new therapy. As usage of the new drug
becomes widespread, there is an opportunity to learn about rare side effects and interactions with other therapies. An
expanded safety (ES) study can provide important information that was not apparent during the drug development. For
example, a few thousand patients might be involved in all of the SE and CTE trials for a particular therapy. An ES study,
however, could involve >10,000 patients. Such large sample sizes can detect more subtle safety problems for the
therapy, if such problems exist. Some Phase IV studies will have a marketing objective for the company as well as
collecting safety data.

The terminology of phase I, II, III, and IV trials does not work well for non-pharmacologic treatments and does not
account for translational trials.

Some studies performed prior to large scale clinical trials are characterized as translational studies. Translational studies
have as their primary outcome a biological measurement or target that has been derived from an accepted model of the
disease process. The results of the translational study may provide evidence of a mechanism of action for a compound.
Target validation can be an objective of such a study. Large effects on the target are sought. For example, a large change
in the level of a protein, or the activity of an enzyme might support therapeutic activity of a compound. There is an
understanding that translational work may cycle from preclinical lab to a clinical setting and back again. Although the
translational studies have a written protocol, the treatment may be modified during the study. The protocol should
clearly define what would be considered ‘lack of effect’ and the next experimental step for any possible outcome of the
trial.
 Other considerations

Some therapies are not developed in the same manner as drugs, such as disease prevention therapies, vaccines,
biologicals, surgical techniques, medical devices, and diagnostic agents.

Prevention trials are conducted in:

1. healthy individuals to determine if the therapy prevents the onset of disease,
2. patients with early-stage disease to determine if the therapy prevents progression, or
3. patients with the disease to determine if the therapy prevents additional episodes of disease expression.

Vaccine investigations are a type of primary prevention trial. They require large numbers of patients and are very costly
because of the numbers and the length of follow-up that is required.

The objective of a diagnostic or screening trial is to determine if an agent can “diagnose” the presence of disease.
Usually, the agent is compared to a “gold standard” diagnostic that is assumed to be perfectly accurate in its diagnosis.
The advantage of the newer diagnostic agent is lower expense or a less invasive procedure.
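
Agreement with the gold standard is usually summarized by sensitivity and specificity. A minimal sketch with
hypothetical counts:

def diagnostic_accuracy(tp, fp, fn, tn):
    # Counts of the new test's results classified against the gold-standard diagnosis
    sensitivity = tp / (tp + fn)   # P(test positive | disease present)
    specificity = tn / (tn + fp)   # P(test negative | disease absent)
    return sensitivity, specificity

# Hypothetical: 90 true positives, 10 false negatives, 15 false positives, 185 true negatives
print(diagnostic_accuracy(tp=90, fp=15, fn=10, tn=185))   # (0.90, 0.925)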

Crossover Designs
(Taken from https://newonlinecourses.science.psu.edu/stat509/node/123/. Accessed: 29 September 2019)

A crossover design is a repeated measurements design such that each experimental unit (patient) receives different
treatments during the different time periods, i.e., the patients cross over from one treatment to another during the
course of the trial. This is in contrast to a parallel design in which patients are randomized to a treatment and remain on
that treatment throughout the duration of the trial.

The reason to consider a crossover design when planning a clinical trial is that it could yield a more efficient comparison
of treatments than a parallel design, i.e., fewer patients might be required in the crossover design in order to attain the
same level of statistical power or precision as a parallel design. (This will become more evident later in this lesson.)
Intuitively, this seems reasonable because each patient serves as his/her own matched control. Every patient receives
both treatment A and B. Crossover designs are popular in medicine, agriculture, manufacturing, education, and many
other disciplines. A comparison is made of the subject's response on A vs. B.

Although the concept of patients serving as their own controls is very appealing to biomedical investigators, crossover
designs are not preferred routinely because of the problems that are inherent with this design. In medical clinical trials
the disease should be chronic and stable, and the treatments should not result in total cures but only alleviate the
disease condition. If treatment A cures the patient during the first period, then treatment B will not have the
opportunity to demonstrate its effectiveness when the patient crosses over to treatment B in the second period.
Therefore, this type of design works only for those conditions that are chronic, such as asthma where there is no cure
and the treatments attempt to improve quality of life.

Crossover designs are the designs of choice for bioequivalence trials. The objective of a bioequivalence trial is to
determine whether test and reference pharmaceutical formulations yield equivalent blood concentration levels. In these
types of trials we are not interested in whether there is a cure; the aim is to demonstrate that a new formulation (for
instance, a new generic drug) results in the same concentration in the blood system. Thus, it is highly desirable to
administer both formulations to each subject, which translates into a crossover design.

 Overview of Crossover Designs


The order of treatment administration in a crossover experiment is called a sequence and the time of a treatment
administration is called a period. Typically, the treatments are designated with capital letters, such as A, B, etc.
The sequences should be determined a priori, and the experimental units are randomized to sequences. The most
popular crossover design is the 2-sequence, 2-period, 2-treatment crossover design, with sequences AB and BA,
sometimes called the 2 × 2 crossover design.

In this particular design, experimental units that are randomized to the AB sequence receive treatment A in the first
period and treatment B in the second period, whereas experimental units that are randomized to the BA sequence
receive treatment B in the first period and treatment A in the second period.

We express this particular design as AB|BA or diagram it as:


[Design 1] Period 1 Period 2
Sequence AB A B
Sequence BA B A
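
A minimal sketch of the standard period-adjusted estimate of the treatment effect in this 2 x 2 design, assuming no
differential carryover; the response values are hypothetical:

def crossover_effect(ab_pairs, ba_pairs):
    # ab_pairs: (period 1, period 2) responses for subjects on sequence AB
    # ba_pairs: (period 1, period 2) responses for subjects on sequence BA
    d_ab = [p1 - p2 for p1, p2 in ab_pairs]   # each estimates (A - B) plus the period effect
    d_ba = [p1 - p2 for p1, p2 in ba_pairs]   # each estimates (B - A) plus the period effect
    mean = lambda xs: sum(xs) / len(xs)
    # Half the difference of the two sequence means cancels the period effect:
    return (mean(d_ab) - mean(d_ba)) / 2      # estimate of A - B

ab = [(12, 9), (14, 10), (11, 9)]   # hypothetical responses
ba = [(8, 11), (9, 13), (10, 12)]   # hypothetical responses
print(crossover_effect(ab, ba))     # 3.0, i.e., treatment A exceeds B by about 3 units

Each patient contributes a within-subject difference, which is why the crossover can need fewer patients than a
parallel design for the same precision.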

Examples of 3-period, 2-treatment crossover designs are:

[Design 2] Period 1 Period 2 Period 3


Sequence ABB A B B
Sequence BAA B A A

and

[Design 3] Period 1 Period 2 Period 3


Sequence AAB A A B
Sequence ABA A B A
Sequence BAA B A A

Examples of 3-period, 3-treatment crossover designs are

[Design 4] Period 1 Period 2 Period 3


Sequence ABC A B C
Sequence BCA B C A
Sequence CAB C A B

and

[Design 5] Period 1 Period 2 Period 3


Sequence ABC A B C
Sequence BCA B C A
Sequence CAB C A B
Sequence ACB A C B
Sequence BAC B A C
Sequence CBA C B A

Some designs even incorporate non-crossover sequences such as Balaam's design:

[Design 6] Period 1 Period 2


Sequence AB A B
Sequence BA B A
Sequence AA A A
Sequence BB B B

Balaam’s design is unusual, with elements of both parallel and crossover design. There are advantages and
disadvantages to all of these designs.
 Disadvantages

The main disadvantage of a crossover design is that carryover effects may be aliased (confounded) with direct treatment
effects, in the sense that these effects cannot be estimated separately. You think you are estimating the effect of
treatment A but there is also a bias from the previous treatment to account for. Significant carryover effects can bias the
interpretation of data analysis, so an investigator should proceed cautiously whenever he/she is considering the
implementation of a crossover design.

A carryover effect is defined as the effect of the treatment from the previous time period on the response at the current
time period. In other words, if a patient receives treatment A during the first period and treatment B during the second
period, then measurements taken during the second period could be a result of the direct effect of treatment B
administered during the second period, and/or the carryover or residual effect of treatment A administered during the
first period. These carryover effects yield statistical bias.

What can we do about this carryover effect?

The incorporation of lengthy washout periods in the experimental design can diminish the impact of carryover effects. A
washout period is defined as the time between treatment periods. Instead of the first treatment stopping and the new
treatment starting immediately, there is a period of time during which the drug from the first period is washed
out of the patient's system.

The rationale for this is that the previously administered treatment is “washed out” of the patient and, therefore, it
cannot affect the measurements taken during the current period. This may be true, but it is possible that the previously
administered treatment may have altered the patient in some manner, so that the patient will react differently to any
treatment administered from that time onward. An example is when a pharmaceutical treatment causes permanent
liver damage so that the patients metabolize future drugs differently. Another example occurs if the treatments are
different types of educational tests. Then subjects may be affected permanently by what they learned during the first
period.

How long of a washout period should there be?


In a trial involving pharmaceutical products, the length of the washout period usually is determined as some multiple of
the half-life of the pharmaceutical product within the population of interest. For example, an investigator might
implement a washout period equivalent to 5 (or more) times the length of the half-life of the drug concentration in the
blood. [Figure: half-life of a hypothetical drug]
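
A minimal sketch of that rule of thumb; the 12-hour half-life is hypothetical:

half_life_h = 12.0                      # hypothetical drug half-life, in hours
washout_h = 5 * half_life_h             # rule of thumb: at least 5 half-lives
remaining = 0.5 ** (washout_h / half_life_h)
print(f"a washout of {washout_h:.0f} h leaves {remaining:.1%} of the drug")   # about 3.1%

After five half-lives only about 3% of the drug remains in circulation, which is usually considered negligible for
carryover purposes.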

Actually, it is not the presence of carryover effects per se that leads to aliasing with direct treatment effects in the
AB|BA crossover, but rather the presence of differential carryover effects, i.e., the carryover effect due to treatment A
differs from the carryover effect due to treatment B. If the carryover effects for A and B are equivalent in the AB|BA
crossover design, then this common carryover effect is not aliased with the treatment difference. So, for crossover
designs, when the carryover effects are different from one another, this presents us with a significant problem.

In the example of the educational tests, differential carryover effects could occur if test A leads to more learning than
test B. Another situation where differential carryover effects may occur is in clinical trials where an active drug (A) is
compared to placebo (B) and the washout period is of inadequate length. The patients in the AB sequence might
experience a strong A carryover during the second period, whereas the patients in the BA sequence might experience a
weak B carryover during the second period.

The recommendation for crossover designs is to avoid the problems caused by differential carryover effects at all costs
by employing lengthy washout periods and/or designs where treatment and carryover are not aliased or confounded
with each other. It is always much more prudent to address a problem a priori by using a proper design rather than a
posteriori by applying a statistical analysis that may require unreasonable assumptions and/or perform unsatisfactorily.
You will see this later on in this lesson...

For example, one approach for the statistical analysis of the 2 × 2 crossover is to conduct a preliminary test for
differential carryover effects. If this is significant, then only the data from the first period are analyzed, because the first
period is free of carryover effects. Essentially, you would be throwing out half of your data!

If the preliminary test for differential carryover is not significant, then the data from both periods are analyzed in the
usual manner. Recent work, however, has revealed that this 2-stage analysis performs poorly because the unconditional
Type I error rate operates at a much higher level than desired. We won't go into the specific details here, but part of the
reason for this is that the test for differential carryover and the test for treatment differences in the first period are
highly correlated and do not act independently.

Even worse, this two-stage approach could lead to losing one-half of the data. If differential carryover effects are of
concern, then a better approach would be to use a study design that can account for them.

Prior to developing a general statistical model and investigating its implications, we require more
definitions.
 Definitions with a Crossover Design
First-order and Higher-order Carryover Effects
Within time period j, j = 2, ... , p, it is possible that there are carryover effects from treatments administered during
periods 1, ... , j - 1. Usually in period j we only consider first-order carryover effects (from period j - 1) because:

1. if first-order carryover effects are negligible, then higher-order carryover effects usually are negligible;
2. the designs needed for eliminating the aliasing between higher-order carryover effects and treatment effects are
very cumbersome and not practical. Therefore, we usually assume that these higher-order carryover effects are
negligible.

In actuality, the length of the washout periods between treatment administrations may be the determining factor as to
whether higher-order carryover effects should be considered. We focus on designs for dealing with first-order carryover
effects, but the development can be generalized if higher-order carryover effects need to be considered.

Uniformity
A crossover design is labeled as:

1. uniform within sequences if each treatment appears the same number of times within each sequence, and
2. uniform within periods if each treatment appears the same number of times within each period.

For example, AB/BA is uniform within sequences and within periods (each sequence and each period has 1 A and 1 B), while
ABA/BAB is uniform within periods but is not uniform within sequences, because the sequences differ in the numbers of A
and B.

If a design is uniform within sequences and uniform within periods, then it is said to be uniform. If the design is uniform
across periods, you will be able to remove the period effects. If the design is uniform across sequences, then you will
also be able to remove the sequence effects. An example of a uniform crossover is ABC/BCA/CAB.
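
A minimal sketch of these two uniformity checks, with each sequence written as a string of treatment letters:

def uniform_within_sequences(sequences):
    # Every treatment must appear the same number of times in each sequence
    treatments = set("".join(sequences))
    return all(len({seq.count(t) for t in treatments}) == 1 for seq in sequences)

def uniform_within_periods(sequences):
    # Every treatment must appear the same number of times in each period (column)
    treatments = set("".join(sequences))
    periods = ["".join(col) for col in zip(*sequences)]
    return all(len({p.count(t) for t in treatments}) == 1 for p in periods)

design = ["ABA", "BAB"]
print(uniform_within_periods(design))     # True: each period has one A and one B
print(uniform_within_sequences(design))   # False: ABA has two A's but one B
print(uniform_within_sequences(["ABC", "BCA", "CAB"]) and
      uniform_within_periods(["ABC", "BCA", "CAB"]))   # True: a uniform design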

Latin Squares
Latin squares historically have provided the foundation for r-period, r-treatment crossover designs because they yield
uniform crossover designs in that each treatment occurs only once within each sequence and once within each period. As
will be demonstrated later, Latin squares also serve as building blocks for other types of crossover designs. Latin squares
for 4-period, 4-treatment crossover designs are:

[Design 7] Period 1 Period 2 Period 3 Period 4


Sequence ABCD A B C D
Sequence BCDA B C D A
Sequence CDAB C D A B
Sequence DABC D A B C

and

[Design 8] Period 1 Period 2 Period 3 Period 4


Sequence ABCD A B C D
Sequence BDAC B D A C
Sequence CADB C A D B
Sequence DCBA D C B A

Balanced Designs
The Latin square in [Design 8] has an additional property that the Latin square in [Design 7] does not have: each
treatment precedes every other treatment the same number of times (once). For example, how many times is treatment
A followed by treatment B? Only once. How many times is treatment B followed by treatment A? Only once. This is an
advantageous property of [Design 8] that does not hold in [Design 7]. When this
occurs, as in [Design 8], the crossover design is said to be balanced with respect to first-order carryover effects.

The designs that are balanced with respect to first-order carryover effects are Designs 1, 2, 3, 5, 6, and 8.
When r is an even number, only 1 Latin square is needed to achieve balance in the r-period, r-treatment crossover.
When r is an odd number, 2 Latin squares are required. For example, the design in [Design 5] is a 6-sequence, 3-period,
3-treatment crossover design that is balanced with respect to first-order carryover effects because each treatment
precedes every other treatment twice.
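
Balance can be verified by tallying ordered pairs of consecutive treatments across all sequences. Here is a minimal
Python sketch (again with illustrative, non-standard function names): in a balanced design, every ordered pair of
distinct treatments occurs equally often.

    from collections import Counter
    from itertools import permutations

    def carryover_counts(design):
        # Tally how often treatment x is immediately followed by treatment y,
        # pooled across all sequences (first-order carryover pairs)
        pairs = Counter()
        for seq in design:
            for x, y in zip(seq, seq[1:]):
                pairs[(x, y)] += 1
        return pairs

    design7 = ["ABCD", "BCDA", "CDAB", "DABC"]
    design8 = ["ABCD", "BDAC", "CADB", "DCBA"]

    for name, design in [("Design 7", design7), ("Design 8", design8)]:
        counts = carryover_counts(design)
        pair_counts = [counts[p] for p in permutations("ABCD", 2)]
        print(name, "balanced:", len(set(pair_counts)) == 1)
    # Design 7 balanced: False (A->B occurs 3 times, but B->A never occurs)
    # Design 8 balanced: True  (every ordered pair of distinct treatments occurs once)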

Strongly Balanced Designs


A crossover design is said to be strongly balanced with respect to first-order carryover effects if each treatment precedes
every other treatment, including itself, the same number of times. A strongly balanced design can be constructed by
repeating the last period in a balanced design.

Here is an example:
[Design 9]       Period 1   Period 2   Period 3   Period 4   Period 5
Sequence ABCDD      A          B          C          D          D
Sequence BDACC      B          D          A          C          C
Sequence CADBB      C          A          D          B          B
Sequence DCBAA      D          C          B          A          A

This is a 4-sequence, 5-period, 4-treatment crossover design that is strongly balanced with respect to first-order
carryover effects because each treatment precedes every other treatment, including itself, once. Obviously, the
uniformity of the Latin square design disappears: the design in [Design 9] is no longer uniform within
sequences.
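
Strong balance extends the same pair tally to include a treatment preceding itself. A self-contained Python sketch
(the tally function is repeated from the sketch above so this block runs on its own):

    from collections import Counter
    from itertools import product

    def carryover_counts(design):
        # Ordered pairs of consecutive treatments, including x followed by x
        pairs = Counter()
        for seq in design:
            for x, y in zip(seq, seq[1:]):
                pairs[(x, y)] += 1
        return pairs

    design9 = ["ABCDD", "BDACC", "CADBB", "DCBAA"]
    counts = carryover_counts(design9)
    # Strongly balanced iff every ordered pair, including pairs such as (A, A),
    # occurs equally often; here each of the 16 pairs occurs exactly once
    pair_counts = [counts[p] for p in product("ABCD", repeat=2)]
    print("strongly balanced:", len(set(pair_counts)) == 1)  # True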

Uniform and Strongly Balanced Design


Latin squares yield uniform crossover designs, whereas strongly balanced designs constructed by replicating the last
period of a balanced design are not uniform. A design can, however, have both properties. The following 4-sequence,
4-period, 2-treatment crossover design is an example of a strongly balanced and uniform design.

[Design 10]      Period 1   Period 2   Period 3   Period 4
Sequence ABBA       A          B          B          A
Sequence BAAB       B          A          A          B
Sequence AABB       A          A          B          B
Sequence BBAA       B          B          A          A
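
Both properties can be confirmed in a few lines. Here is a compact, self-contained check in the same hypothetical
style as the sketches above:

    from collections import Counter
    from itertools import product

    design10 = ["ABBA", "BAAB", "AABB", "BBAA"]

    # Uniform: each treatment appears twice in every sequence and in every period
    expected = Counter({"A": 2, "B": 2})
    uniform = all(Counter(seq) == expected for seq in design10) and \
              all(Counter(col) == expected for col in zip(*design10))

    # Strongly balanced: the four ordered pairs (AA, AB, BA, BB) occur equally often
    pairs = Counter((x, y) for seq in design10 for x, y in zip(seq, seq[1:]))
    strongly_balanced = len({pairs[p] for p in product("AB", repeat=2)}) == 1

    print("uniform:", uniform, "strongly balanced:", strongly_balanced)  # True True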
Lesson 3: INTERNAL AND EXTERNAL VALIDITY
Internal and External Validity
1. The Logic of Internal Validity
Internal validity means the ability to eliminate alternative explanations of the dependent variable. Variables, other
than the treatment, that affect the dependent variable are threats to internal validity. They threaten the researcher's
ability to say that the treatment was the true causal factor producing change in the dependent variable. Thus, the logic
of internal validity is to rule out variables other than the treatment by controlling experimental conditions and
through experimental designs.

2. Threats to Internal Validity


1. Selection Bias
Selection bias is the threat that subjects will not form equivalent groups. It is a problem in designs
without random assignment. It occurs when subjects in one experimental group have a characteristic
that affects the dependent variable. For example, in an experiment on physical aggressiveness, the
treatment group unintentionally contains subjects who are football, rugby, and hockey players, whereas
the control group is made up of musicians, chess players, and painters. Another example is an
experiment on the ability of people to dodge heavy traffic. All subjects assigned to one group come from
rural areas, and all subjects in the other grew up in large cities. An examination of pretest scores helps a
researcher detect this threat, because no group differences are expected.

2. History
This is the threat that an event unrelated to the treatment will occur during the experiment and
influence the dependent variable. History effects are more likely in experiments that continue over a
long time period. For example, halfway through a two-week experiment to evaluate subject attitudes
toward space travel, a spacecraft explodes on the launch pad, killing the astronauts.

3. Maturation
This is the threat that some biological, psychological, or emotional process within the subjects and
separate from the treatment will change over time. Maturation is more common in experiments over
long time periods. For example, during an experiment on reasoning ability, subjects become bored and
sleepy and, as a result, score lower. Another example is an experiment on the styles of children's play
between grades 1 and 6. Play styles are affected by physical, emotional, and maturation changes that
occur as the children grow older, instead of or in addition to the effects of a treatment. Designs with a
pretest and control group help researchers determine whether maturation or history effects are
present, because both experimental and control groups will show similar changes over time.

4. Testing
Sometimes, the pretest measure itself affects an experiment. This testing effect threatens internal
validity because more than the treatment alone affects the dependent variable. The Solomon four-
group design helps a researcher detect testing effects. For example, a researcher gives students an
examination on the first day of class. The course is the treatment. He or she tests learning by giving the
same exam on the last day of class. If subjects remember the pretest questions and this affects what
they learned (i.e., paid attention to) or how they answered questions on the posttest, a testing effect is
present. If testing effects occur, a researcher cannot say that the treatment alone has affected the
dependent variable.

5. Instrumentation
This threat is related to stability reliability. It occurs when the instrument or dependent variable
measure changes during the experiment. For example, in a weight-loss experiment, the springs on the
scale weaken during the experiment, giving lower readings in the posttest.

6. Mortality
Mortality or attrition arises when some subjects do not continue throughout the experiment. Although
the word mortality means death, it does not necessarily mean that subjects have died. If a subset of
subjects leaves partway through an experiment, a researcher cannot know whether the results would
have been different had the subjects stayed. For example, a researcher begins a weight-loss program
with 50 subjects. At the end of the program, 30 remain, each of whom lost 5 pounds with no side effects.
The 20 who left could have differed from the 30 who stayed, changing the results. Maybe the program
was effective for those who left, and they withdrew after losing 25 pounds. Or perhaps the program
made subjects sick and forced them to quit. Researchers should notice and report the number of
subjects in each group during pretests and posttests to detect this threat to internal validity.

7. Statistical Regression
Statistical regression is not easy to grasp intuitively. It is a problem of extreme values: a tendency for
random errors to move group results toward the average. It can occur in two ways:
A. One situation arises when subjects are unusual with regard to the dependent variable.
Because they begin as unusual or extreme, subjects are unlikely to respond further in the
same direction. For example, a researcher wants to see whether violent films make people
act violently. He or she chooses a group of violent criminals from a high security prison,
gives them a pretest, shows violent films, and then administers a posttest. To the
researcher's shock, the criminals are slightly less violent after the film, whereas a control
group of non-prisoners who did not see the film are slightly more violent than before.
Because the violent criminals began at an extreme, it is unlikely that a treatment could make
them more violent; by random chance alone, they appear less extreme when measured a
second time.
B. A second situation involves a problem with the measurement instrument. If many subjects
score very high (at the ceiling) or very low (at the floor) on a variable, random chance alone
will produce a change between the pretest and the posttest. For example, a researcher gives
80 subjects a test, and 75 get perfect scores. He or she then gives a treatment to raise
scores. Because so many subjects already had perfect scores, random errors will reduce the
group average because those who got perfect scores can randomly move in only one
direction: getting some answers wrong. An examination of scores on pretests will help
researchers detect this threat to internal validity.
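
Regression toward the mean is easy to demonstrate by simulation. The hypothetical Python sketch below
assumes each observed score is a stable true score plus independent random error at pretest and posttest
(the population mean of 100, the error size, and the top-5% cutoff are arbitrary choices for illustration).
No treatment is applied at all, yet the subjects who scored highest on the pretest score lower, on average,
on the posttest.

    import random
    from statistics import mean

    random.seed(1)

    # Hypothetical population: observed score = true score + independent error
    n = 10_000
    true_scores = [random.gauss(100, 10) for _ in range(n)]
    pretest  = [t + random.gauss(0, 10) for t in true_scores]
    posttest = [t + random.gauss(0, 10) for t in true_scores]  # no treatment given

    # Select the subjects who were extreme (top 5%) on the pretest
    cutoff = sorted(pretest)[int(0.95 * n)]
    extreme = [i for i in range(n) if pretest[i] >= cutoff]

    print(f"extreme group, pretest mean:  {mean(pretest[i] for i in extreme):.1f}")
    print(f"extreme group, posttest mean: {mean(posttest[i] for i in extreme):.1f}")
    # The posttest mean falls back toward 100 purely by chance, because part of
    # each extreme pretest score was random error that does not recur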

8. Diffusion of Treatment or Contamination


Diffusion of treatment or contamination is the threat that subjects in different groups will communicate
with each other and learn about the other group's treatment. Researchers avoid it by isolating groups or having
subjects promise not to reveal anything to others who will become subjects. For example, subjects
participate in a daylong experiment on a new way to memorize words. During a break, treatment group
subjects tell those in the control group about the new way to memorize, which control group subjects
then use. A researcher needs outside information such as post-experiment interviews with subjects to
detect this threat.

9. Compensatory Behavior
Some experiments provide something of value to one group of subjects but not to another, and the
difference becomes known. The inequality may produce pressure to reduce differences, competitive
rivalry between groups, or resentful demoralization. All these types of compensatory behavior can affect
the dependent variable in addition to the treatment. For example, one school system receives a
treatment (longer lunch breaks) to produce gains in learning. Once the inequality is known, subjects in
the control group demand equal treatment and work extra hard to learn and overcome the inequality.
Another group becomes demoralized by the unequal treatment and withdraws from learning. It is
difficult to detect this threat unless outside information is used (see the earlier discussion of diffusion of
treatment).
10. Experimenter Expectancy
Although it is not always considered a traditional internal validity problem, the experimenter's behavior,
too, can threaten causal logic. A researcher may threaten internal validity, not by purposefully unethical
behavior but by indirectly communicating experimenter expectancy to subjects. Researchers may be
highly committed to the hypothesis and indirectly communicate desired findings to subjects. For
example, a researcher studying reactions toward the disabled deeply believes that females are more
sensitive toward the disabled than males are. Through eye contact, tone of voice, pauses, and other
nonverbal communication, the researcher unconsciously encourages female subjects to report positive
feelings toward the disabled; the researcher's nonverbal behavior is the opposite for male subjects.
Here is a way to detect experimenter expectancy. A researcher hires assistants and teaches them
experimental techniques. The assistants train subjects and test their learning ability. The researcher
gives the assistants fake transcripts and records showing that subjects in one group are honor students
and the others are failing, although in fact the subjects are identical. Experimenter expectancy is present
if the fake honor students, as a group, do much better than the fake failing students. A technique
commonly used by researchers to reduce the effects of experimenter expectancy is the double-blind
experiment.
Double-Blind Experiment
o The double-blind experiment is designed to control researcher expectancy. In it,
people who have direct contact with subjects do not know the details of the
hypothesis or the treatment. It is double blind because both the subjects and
those in contact with them are blind to details of the experiment.

3. External Validity and Field Experiments


o Even if an experimenter eliminates all concerns about internal validity, external validity remains a
potential problem. External validity is the ability to generalize experimental findings to events and
settings outside the experiment itself. If a study lacks external validity, its findings hold true only in
experiments, making them useless to both basic and applied science.
o Campbell and Stanley (1963) listed four threats to external validity:
1. Testing Effect
The effects of having been pretested may be sufficient to make the groups quite different from
untested people to whom the results of the study will be generalized.
2. Selection Effect
The rigorous criteria used to select subjects may limit generalizability. For example, in many
pharmacological studies the subjects cannot have any illness other than the one for which the
drug is intended. Although this eliminates the confounding effects of other illness, it also does
not represent the reality of multiple comorbidities, especially in people with multiple chronic
illnesses.
3. Experiment Effect
Being involved in a carefully designed and implemented experimental study can be a very
different experience from receiving the same treatment in ordinary care settings.
4. Multiple Treatment Effect
This threat occurs when the same subjects are exposed to more than one treatment (using the
subjects as their own comparison group, for example). Campbell and Stanley (1963) comment
that “the effects of prior treatments are not usually erasable.”

4. Realism in Experiments
Are experiments realistic? If not, will the effects be replicated outside the laboratory? Two forms of realism
can help us answer some of these questions.
1. Experimental Realism
The degree or impact of an experimental treatment or setting on subjects; it occurs when subjects are
caught up in the experiment and are truly influenced by it.
2. Mundane Realism
Asks: Is the experiment like the real world? Mundane realism mostly answers questions raised about
external validity. Two aspects of experiments can be generalized. One is from the subjects: Are the
subjects similar to the general population? Another aspect is generalizing from an artificial treatment to
everyday life: Is watching a violent horror movie in a classroom similar to watching similar shows over
the course of many years?
a. Reactivity
i. Subjects may react differently in an experiment than they would in real life because they
know that they are in a study.
ii. Types of Reactivity
1. Hawthorne Effect – a specific kind of reactivity. The name comes from a series
of experiments conducted by Elton Mayo at the Hawthorne plant of Western
Electric in Illinois during the 1920s. He serendipitously discovered that the act of
monitoring an individual may itself produce changes in the dependent variable.
2. Novelty Effect – another kind of reactivity that produces changes in the
dependent variable as a result of something new being introduced to the
subjects.
3. Demand Characteristics – subjects may pick up clues about the hypothesis or
goal of an experiment and they may change their behavior to what they think is
demanded of them.

5. Field Experiments
Experiments are also conducted in “real life” or field settings where a researcher has less control over the
experimental conditions. The amount of control varies on a continuum. At one end is the highly controlled
laboratory experiment, which takes place in a specialized setting or laboratory; at the opposite end is the field
experiment, which takes place in the "field": in natural settings such as a subway car, a liquor store, or a public
sidewalk. Subjects in field experiments are usually unaware that they are involved in an experiment and react in
a more natural way.

6. Practical Considerations


Every research technique has informal tricks of the trade. They are pragmatic and based on common sense but
account for the difference between the successful research projects of an experienced researcher and the
difficulties a novice researcher faces. Three are discussed here:
1. Planning and Pilot Tests
All social research requires planning, and most quantitative researchers use pilot tests. During the
planning phase of experimental research, a researcher thinks of alternative explanations or threats
to internal validity and how to avoid them. The researcher also develops a neat and well-organized
system for recording data. In addition, he or she should devote serious effort to pilot testing any
apparatus (e.g., computers, video cameras, tape recorders, etc.) that will be used in the treatment
situation, and he or she must train and pilot test confederates. After the pilot tests, the researcher
should interview the pilot subjects to uncover aspects of the experiment that need refinement.
2. Instructions to Subjects
Most experiments involve giving instructions to subjects to set the stage. A researcher should word
instructions carefully and follow a prepared script so that all subjects hear the same thing. This
ensures reliability. The instructions are also important in creating a realistic cover story when
deception is used.
3. Post-Experiment Interview
At the end of an experiment, the researcher should interview subjects for three reasons:
a. First, if deception was used, the researcher needs to debrief the subjects, telling them the
true purpose of the experiment and answering questions.
b. Second, he or she can learn what the subjects thought and how their definitions of the
situation affected their behavior.
c. Finally, he or she can explain the importance of not revealing the true nature of the
experiment to other potential subjects.

A Word on Ethics
Ethical considerations are a significant issue in experimental research because experimental research is intrusive (i.e., it
interferes). Treatments may involve placing people in contrived social settings and manipulating their feelings or
behaviors. Dependent variables may be what subjects say or do. The amount and type of intrusion is limited by ethical
standards. Researchers must be very careful if they place subjects in physical danger or in embarrassing or anxiety-
inducing situations. They must painstakingly monitor events and control what occurs. Deception is common in social
experiments, but it involves misleading or lying to subjects. Such dishonesty is not condoned for its own sake and is
acceptable only as a means to achieve a goal that cannot be achieved otherwise. Even for a worthy goal, deception
can be used only with restrictions. The amount and type of deception should not go beyond what is minimally necessary,
and subjects should be debriefed.

Any researcher conducting an experiment must ensure that the dignity and welfare of the subjects are maintained. The
American Psychological Association (APA) published the Ethical Principles in the Conduct of Research with Human
Participants in 1982. The document listed the following principles:
 In planning a study, the researcher must take responsibility to ensure that the study respects human values and
protects the rights of human subjects.
 The researcher should determine the degree of risk imposed on subjects by the study (e.g. stress on subjects,
subjects required to take drugs).
 The principal researcher is responsible for the ethical conduct of the study and for assistants or other
researchers involved.
 Before subjects participate in the study, the researcher should make their obligations and responsibilities
clear. The researcher should inform subjects of all aspects of the research that might influence their decision
to participate.
 If the researcher cannot tell everything about the experiment because it is too technical or it will affect the
study, then the researcher must inform subjects after the experiment.
 The researcher should respect the individual’s freedom to decline to participate in or withdraw from the
experiment at any time.
 The researcher should protect subjects from physical and mental discomfort, harm, and danger that may arise
from the experiment. If there are risks involved, the researcher must inform the subjects of that fact.
 Information obtained from the subjects in the experiment is confidential unless otherwise agreed upon. Data
should be reported as group performance and not as individual performance.
