Module Week 10–11: Biostatistics (Lecture)
Learning Outcomes: At the end of this module, you are expected to:
LEARNING CONTENT
Introduction:
In this section we will first discuss correlation analysis, which is used to quantify the association
between two continuous variables (e.g., between an independent and a dependent variable or between two
independent variables). Regression analysis is a related technique to assess the relationship between an
outcome variable and one or more risk factors or confounding variables. The outcome variable is also called
the response or dependent variable and the risk factors and confounders are called the predictors, or
explanatory or independent variables. In regression analysis, the dependent variable is denoted "y" and the
independent variables are denoted "x".
[NOTE: The term "predictor" can be misleading if it is interpreted as the ability to predict even beyond
the limits of the data. Also, the term "explanatory variable" might give an impression of a causal effect in a
situation in which inferences should be limited to identifying associations. The terms "independent" and
"dependent" variable are less subject to these interpretations as they do not strongly imply cause and effect.
Lesson Proper:
PEARSON CORRELATION
Introduction
The bivariate Pearson Correlation produces a sample correlation coefficient, r, which measures the strength
and direction of linear relationships between pairs of continuous variables. By extension, the Pearson
Correlation evaluates whether there is statistical evidence for a linear relationship among the same pairs of
variables in the population, represented by a population correlation coefficient, ρ ("rho"). This measure is
also known as:
Pearson’s correlation
Pearson product-moment correlation (PPMC)
Common Uses
The bivariate Pearson Correlation indicates the following:
Whether a statistically significant linear relationship exists between two continuous variables
The strength of a linear relationship (i.e., how close the relationship is to being a perfectly straight line)
The direction of a linear relationship (increasing or decreasing)
Note: The bivariate Pearson Correlation cannot address non-linear relationships or relationships among
categorical variables. If you wish to understand relationships that involve categorical variables and/or non-
linear relationships, you will need to choose another measure of association.
Note: The bivariate Pearson Correlation only reveals associations among continuous variables. The bivariate
Pearson Correlation does not provide any inferences about causation, no matter how large the correlation
coefficient is.
Data Requirements
To use Pearson correlation, your data must meet the following requirements:
Two or more continuous variables (i.e., interval or ratio level)
Cases with non-missing values on both variables
A linear relationship between the variables
Independent cases (i.e., independence of observations)
Bivariate normality
A random sample of data from the population
No outliers
Hypotheses
The null hypothesis (H0) and alternative hypothesis (H1) of the significance test for correlation can be
expressed in the following ways, depending on whether a one-tailed or two-tailed test is requested:
Two-tailed significance test:
H0: ρ = 0 ("the population correlation coefficient is 0; there is no association")
H1: ρ ≠ 0 ("the population correlation coefficient is not 0; a nonzero correlation could exist")
One-tailed significance test:
H0: ρ = 0 ("the population correlation coefficient is 0; there is no association")
H1: ρ > 0 ("the population correlation coefficient is greater than 0; a positive correlation could exist")
OR
H1: ρ < 0 ("the population correlation coefficient is less than 0; a negative correlation could exist")
Test Statistic
The sample correlation coefficient between two variables x and y is denoted r or rxy, and can be computed as:
$$ r_{xy} = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)} \cdot \sqrt{\mathrm{var}(y)}} $$
where cov(x, y) is the sample covariance of x and y; var(x) is the sample variance of x; and var(y) is the
sample variance of y.
Correlation can take on any value in the range [-1, 1]. The sign of the correlation coefficient indicates the
direction of the relationship, while the magnitude of the correlation (how close it is to -1 or +1) indicates the
strength of the relationship.
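Although this module carries out the computation in SPSS, the formula above is easy to verify directly. Below is a minimal Python sketch (the data values are hypothetical, not the module's dataset; NumPy is assumed) showing that r is just the covariance scaled by the two standard deviations:

import numpy as np

# Hypothetical height/weight-style values (not the module's dataset)
x = np.array([61.0, 64.5, 66.0, 68.2, 70.1, 72.4])
y = np.array([120.0, 135.5, 142.0, 155.3, 163.8, 177.2])

cov_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance of x and y
var_x = np.var(x, ddof=1)             # sample variance of x
var_y = np.var(y, ddof=1)             # sample variance of y

r = cov_xy / np.sqrt(var_x * var_y)
print(r)                              # manual computation of r
print(np.corrcoef(x, y)[0, 1])        # NumPy's built-in r agrees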
-1 : perfectly negative linear relationship
0 : no relationship
+1 : perfectly positive linear relationship
The strength can be assessed by these general guidelines (which may vary by discipline):
.1 < |r| < .3 : small / weak correlation
.3 < |r| < .5 : medium / moderate correlation
.5 < |r| : large / strong correlation
Note: The direction and strength of a correlation are two distinct properties. The scatterplots below show
correlations of r = +0.90, r = 0.00, and r = -0.90, respectively. The strength of the nonzero correlations
is the same: 0.90. But the direction of the correlations is different: a negative correlation corresponds to a
decreasing relationship, while a positive correlation corresponds to an increasing relationship.
Note that the r = 0.00 correlation has no discernible increasing or decreasing linear pattern in this particular
graph. However, keep in mind that Pearson correlation is only capable of detecting linear associations, so it is
possible to have a pair of variables with a strong nonlinear relationship and a small Pearson correlation
coefficient. It is good practice to create scatterplots of your variables to corroborate your correlation
coefficients.
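To see why scatterplots matter, consider a short Python sketch (an illustrative example, not part of the module's SPSS workflow) in which the relationship is perfect but curved, so Pearson's r comes out near zero:

import numpy as np
from scipy import stats

# y = x**2 is a perfect, but nonlinear, relationship
x = np.linspace(-3, 3, 101)
y = x ** 2

r, p = stats.pearsonr(x, y)
print(round(r, 4))   # essentially 0: Pearson misses the curved association

A scatterplot of x against y would reveal the parabola immediately, which is exactly why plotting before correlating is good practice.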
Data Set-Up
Your dataset should include two or more continuous numeric variables, each defined as scale, which will be
used in the analysis.
Each row in the dataset should represent one unique subject, person, or unit. All of the measurements taken
on that person or unit should appear in that row. If measurements for one subject appear on multiple rows -- for
example, if you have measurements from different time points on separate rows -- you should reshape your
data to "wide" format before you compute the correlations.
Run a Bivariate Pearson Correlation
To run a bivariate Pearson Correlation in SPSS, click Analyze > Correlate > Bivariate.
The Bivariate Correlations window opens, where you will specify the variables to be used in the analysis. All of
the variables in your dataset appear in the list on the left side. To select variables for the analysis, select the
variables in the list on the left and click the blue arrow button to move them to the right, in the Variables field.
Problem Statement
Perhaps you would like to test whether there is a statistically significant linear relationship between two
continuous variables, weight (in lbs) and height (in inches) (and by extension, infer whether the association is
significant in the population). You can use a bivariate Pearson Correlation to test whether there is a statistically
significant linear relationship between height and weight, and to determine the strength and direction of the
association.
In the sample data, we will use two variables: “Height” and “Weight.” The variable “Height” is a continuous
measure of height in inches and exhibits a range of values from 55.00 to 84.41 (Analyze > Descriptive
Statistics > Descriptives). The variable “Weight” is a continuous measure of weight in pounds and exhibits a
range of values from 101.71 to 350.07.
Before we look at the Pearson correlations, we should look at the scatterplots of our variables to get an idea of
what to expect. In particular, we need to determine if it's reasonable to assume that our variables have linear
relationships. Click Graphs > Legacy Dialogs > Scatter/Dot. In the Scatter/Dot window, click Simple
Scatter, then click Define. Move variable Height to the X Axis box, and move variable Weight to the Y Axis
box. When finished, click OK.
To add a linear fit like the one depicted, double-click on the plot in the Output Viewer to open the Chart Editor.
Click Elements > Fit Line at Total. In the Properties window, make sure the Fit Method is set to Linear, then
click Apply.
From the scatterplot, we can see that as height increases, weight also tends to increase. There does appear to
be some linear relationship.
To run the bivariate Pearson Correlation, click Analyze > Correlate > Bivariate. Select the variables Height and
Weight and move them to the Variables box. In the Correlation Coefficients area, select Pearson. In the Test of
Significance area, select your desired significance test, two-tailed or one-tailed. We will select a two-tailed
significance test in this example. Check the box next to Flag significant correlations.
Click OK to run the bivariate Pearson Correlation. Output for the analysis will display in the Output Viewer.
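For reference, the same two-tailed test can be reproduced outside SPSS. Here is a minimal Python sketch using scipy.stats.pearsonr; the DataFrame values are hypothetical stand-ins for the module's Height/Weight data:

import pandas as pd
from scipy import stats

# Hypothetical stand-in for the dataset (normally loaded from a file)
df = pd.DataFrame({
    "Height": [62.0, 65.1, 67.3, 70.0, 72.8, None],
    "Weight": [118.2, 139.0, 151.4, 165.7, None, 180.3],
})

complete = df[["Height", "Weight"]].dropna()   # keep complete cases only
r, p = stats.pearsonr(complete["Height"], complete["Weight"])
print(f"r = {r:.3f}, two-tailed p = {p:.3f}")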
Output
Tables
The correlations in the main diagonal (cells A and D) are all equal to 1. This is because a variable is always
perfectly correlated with itself. Notice, however, that the sample sizes are different in cell A (n=408) versus cell
D (n=376). This is because of missing data -- there are more missing observations for variable Weight than
there are for variable Height.
If you have opted to flag significant correlations, SPSS will mark a 0.05 significance level with one asterisk (*)
and a 0.01 significance level with two asterisks (**). In cell B (repeated in cell C), we can see that the
Pearson correlation coefficient for height and weight is .513, which is significant (p < .001 for a two-tailed test),
based on 354 complete observations (i.e., cases with nonmissing values for both height and weight).
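The unequal sample sizes arise because the correlation uses only cases with values on both variables. A short pandas sketch (hypothetical values) illustrates how the per-variable Ns and the complete-case N can differ:

import pandas as pd

# Hypothetical data with missing values, mirroring the unequal Ns above
df = pd.DataFrame({
    "Height": [62.0, 65.1, 67.3, 70.0, 72.8, 61.5],
    "Weight": [118.2, 139.0, None, 165.7, None, 121.0],
})

print(df.count())            # per-variable N (the diagonal cells A and D)
print(df.corr())             # Pearson correlations on pairwise-complete cases
print(df.dropna().shape[0])  # N of complete cases used for the off-diagonal r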
SIMPLE REGRESSION
Introduction
Linear regression is the next step up after correlation. It is used when we want to predict the value of a variable
based on the value of another variable. The variable we want to predict is called the dependent variable (or
sometimes, the outcome variable). The variable we are using to predict the other variable's value is called the
independent variable (or sometimes, the predictor variable). For example, you could use linear regression to
understand whether exam performance can be predicted based on revision time; whether cigarette
consumption can be predicted based on smoking duration; and so forth. If you have two or more independent
variables, rather than just one, you need to use multiple regression.
Data Requirements
When you choose to analyze your data using linear regression, part of the process involves checking to make
sure that the data you want to analyze can actually be analyzed using linear regression. You need to do this
because it is only appropriate to use linear regression if your data "passes" six assumptions that are required
for linear regression to give you a valid result. In practice, checking for these six assumptions just adds a little
bit more time to your analysis, requiring you to click a few more buttons in SPSS Statistics when performing
your analysis, as well as think a little bit more about your data, but it is not a difficult task.
Before we introduce you to these six assumptions, do not be surprised if, when analyzing your own data using
SPSS Statistics, one or more of these assumptions is violated (i.e., not met). This is not uncommon when
working with real-world data rather than textbook examples, which often only show you how to carry out linear
regression when everything goes well! However, don’t worry. Even when your data fails certain assumptions,
there is often a solution to overcome this. First, let’s take a look at these six assumptions:
Assumption #1: Your two variables should be measured at the continuous level (i.e., they are either interval or
ratio variables). Examples of continuous variables include revision time (measured in hours), intelligence
(measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so
forth.
Assumption #2: There needs to be a linear relationship between the two variables. Your scatterplot may look
something like one of the following:
Assumption #3: There should be no significant outliers, since a single outlier can have a large effect on the
fitted line.
Assumption #4: Your observations should be independent of one another.
Assumption #5: Your data needs to show homoscedasticity, which is where the variances along the line of best
fit remain similar as you move along the line.
Assumption #6: The residuals (errors) of the regression line should be approximately normally distributed.
(A short sketch of how assumptions #5 and #6 can be inspected follows this list.)
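As noted above, here is a minimal Python sketch of how assumptions #5 and #6 might be inspected; the simulated data, the use of SciPy, and the specific checks are illustrative assumptions, not the module's SPSS procedure:

import numpy as np
from scipy import stats

# Hypothetical stand-in sample for the Height/Weight data
rng = np.random.default_rng(1)
height = rng.uniform(55, 84, size=100)
weight = 4.5 * height - 150 + rng.normal(0, 15, size=100)

# Fit the line, then work with the residuals
res = stats.linregress(height, weight)
residuals = weight - (res.intercept + res.slope * height)

# Assumption #6: residuals approximately normal (Shapiro-Wilk test)
print(stats.shapiro(residuals))

# Assumption #5 (homoscedasticity) is usually judged visually: plot the
# residuals against the fitted values and look for roughly constant spread.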
Example Problem
In the previous problem, you want to determine how height predicts the weight of students.
Steps
Step 1: Click Analyze > Regression > Linear... on the top menu, as shown below:
Step 2: Move the independent variable, Height, into the Independent(s) box and the dependent variable,
Weight, into the Dependent box.
Step 3: Click OK to run the regression. Output for the analysis will display in the Output Viewer.
The first table of interest is the Model Summary table, as shown below:
This table provides the R and R2 values. The R value represents the simple correlation and is 0.513 (the "R"
Column), which indicates a high degree of correlation. The R² value (the "R Square" column) indicates how
much of the total variation in the dependent variable, Weight, can be explained by the independent variable,
Height. In this case, 26.3%.
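Since simple regression has a single predictor, R² here is just the square of the Pearson correlation obtained earlier: 0.513² ≈ 0.263, which is where the 26.3% figure comes from.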
The next table is the ANOVA table, which reports how well the regression equation fits the data (i.e., predicts
the dependent variable) and is shown below:
This table indicates that the regression model predicts the dependent variable significantly well. How do we
know this? Look at the "Regression" row and go to the "Sig." column. This indicates the statistical significance
of the regression model that was run. Here, p < 0.001, which is less than 0.05, and indicates that, overall, the
regression model statistically significantly predicts the outcome variable (i.e., it is a good fit for the data).
The Coefficients table provides us with the necessary information to predict weight from height, as well as
determine whether height contributes statistically significantly to the model (by looking at the "Sig." column).
Furthermore, we can use the values in the "B" column under the "Unstandardized Coefficients" column to
present the regression equation in the form:
predicted Weight = b0 + b1 × (Height)
where b0 is the constant (intercept) and b1 is the coefficient for Height, as shown below:
Example: Predict the weight of X (in lbs.) if his height (in inches) is 72.
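A Python sketch of this prediction step is shown below. The data pairs are hypothetical; in the actual exercise, b0 and b1 would be taken straight from the SPSS Coefficients table rather than refitted:

from scipy import stats

# Hypothetical height (in) / weight (lb) pairs, standing in for the dataset
height = [60.2, 63.5, 66.1, 68.4, 71.0, 73.8]
weight = [115.0, 132.4, 148.9, 160.2, 174.5, 189.1]

res = stats.linregress(height, weight)
b0, b1 = res.intercept, res.slope     # the "B" column: constant and slope

predicted = b0 + b1 * 72              # predicted weight at height = 72 inches
print(f"Weight = {b0:.2f} + {b1:.2f} x Height  ->  {predicted:.1f} lbs")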
***END OF LESSON***