
CORRESPONDENCE LEARNING MODULE

MELS 1073 (Biostatistics & Epidemiology) – Lecture


AY 2022-2023

Lesson 10: Regression and Correlation Analysis

Topic: Simple Regression and Correlation Analysis

Learning Outcomes: At the end of this module, you are expected to:

1. Discuss the importance of regression and correlation analysis in health research.


2. Evaluate correctly the results of the statistical test.

LEARNING CONTENT

Introduction:

In this section we will first discuss correlation analysis, which is used to quantify the association
between two continuous variables (e.g., between an independent and a dependent variable or between two
independent variables). Regression analysis is a related technique to assess the relationship between an
outcome variable and one or more risk factors or confounding variables. The outcome variable is also called
the response or dependent variable and the risk factors and confounders are called the predictors, or
explanatory or independent variables. In regression analysis, the dependent variable is denoted "y" and the
independent variables are denoted by "x".

[NOTE: The term "predictor" can be misleading if it is interpreted as the ability to predict even beyond
the limits of the data. Also, the term "explanatory variable" might give an impression of a causal effect in a
situation in which inferences should be limited to identifying associations. The terms "independent" and
"dependent" variable are less subject to these interpretations, as they do not strongly imply cause and effect.]

Lesson Proper:

PEARSON CORRELATION

Introduction

The bivariate Pearson Correlation produces a sample correlation coefficient, r, which measures the strength
and direction of linear relationships between pairs of continuous variables. By extension, the Pearson
Correlation evaluates whether there is statistical evidence for a linear relationship among the same pairs of

variables in the population, represented by a population correlation coefficient, ρ (“rho”). The Pearson
Correlation is a parametric measure.

This measure is also known as:

 Pearson’s correlation
 Pearson product-moment correlation (PPMC)

Common Uses

The bivariate Pearson Correlation is commonly used to measure the following:

 Correlations among pairs of variables


 Correlations within and between sets of variables

The bivariate Pearson correlation indicates the following:

 Whether a statistically significant linear relationship exists between two continuous variables
 The strength of a linear relationship (i.e., how close the relationship is to being a perfectly straight line)
 The direction of a linear relationship (increasing or decreasing)

Note: The bivariate Pearson Correlation cannot address non-linear relationships or relationships among
categorical variables. If you wish to understand relationships that involve categorical variables and/or non-
linear relationships, you will need to choose another measure of association.

Note: The bivariate Pearson Correlation only reveals associations among continuous variables. The bivariate
Pearson Correlation does not provide any inferences about causation, no matter how large the correlation
coefficient is.

Data Requirements

Your data must meet the following requirements:

1. Two or more continuous variables (i.e., interval or ratio level)


2. Cases that have values on both variables
3. Linear relationship between the variables
4. Independent cases (i.e., independence of observations)
a. There is no relationship between the values of variables between cases. This means that:
i. the values for all variables across cases are unrelated
ii. for any case, the value for any variable cannot influence the value of any variable for
other cases
iii. no case can influence another case on any variable
b. The bivariate Pearson correlation coefficient and corresponding significance test are not robust
when independence is violated.
5. Bivariate normality
a. Each pair of variables is bivariately normally distributed
b. Each pair of variables is bivariately normally distributed at all levels of the other variable(s)
c. This assumption ensures that the variables are linearly related; violations of this assumption
may indicate that non-linear relationships among variables exist. Linearity can be assessed
visually using a scatterplot of the data.
6. Random sample of data from the population
7. No outliers

Hypotheses

The null hypothesis (H0) and alternative hypothesis (H1) of the significance test for correlation can be
expressed in the following ways, depending on whether a one-tailed or two-tailed test is requested:

Two-tailed significance test:

H0: ρ = 0 ("the population correlation coefficient is 0; there is no association")

H1: ρ ≠ 0 ("the population correlation coefficient is not 0; a nonzero correlation could exist")

One-tailed significance test:

H0: ρ = 0 ("the population correlation coefficient is 0; there is no association")

H1: ρ > 0 ("the population correlation coefficient is greater than 0; a positive correlation could exist")

OR

H1: ρ < 0 ("the population correlation coefficient is less than 0; a negative correlation could exist")

where ρ is the population correlation coefficient.
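As a sketch outside SPSS, the two-tailed test of H0: ρ = 0 can also be run in Python with SciPy. The height and weight values below are simulated for illustration only; they are not the module's data set.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 100 simulated subjects, heights in inches and
# weights in pounds, with a built-in positive linear association.
rng = np.random.default_rng(42)
height = rng.normal(66, 4, size=100)
weight = 4.0 * height - 90 + rng.normal(0, 20, size=100)

# Two-tailed test of H0: rho = 0. (Recent SciPy versions also accept
# alternative='greater' or alternative='less' for the one-tailed tests.)
r, p_two = stats.pearsonr(height, weight)
```

With a genuine linear association built into the data, r comes out clearly positive and the two-tailed p-value is small.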

Test Statistic

The sample correlation coefficient between two variables x and y is denoted r or rxy, and can be computed as:

r = cov(x, y) / sqrt( var(x) · var(y) )

where cov(x, y) is the sample covariance of x and y; var(x) is the sample variance of x; and var(y) is the
sample variance of y.
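The formula can be checked numerically; the small height/weight vectors below are made-up values for illustration:

```python
import numpy as np

x = np.array([60.0, 62.0, 65.0, 68.0, 70.0, 72.0])        # heights (inches)
y = np.array([115.0, 120.0, 135.0, 155.0, 165.0, 180.0])  # weights (lbs)

# r = cov(x, y) / sqrt(var(x) * var(y)), using sample statistics (ddof=1)
r_manual = np.cov(x, y, ddof=1)[0, 1] / np.sqrt(
    np.var(x, ddof=1) * np.var(y, ddof=1)
)

# NumPy's built-in correlation should agree with the manual calculation
r_builtin = np.corrcoef(x, y)[0, 1]
```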

Correlation can take on any value in the range [-1, 1]. The sign of the correlation coefficient indicates the
direction of the relationship, while the magnitude of the correlation (how close it is to -1 or +1) indicates the
strength of the relationship.
-1 : perfectly negative linear relationship

0 : no relationship

+1 : perfectly positive linear relationship

The strength can be assessed by these general guidelines (which may vary by discipline):

.1 < | r | < .3 … small / weak correlation

.3 < | r | < .5 … medium / moderate correlation

| r | > .5 ……… large / strong correlation

Note: The direction and strength of a correlation are two distinct properties. The scatterplots below show
correlations of r = +0.90, r = 0.00, and r = -0.90, respectively. The strength of the nonzero correlations
is the same: 0.90. But the directions of the correlations differ: a negative correlation corresponds to a
decreasing relationship, while a positive correlation corresponds to an increasing relationship.

Note that the r = 0.00 correlation has no discernable increasing or decreasing linear pattern in this particular
graph. However, keep in mind that Pearson correlation is only capable of detecting linear associations, so it is
possible to have a pair of variables with a strong nonlinear relationship and a small Pearson correlation
coefficient. It is good practice to create scatterplots of your variables to corroborate your correlation
coefficients.
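The caveat about nonlinear relationships is easy to demonstrate. With y deterministically equal to x² over a symmetric range (illustrative values only), the Pearson correlation is essentially zero even though the two variables are perfectly related:

```python
import numpy as np

x = np.linspace(-3, 3, 61)
y = x ** 2                        # a perfect, but nonlinear, relationship
r = np.corrcoef(x, y)[0, 1]
# r is ~0: Pearson correlation only detects *linear* association,
# which is why a scatterplot should always accompany the coefficient
```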

Data Set-Up

Your dataset should include two or more continuous numeric variables, each defined as scale, which will be
used in the analysis.

Each row in the dataset should represent one unique subject, person, or unit. All of the measurements taken
on that person or unit should appear in that row. If measurements for one subject appear on multiple rows -- for
example, if you have measurements from different time points on separate rows -- you should reshape your
data to "wide" format before you compute the correlations.
Run a Bivariate Pearson Correlation

To run a bivariate Pearson Correlation in SPSS, click Analyze > Correlate > Bivariate.

The Bivariate Correlations window opens, where you will specify the variables to be used in the analysis. All of
the variables in your dataset appear in the list on the left side. To select variables for the analysis, select the
variables in the list on the left and click the blue arrow button to move them to the right, in the Variables field.

Example: Understanding the linear association between weight and height

Problem Statement

Perhaps you would like to test whether there is a statistically significant linear relationship between two
continuous variables, weight (in lbs) and height (in inches) (and by extension, infer whether the association is
significant in the population). You can use a bivariate Pearson Correlation to test whether there is a statistically
significant linear relationship between height and weight, and to determine the strength and direction of the
association.

Download the data set here:


https://drive.google.com/file/d/10_obUnx3VlHOfiV0EWBK3dRtzq_7zbEm/view?usp=sharing

Before the Test

In the sample data, we will use two variables: “Height” and “Weight.” The variable “Height” is a continuous
measure of height in inches and exhibits a range of values from 55.00 to 84.41 (Analyze > Descriptive
Statistics > Descriptives). The variable “Weight” is a continuous measure of weight in pounds and exhibits a
range of values from 101.71 to 350.07.

Before we look at the Pearson correlations, we should look at the scatterplots of our variables to get an idea of
what to expect. In particular, we need to determine if it's reasonable to assume that our variables have linear
relationships. Click Graphs > Legacy Dialogs > Scatter/Dot. In the Scatter/Dot window, click Simple
Scatter, then click Define. Move variable Height to the X Axis box, and move variable Weight to the Y Axis
box. When finished, click OK.

To add a linear fit like the one depicted, double-click on the plot in the Output Viewer to open the Chart Editor.
Click Elements > Fit Line at Total. In the Properties window, make sure the Fit Method is set to Linear, then



click Apply. (Notice that adding the linear regression trend line will also add the R-squared value in the margin
of the plot. If we take the square root of this number, it should match the value of the Pearson correlation we
obtain.)

From the scatterplot, we can see that as height increases, weight also tends to increase. There does appear to
be some linear relationship.

Running the Test

To run the bivariate Pearson Correlation, click Analyze > Correlate > Bivariate. Select the variables Height and
Weight and move them to the Variables box. In the Correlation Coefficients area, select Pearson. In the Test of
Significance area, select your desired significance test, two-tailed or one-tailed. We will select a two-tailed
significance test in this example. Check the box next to Flag significant correlations.

Click OK to run the bivariate Pearson Correlation. Output for the analysis will display in the Output Viewer.

Output

Tables

The results will display the correlations in a table, labeled Correlations.



The important cells we want to look at are either B or C. (Cells B and C are identical, because they include
information about the same pair of variables.) Cells B and C contain the correlation coefficient for the
correlation between height and weight, its p-value, and the number of complete pairwise observations that the
calculation was based on.

The correlations in the main diagonal (cells A and D) are all equal to 1. This is because a variable is always
perfectly correlated with itself. Notice, however, that the sample sizes are different in cell A (n=408) versus cell
D (n=376). This is because of missing data -- there are more missing observations for variable Weight than
there are for variable Height.

If you have opted to flag significant correlations, SPSS will mark correlations significant at the 0.05 level
with one asterisk (*) and correlations significant at the 0.01 level with two asterisks (**). In cell B (repeated
in cell C), we can see that the Pearson correlation coefficient for height and weight is .513, which is significant
(p < .001 for a two-tailed test), based on 354 complete observations (i.e., cases with nonmissing values for both
height and weight).

Decision and Conclusions

Based on the results, we can state the following:



 Weight and height have a statistically significant linear relationship (r=.513, p < .001).
 The direction of the relationship is positive (i.e., height and weight are positively correlated), meaning
that these variables tend to increase together (i.e., greater height is associated with greater weight).
 The magnitude, or strength, of the association is moderately strong: | r | = .513 sits right at the
boundary between the moderate (.3 < | r | < .5) and strong (| r | > .5) ranges.
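The reported p-value can be sanity-checked from r and n alone, since the significance test for a Pearson correlation uses the statistic t = r·sqrt(n − 2) / sqrt(1 − r²) on n − 2 degrees of freedom. Using the r = .513 and n = 354 figures from the output:

```python
import math
from scipy import stats

r, n = 0.513, 354
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
p_two_tailed = 2 * stats.t.sf(abs(t), df=n - 2)
# t is roughly 11.2, so the two-tailed p-value is far below .001,
# consistent with the "p < .001" reported in the SPSS output
```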

SIMPLE REGRESSION

Introduction

Linear regression is the next step up after correlation. It is used when we want to predict the value of a variable
based on the value of another variable. The variable we want to predict is called the dependent variable (or
sometimes, the outcome variable). The variable we are using to predict the other variable's value is called the
independent variable (or sometimes, the predictor variable). For example, you could use linear regression to
understand whether exam performance can be predicted based on revision time; whether cigarette
consumption can be predicted based on smoking duration; and so forth. If you have two or more independent
variables, rather than just one, you need to use multiple regression.

Data Requirements

When you choose to analyze your data using linear regression, part of the process involves checking to make
sure that the data you want to analyze can actually be analyzed using linear regression. You need to do this
because it is only appropriate to use linear regression if your data "passes" six assumptions that are required
for linear regression to give you a valid result. In practice, checking for these six assumptions just adds a little
bit more time to your analysis, requiring you to click a few more buttons in SPSS Statistics when performing
your analysis, as well as think a little bit more about your data, but it is not a difficult task.

Before we introduce you to these six assumptions, do not be surprised if, when analyzing your own data using
SPSS Statistics, one or more of these assumptions is violated (i.e., not met). This is not uncommon when
working with real-world data rather than textbook examples, which often only show you how to carry out linear
regression when everything goes well! However, don’t worry. Even when your data fails certain assumptions,
there is often a solution to overcome this. First, let’s take a look at these six assumptions:

Assumption #1: Your two variables should be measured at the continuous level (i.e., they are either interval or
ratio variables). Examples of continuous variables include revision time (measured in hours), intelligence
(measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so
forth.

Assumption #2: There needs to be a linear relationship between the two variables. Your scatterplot may look
something like one of the following:



Assumption #3: There should be no significant outliers.

Assumption #4: You should have independence of observations.

Assumption #5: Your data needs to show homoscedasticity, which is where the variances along the line of best
fit remain similar as you move along the line.



Assumption #6: Finally, you need to check that the residuals (errors) of the regression line are approximately
normally distributed.
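As a rough screen outside SPSS, assumptions #5 and #6 can be checked from the residuals of a fitted line. This sketch uses simulated data and SciPy's Shapiro–Wilk test for residual normality:

```python
import numpy as np
from scipy import stats

# Simulated heights (inches) and weights (lbs) for illustration only
rng = np.random.default_rng(1)
x = rng.normal(66, 4, size=200)
y = 4.0 * x - 90 + rng.normal(0, 15, size=200)

slope, intercept = np.polyfit(x, y, 1)       # least-squares line of best fit
residuals = y - (intercept + slope * x)

# Normality of residuals (assumption #6): a large p-value means
# no evidence against normality
w_stat, p_norm = stats.shapiro(residuals)

# For homoscedasticity (assumption #5), plot residuals against x and
# look for a roughly constant spread as you move along the line.
```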

Example Problem

In the previous problem, you want to determine how height predicts the weight of students.

Steps

Step 1: Click Analyze > Regression > Linear... on the top menu, as shown below:



You will be presented with the Linear Regression dialogue box:



Step 2: Transfer the independent variable, Height, into the Independent(s): box and the dependent variable,
Weight, into the Dependent: box. You can do this by either drag-and-dropping the variables or by using the
appropriate Right arrow buttons. You will end up with the following screen:



Step 3: You now need to check four of the assumptions discussed in the Assumptions section above: no
significant outliers (assumption #3); independence of observations (assumption #4); homoscedasticity
(assumption #5); and normal distribution of errors/residuals (assumption #6). You can do this by using the
Statistics and Plots features, and then selecting the appropriate options within these two dialogue boxes to
test whether your data meets these four assumptions.

Step 4: Click on the OK button. This will generate the results.



Results, Decision and Conclusion

The first table of interest is the Model Summary table, as shown below:

This table provides the R and R2 values. The R value represents the simple correlation and is 0.513 (the "R"
column), which indicates a fairly strong correlation. The R2 value (the "R Square" column) indicates how
much of the total variation in the dependent variable, Weight, can be explained by the independent variable,
Height. In this case, 26.3% (0.513² ≈ 0.263).
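In simple regression with a single predictor, R Square is just the Pearson correlation squared, which ties the two analyses in this module together:

```python
r = 0.513              # Pearson correlation from the earlier output
r_squared = r ** 2     # ≈ 0.263, i.e., Height explains about 26.3% of
                       # the variation in Weight
```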

The next table is the ANOVA table, which reports how well the regression equation fits the data (i.e., predicts
the dependent variable) and is shown below:

This table indicates that the regression model predicts the dependent variable significantly well. How do we
know this? Look at the "Regression" row and go to the "Sig." column. This indicates the statistical significance
of the regression model that was run. Here, p < 0.001, which is less than 0.05, and indicates that, overall, the
regression model statistically significantly predicts the outcome variable (i.e., it is a good fit for the data).

The Coefficients table provides us with the necessary information to predict weight from height, as well as
determine whether height contributes statistically significantly to the model (by looking at the "Sig." column).
Furthermore, we can use the values in the "B" column under the "Unstandardized Coefficients" column, as
shown below:



to present the regression equation as:

Weight = -97 + 4.087(Height)

Example: Predict the weight of X (in lbs.) if his height (in inches) is 72.

Weight = -97 + 4.087(Height)

Weight = -97 + 4.087(72)

Weight = -97 + 294.264

Weight = 197.26 lbs.
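The arithmetic above can be packaged as a small helper function (the −97 intercept and 4.087 slope are taken from the Coefficients table):

```python
def predict_weight(height_in: float) -> float:
    """Predicted weight (lbs) from height (inches), per the fitted model."""
    return -97 + 4.087 * height_in

predicted = predict_weight(72)   # ≈ 197.26 lbs, matching the worked example
```

Note that such predictions are only trustworthy within the range of heights observed in the data used to fit the model.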

***END OF LESSON***

